L20) Inference

“Generation-Recognition Asymmetry” Series

This article accompanies the research paper The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory (📄 arXiv).
Reading path: L13 · L14 · L15 · L16 · L17 · L18 · L19 · L20 — or the complete index.

The Third Direction

When the Grammar Itself is Unknown

Generating assumes that we know the rules. Recognizing does too. But what happens when we only have examples — and we have to guess the rules? This third use of a grammar is the most difficult of the three. To the point that from positive examples alone, it is quite simply impossible.

Where Does This Article Fit In?

L13 established the triad: a grammar can be used to generate, to recognize, or to infer. The previous articles explored the first two directions. This one tackles the third — the least studied, the hardest — and shows that it is not a separate domain, but the extreme end of a continuum of which recognition occupies the other end. This is dimension D5 of the research paper that this series popularizes.

Why It Matters

Grammatical inference is the problem of acquisition: how a child learns their language, how a system discovers the structure of a corpus without being given it. In music, it is also the computational analyst’s dream: to extract the grammar of a style from a corpus of works (M10).

Yet this problem has a deceptive reputation. We imagine it as “difficult but feasible.” Theory says otherwise: in its purest framework, it is impossible — and the loopholes that make it practicable reveal exactly what must be known in advance in order to learn.

The Idea in One Sentence

Inference is not a separate operation but the extreme case of recognition when knowledge of the grammar drops to zero — a regime where Gold’s theorem forbids learning from positive examples alone, and where learning equates to compressing.

Explaining Step by Step

1. Three Uses, Not Two

Recall from L13:

Operation	Given	Sought
Generation	the grammar $$G$$	a string $$s$$
Recognition	$$G$$ + a string $$s$$	the structure of $$s$$
Inference	a corpus $\{s_1, \dots, s_k\}$	the grammar $$G$$

The difference is striking: in the first two, the grammar is known. In the third, it is precisely what we are looking for. Inference presupposes nothing — neither the grammar nor the structure. Everything is to be discovered.

2. Inference as the Extreme Case of Recognition

Rather than opposing recognition and inference, we can place them on the same axis: that of available grammatical knowledge, which we will denote as $\kappa$ (kappa).

TikZ diagram

Figure 1 — The knowledge gradient $\kappa$ . On the left ( $\kappa = 1$ ), the grammar is fully known: this is standard recognition. On the right ( $\kappa = 0$ ), there is no prior knowledge: this is pure inference. In between are intermediate regimes (set of candidate grammars, partial grammar with oracle). Gold’s wall marks a phase transition near $\kappa = 0$ : just before it, learning remains possible in principle; at $\kappa = 0$ , it becomes impossible for rich classes.

Inference is therefore not “something else” than recognition: it is what recognition becomes when all prior knowledge is removed. And this retreat is not gradual — it hits a wall.

3. Gold’s Theorem: Impossibility

In 1967, E. Mark Gold proved a foundational result, identification in the limit:

Theorem (Gold, 1967). Let there be a class of languages containing all finite languages and at least one infinite language (a so-called superfinite class). Then this class is not identifiable in the limit from positive data alone: no algorithm can guarantee convergence to the correct grammar by seeing only valid examples.

Deciphering — “identification in the limit.” The learner receives examples one by one, and proposes a grammar after each. They “identify in the limit” if, after a finite number of examples, they eventually propose the correct grammar and stop changing it. Gold proves that for interesting classes (which include regular languages and beyond), no learner can guarantee this.

Why? Because with positive examples alone, nothing ever prevents overgeneralization: if we have seen ab, aabb, aaabbb, should we conjecture $\{a^n b^n\}$ , or $\{a^m b^n\}$ , or “all strings”? Without counterexamples (strings outside the language), we cannot decide. This is the phase transition of Figure 1: a hint of prior knowledge makes learning possible; zero knowledge makes it impossible.

4. Why It Is Qualitatively Harder

The gap between recognition and inference is deeper than the one between generation and recognition. Two results quantify this.

Pitt & Warmuth (1993): the problem of finding the smallest deterministic finite automaton (DFA) consistent with the data is not only NP-hard — it is not approximable within any polynomial. Even finding an approximate solution is out of reach.

Kearns & Valiant (1994): learning DFAs in the PAC framework is as hard as breaking RSA encryption. In other words, an efficient learner would provide an efficient algorithm to break public-key cryptography — which is believed to be impossible. This is a cryptographic lower bound on the difficulty of inference.

Deciphering — PAC. Probably Approximately Correct: a learning framework where one requires, with high probability, an approximately correct hypothesis. This is the standard framework of learning theory.

Where parsing is polynomial for common grammars, inference hits walls of impossibility and cryptography. It is not “slower”: it is of a different nature.

5. The Loopholes — and What They Cost

Inference is therefore impossible within Gold’s framework. Yet we do learn languages, and machines do induce models. How? By changing the framework — and each change reveals what must be known in advance.

The probabilistic / Bayesian framework. If we learn stochastic grammars (PCFGs — see B1) with a prior distribution, this prior breaks the symmetry that Gold exploited: it penalizes overly general grammars. Learning becomes possible in principle (de la Higuera 2010, ch. 16), but the computational cost remains prohibitive for large grammars.
The membership oracle. Angluin’s (1987) L\ algorithm efficiently learns regular languages — but it requires an oracle* that answers “is this string in the language?”. Yet answering this question is… recognition. Efficient inference presupposes recognition. We are at the intermediate $\kappa$ point of Figure 1, not at $\kappa = 0$ .

Each loophole confirms the lesson: we do not learn from nothing. We need a bias (the prior) or a recognition capability (the oracle).

6. To Infer is to Compress

There is an elegant way to understand why inference is the inverse of generation. The MDL (Minimum Description Length, Rissanen 1978; Grünwald 2007) principle reformulates learning as compression: the best grammar is the one that minimizes the total description length = size of the grammar + size of the data encoded with it.

Gregory Chaitin states it bluntly:

« A theory is good to the extent that it compresses the data into a much smaller set of theoretical assumptions. The greater the compression, the better! »

— Chaitin (2005)

Hence a clear symmetry with L18:

Generating = decompressing: expanding a compact description (the grammar) into data (strings).
Inferring = compressing: finding the compact description that accounts for the data.

And compressing is notoriously more difficult than decompressing. The compression/decompression asymmetry is another manifestation of the same divide — the one running through this entire article.

7. In Music: The Ear Learning a Style

Musical grammatical inference is the inverse problem of composition by grammar. Instead of writing the rules of a style and generating, we start with a corpus (improvisations, a repertoire) and infer the rules. This is the subject of M10: Gold, Sequitur, MDL, COSIATEC, and Bernard Bel’s QAVAID system for analyzing bols.

Gold’s wall is concrete here: a corpus consists of positive examples (what has been played), rarely explicit counterexamples (what would be “wrong” in the style). Hence the difficulty of inferring a style grammar without a strong bias — exactly what theory predicts.

And the human ear? It manages because it is never at $\kappa = 0$ : it arrives with biases (preference for regularity, tonal expectations — see L19) and an implicit oracle (the feeling that “it sounds right” or “wrong”). The musical learner does not infer in a vacuum: they infer from a rich $\kappa > 0$ .

Key Takeaways

Inference is the third use of a grammar: guessing the rules from examples alone.
It is not a separate domain but the extreme case of recognition, at the end of the knowledge gradient $\kappa$ (from known grammar to tabula rasa).
Gold’s theorem (1967): from positive data alone, rich classes are not identifiable — a phase transition, not a gradual degradation.
It is qualitatively harder: non-approximable (Pitt & Warmuth), as difficult as breaking RSA (Kearns & Valiant).
The loopholes (Bayesian prior, L*’s membership oracle) confirm that we do not learn from nothing: we need a bias or a recognition capability.
Inferring = compressing (MDL); generating = decompressing. The same asymmetry, viewed from the perspective of information theory.
In music, Gold’s wall explains the difficulty of inducing a style grammar; the human ear escapes it because it is never at $\kappa = 0$ .

Glossary

Grammatical inference (induction): discovering a grammar from a corpus of examples, without it being given.
Identification in the limit: Gold’s criterion; the learner converges to the correct grammar after a finite number of examples and no longer changes it.
Superfinite class: a class of languages containing all finite languages and at least one infinite language; not identifiable from positive data alone.
Positive data: examples belonging to the language (without explicit counterexamples of what is excluded from it).
Knowledge gradient ( $\kappa$ ): the amount of available grammatical knowledge, from 1 (known grammar, recognition) to 0 (tabula rasa, inference).
PAC (Probably Approximately Correct): a learning framework requiring an approximately correct hypothesis with high probability.
Membership oracle: a procedure answering “is this string in the language?”; required by L*, thus presupposing recognition.
MDL (Minimum Description Length): the principle that the best model minimizes the total size (model + encoded data). Learning = compressing.

Links in the Series

L13 — Generate or recognize — the triad and dimension D5
M10 — Musical grammar induction — inference applied to music
L18 — The sign reversal — compression vs. decompression
L19 — The P-chain — learning through error prediction

Prerequisites: L13, L1
Reading time: 11 min
Tags: #grammatical-inference #Gold #MDL #compression #learning #knowledge-gradient

← Back to index