B1) PCFG

When Grammars Roll the Dice

What if a grammar could choose its rules at random, but not arbitrarily?

Where does this article fit in?

This article opens the BP3 series (Bol Processor, version 3 — see I2). It extends context-free grammars (L2) by adding probabilities — the fundamental mechanism that allows the Bol Processor to generate music that is varied yet controlled.


Sidebar: What is a PCFG?

PCFG stands for Probabilistic Context-Free Grammar. It is an extension of classic context-free grammars (CFG) where each production rule is associated with a probability. The term “context-free” means that the replacement of a symbol does not depend on its surroundings: A can be replaced by x y z whether A is at the beginning, middle, or end of the string.


Why is this important?

Imagine you are writing a grammar to generate sentences in French. You have several ways to form a noun phrase: “le chat” (the cat), “un petit chat noir” (a small black cat), “ce vieux matou de gouttière” (this old alley cat). Technically, all these forms are correct. But in real life, “le chat” is much more frequent than “ce vieux matou de gouttière”.

A classic context-free grammar (CFG) treats all alternatives equally — it doesn’t know that one phrasing is more probable than another. Probabilistic grammars (PCFG) correct this shortcoming by associating a probability with each rule.

Result: we can not only recognize if a sentence is grammatical, but also:

  • Calculate the probability that it will be produced
  • Choose the most likely analysis when several are possible
  • Generate sequences with realistic frequencies

In music, this idea makes perfect sense. A composer who improvises does not choose notes purely at random: certain progressions are more “natural” than others depending on the style. PCFGs make it possible to model these statistical tendencies.

The idea in one sentence

A PCFG is a context-free grammar where each production rule has a probability, and where the probabilities of all alternatives for the same symbol sum to 1.


Let’s explain step by step

The structure of a PCFG

A PCFG uses exactly the same structure as a CFG, with four components:

  • Non-terminal symbols (N): intermediate variables that will be replaced. Examples: S (sentence), NP (Noun Phrase), VP (Verb Phrase). By convention, they are written in uppercase.
  • Terminal symbols (T): the final elements that appear in the result. Examples: the words “le” (the), “chat” (cat), “mange” (eats). By convention, they are written in lowercase or enclosed in quotes.
  • Start symbol: the initial non-terminal from which all generation begins (usually S for Sentence).
  • Production rules: instructions of the form A → x which state “A can be replaced by x”.

The difference from a classic CFG? In a PCFG, each rule is annotated with a probability in square brackets [p], where p is a number between 0 and 1.

Example 1: A simple sentence grammar

 

S  → NP VP     [0.8]    # A sentence = noun phrase + verb phrase (80%)
S  → VP        [0.2]    # A sentence = just a verb phrase (20%)

NP → Det N     [0.6]    # Noun phrase = determiner + noun (60%)
NP → N         [0.4]    # Noun phrase = just a noun (40%)

VP → V NP      [0.7]    # Verb phrase = verb + complement (70%)
VP → V         [0.3]    # Verb phrase = just a verb (30%)

Det → le       [0.5]    # The determiner is "le" (50%)
Det → un       [0.5]    # The determiner is "un" (50%)

N  → chat      [0.6]    # The noun is "chat" (60%)
N  → chien     [0.4]    # The noun is "chien" (40%)

V  → mange     [0.5]    # The verb is "mange" (50%)
V  → dort      [0.5]    # The verb is "dort" (50%)

 

Sidebar: Reading a PCFG rule

The notation S → NP VP [0.8] reads as:

  • “S can be replaced by NP followed by VP”
  • “This rule has a probability of 0.8 (80%)”

It’s like a loaded die: if you have two ways to expand S, with probabilities 0.8 and 0.2, imagine a 10-sided die where 8 sides yield the first rule and 2 sides yield the second.

Normalization check:

For each non-terminal, the sum of the probabilities of its rules must equal 1:

  • S: 0.8 + 0.2 = 1.0
  • NP: 0.6 + 0.4 = 1.0
  • VP: 0.7 + 0.3 = 1.0
  • Det: 0.5 + 0.5 = 1.0
  • N: 0.6 + 0.4 = 1.0
  • V: 0.5 + 0.5 = 1.0
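
To make this check concrete, here is a minimal Python sketch. The dictionary layout and the check_normalization helper are illustrative choices of ours, not part of any particular tool:

# Example 1 grammar: each non-terminal maps to a list of (expansion, probability) pairs.
GRAMMAR = {
    "S":   [(["NP", "VP"], 0.8), (["VP"], 0.2)],
    "NP":  [(["Det", "N"], 0.6), (["N"], 0.4)],
    "VP":  [(["V", "NP"], 0.7), (["V"], 0.3)],
    "Det": [(["le"], 0.5), (["un"], 0.5)],
    "N":   [(["chat"], 0.6), (["chien"], 0.4)],
    "V":   [(["mange"], 0.5), (["dort"], 0.5)],
}

def check_normalization(grammar, tolerance=1e-9):
    """Verify that the rule probabilities of each non-terminal sum to 1."""
    for symbol, rules in grammar.items():
        total = sum(p for _, p in rules)
        if abs(total - 1.0) > tolerance:
            raise ValueError(f"{symbol}: probabilities sum to {total}, not 1")

check_normalization(GRAMMAR)   # passes silently: every sum is 1 (within tolerance)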

Calculating the probability of a sentence:

Let’s take “le chat dort” (the cat sleeps). A derivation is the sequence of rule applications that transforms the start symbol S into the final sentence. Here’s how to calculate its probability step by step:

 

Step 1: S → NP VP                  Probability of this rule: 0.8
Step 2: NP → Det N (in NP VP)      Probability of this rule: 0.6
Step 3: Det → le                   Probability of this rule: 0.5
Step 4: N → chat                   Probability of this rule: 0.6
Step 5: VP → V                     Probability of this rule: 0.3
Step 6: V → dort                   Probability of this rule: 0.5

 

The total probability is the product of all probabilities (as for independent events in probability theory):

 

P("le chat dort") = 0.8 × 0.6 × 0.5 × 0.6 × 0.3 × 0.5 = 0.0216

 

Sidebar: Why multiply?

Imagine flipping a coin multiple times. The probability of getting heads then tails then heads is: P(heads) × P(tails) × P(heads) = 0.5 × 0.5 × 0.5 = 0.125

It’s the same principle: each rule choice is independent, so we multiply the probabilities. The longer the derivation (more rules applied), the smaller the final probability.

This means there is approximately a 2.16% chance of generating exactly this sentence with this grammar. That may seem low, but it is to be expected: this grammar can generate dozens of different sentences, and the total probability is shared among all of them.
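
For readers who want to verify the arithmetic, here is the same computation in two lines of Python (a sketch: the list of six numbers simply mirrors the six steps above):

from math import prod

# Probabilities of the six rules applied to derive "le chat dort":
# S → NP VP, NP → Det N, Det → le, N → chat, VP → V, V → dort
derivation = [0.8, 0.6, 0.5, 0.6, 0.3, 0.5]

print(prod(derivation))   # ≈ 0.0216, i.e. about a 2.16% chance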

Example 2: A probabilistic tabla grammar

The Bol Processor was created to model compositions for the tabla, the two-drum instrument of Indian classical music [KippenBel1989]. Percussion syllables (bols) like dha, dhin, tin, ta are the natural terminals of these grammars. Here is a simplified PCFG inspired by the work of Bel and Kippen [BelKippen1992a]:

 

Phrase    → Motif_res Motif_res  [0.6]    # Two resonant motifs (open section)
Phrase    → Motif_res Motif_sec  [0.3]    # Resonant then dry (contrast)
Phrase    → Motif_sec            [0.1]    # Dry motif alone (rare, khali section)

Motif_res → dha dhin dhin dha    [0.5]    # Common resonant motif (tintāl theka)
Motif_res → dha tirakita dhin    [0.3]    # Ornate motif (tirakita = fast sequence)
Motif_res → ghe dha dhin         [0.2]    # Motif with bāyāṅ attack

Motif_sec → tin ta tin            [0.6]    # Simple dry motif
Motif_sec → ta tirakita ta        [0.4]    # Ornate dry motif

 

This grammar expresses regularities that Kippen and Bel observed among the tabla masters of Lucknow:

  • 60% of phrases chain two resonant motifs (dha, dhin bols, struck with both hands — rich, open sound)
  • Dry motifs (tin, ta bols, struck with one hand — sharp, clear sound) are less frequent and typically appear in the khali section (non-resonant section) of the rhythmic cycle
  • The dha dhin dhin dha motif is the most probable (50%) because it is the basic motif of the theka (ostinato) of the tintāl

A possible generation:

Phrase → Motif_res Motif_sec          [0.3]
       → dha dhin dhin dha Motif_sec  [0.5]
       → dha dhin dhin dha tin ta tin  [0.6]

P = 0.3 × 0.5 × 0.6 = 0.09

 

This sequence — a resonant motif followed by a dry motif — has a 9% chance of being generated. It reproduces the typical contrast between the tālī (resonant) and khali (dry) sections of a tabla cycle.
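
To watch this grammar produce varied phrases, here is a small generation sketch using weighted random choice. The dictionary representation and the generate function are our own illustration, not a BP3 format:

import random

# Illustrative sketch: the tabla grammar as a Python dict (not BP3 code).
TABLA = {
    "Phrase":    [(["Motif_res", "Motif_res"], 0.6),
                  (["Motif_res", "Motif_sec"], 0.3),
                  (["Motif_sec"], 0.1)],
    "Motif_res": [(["dha", "dhin", "dhin", "dha"], 0.5),
                  (["dha", "tirakita", "dhin"], 0.3),
                  (["ghe", "dha", "dhin"], 0.2)],
    "Motif_sec": [(["tin", "ta", "tin"], 0.6),
                  (["ta", "tirakita", "ta"], 0.4)],
}

def generate(symbol, grammar=TABLA):
    """Expand a symbol recursively, choosing each rule according to its probability."""
    if symbol not in grammar:          # terminal bol: nothing left to rewrite
        return [symbol]
    expansions, probs = zip(*grammar[symbol])
    chosen = random.choices(expansions, weights=probs)[0]
    return [bol for part in chosen for bol in generate(part, grammar)]

print(" ".join(generate("Phrase")))    # one possible output: "dha dhin dhin dha tin ta tin"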

Why normalize to 1?

The constraint that probabilities sum to 1 for each non-terminal is not arbitrary. It is a fundamental mathematical requirement.

Sidebar: The Urn Analogy

Imagine an urn containing balls of different colors. If you say “there’s a 70% chance of drawing a red ball and a 50% chance of drawing a blue ball,” it doesn’t make sense: 70% + 50% = 120%! The probabilities of mutually exclusive events (you draw ONE ball, not two) must sum to 100%.

It’s exactly the same for the rules of a non-terminal: when you expand S, you choose ONE rule from the alternatives. These choices are mutually exclusive, so their probabilities must sum to 1.

Normalization ensures that:

  1. The distribution is mathematically valid: values can be interpreted as true probabilities.
  2. The sum of probabilities of all possible sentences equals 1 (for grammars that always terminate). Each generated sentence “consumes” a portion of this total probability.
  3. Analyses can be compared: if a sentence has two possible derivation trees (ambiguous grammar), the one with the highest probability is “preferred”.

Counter-example — what happens without normalization?

If we had:

S → A    [0.7]
S → B    [0.5]

 

The sum is 1.2 — this is not a valid probability distribution. Concretely, three things can happen depending on the system used:

  1. Pure rejection: the parser rejects the grammar as malformed (this is the strictest choice).
  2. Implicit renormalization: the system divides each weight by the total sum, giving P(A) = 0.7/1.2 ≈ 0.583 and P(B) = 0.5/1.2 ≈ 0.417. This works in practice, but the probabilities “declared” by the grammar’s author no longer correspond to what is actually applied — a source of confusion.
  3. Naive use: the system draws a random number between 0 and 1 and iterates through the rules. If the sum exceeds 1 (as here, 1.2), the last rule “overflows” the [0, 1] segment: B should cover the interval [0.7, 1.2], but draws never exceed 1, so its effective probability drops to 0.3 instead of 0.5. Conversely, if the sum is less than 1 (e.g., 0.3 + 0.4 = 0.7), there is a “dead zone” of 30% where no rule is selected — the derivation silently fails. In both cases, the actual behavior diverges from the composer’s intention (see the sketch after this discussion).

In formal theory, only option 1 is correct: a PCFG requires normalization. BP3, as we will see, chooses option 2, but does so openly and by design.
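
To make the third failure mode concrete, here is a short sketch of such a naive cumulative draw applied to non-normalized weights. It is purely illustrative and does not correspond to any real parser:

import random
from collections import Counter

def naive_choice(rules):
    """Draw r in [0, 1) and walk the rules cumulatively; correct only if the weights sum to 1."""
    r = random.random()
    cumulative = 0.0
    for name, weight in rules:
        cumulative += weight
        if r < cumulative:
            return name
    return None   # dead zone: no rule was selected

# Weights summing to 1.2: B only wins on draws in [0.7, 1.0), an effective probability of 0.3.
overflow = [("A", 0.7), ("B", 0.5)]
# Weights summing to 0.7: draws in [0.7, 1.0) select nothing, so the derivation silently fails.
dead_zone = [("A", 0.3), ("B", 0.4)]

print(Counter(naive_choice(overflow) for _ in range(100_000)))   # roughly {'A': 70000, 'B': 30000}
print(Counter(naive_choice(dead_zone) for _ in range(100_000)))  # roughly {'A': 30000, 'B': 40000, None: 30000}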


BP3 vs PCFG: Two Philosophies of Randomness

BP3 (Bol Processor) also uses weights to choose between rules, but with fundamental differences.

BP3’s Static Weights (RND mode, random)

Note: BP3 does not use probabilities between 0 and 1 like PCFGs. It uses weights — arbitrary positive integers that express relative importance. BP3 automatically normalizes them into probabilities at runtime (option 2 of the counter-example above, but here adopted deliberately and by design).

 

gram#1[1] S → <3> A
gram#1[2] S → <2> B
gram#1[3] S → <1> C

 

The weights 3, 2, 1 do not sum to 1 — and that’s normal. BP3 calculates probabilities on the fly through renormalization:

  • P(A) = 3/(3+2+1) = 3/6 = 0.5
  • P(B) = 2/(3+2+1) = 2/6 ≈ 0.333
  • P(C) = 1/(3+2+1) = 1/6 ≈ 0.167

The result after normalization is indeed a valid distribution (sum = 1). The advantage of this approach: the composer reasons in relative proportions (“A is 3× more frequent than C”) without worrying about normalization. It’s more intuitive than manually calculating fractions that sum to 1.

So far, the result is mathematically equivalent to a PCFG — only the interface changes (relative weights vs. explicit probabilities).
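
In code, converting BP3-style integer weights into probabilities is a single division. The sketch below illustrates the principle only; it is not BP3's actual implementation, and the function name is ours:

def weights_to_probabilities(weights):
    """Normalize relative integer weights into a probability distribution."""
    total = sum(weights.values())
    return {rule: w / total for rule, w in weights.items()}

print(weights_to_probabilities({"A": 3, "B": 2, "C": 1}))
# {'A': 0.5, 'B': 0.3333333333333333, 'C': 0.16666666666666666}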

BP3’s Dynamic Weights: The Big Difference

 

gram#1[1] S → <50-12> A S
gram#1[2] S → <1> fin

 

The notation <50-12> means: “initial weight 50, decremented by 12 with each use”.

Evolution over derivations:

Usage    Weight of A    Weight of fin    P(A)
1        50             1                98%
2        38             1                97%
3        26             1                96%
4        14             1                93%
5        2              1                67%
6        0              1                0%

After 5 uses, rule A becomes impossible and fin is forced.
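
The table above can be reproduced with a few lines of Python simulating the decrement. This is our own simulation of the <50-12> notation, not BP3 source code:

# Simulate the rule S → <50-12> A S versus S → <1> fin (illustrative sketch).
weight_A, decrement, weight_fin = 50, 12, 1

for usage in range(1, 7):
    p_A = weight_A / (weight_A + weight_fin)
    print(f"usage {usage}: weight A = {weight_A:2d}, P(A) = {p_A:.0%}")
    weight_A = max(0, weight_A - decrement)   # the weight never drops below zero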

What this allows:

  • Modeling the exhaustion of a musical idea
  • Forcing the termination of recursions
  • Creating structures of controlled length

Why this is no longer a PCFG:

A PCFG has fixed probabilities. The same non-terminal always has the same distribution, regardless of the derivation history. This is called the Markov property (or “memoryless”): the probability of a choice depends only on the current state, not on how it was reached.

Sidebar: Markovian vs. Non-Markovian Process

Markovian (memoryless): Imagine a deck of cards where, after each draw, you put the card back in the deck and shuffle. It doesn’t matter if you drew the ace of spades in the previous turn — the probability of drawing it again remains exactly the same (1/52). The deck “doesn’t remember” past draws. A PCFG works this way: S always has the same probabilities.

Non-Markovian (with memory): Now imagine the same game, but without putting the cards back. You draw the ace of spades: 51 cards remain. You draw the king of hearts: 50 cards remain. With each draw, the composition of the deck changes, and the probabilities of all remaining cards are modified. The deck “remembers” the complete history. BP3’s dynamic weights work this way: each use of a rule modifies its weight, and thus the probability distribution for the next choice.

BP3’s dynamic weights introduce memory that violates this assumption. In formal terms, BP3 with dynamic weights is a stochastic process (involving an element of randomness) that is non-Markovian: the probability of a choice depends on the complete history of past choices.

Comparative Table

Aspect                  PCFG                    BP3 (Dynamic Weights)
Probabilities           Fixed                   Time-varying
Memory                  None (Markov)           Present (counters)
Normalization           Explicit (sum = 1)      Implicit (calculated on the fly)
Termination             Not guaranteed          Can be forced
Statistical Analysis    Well understood         More complex

Applications of PCFGs

1. Natural Language Processing

PCFGs are at the heart of statistical parsers. When a statistical parser, like those long used in machine translation systems such as Google Translate, analyzes a sentence, it relies (among other things) on probabilities to choose the correct grammatical structure.

Example of resolved ambiguity:
“I saw the man with the telescope”

  • Interpretation 1: I was using a telescope to see the man
  • Interpretation 2: The man was holding a telescope

A PCFG trained on real text will statistically prefer the most frequent interpretation.

2. Procedural Text Generation

Video games use PCFGs to generate descriptions, dialogues, and place names. Probabilities ensure variety while maintaining a consistent “style.”

3. Musical Modeling

The historical application of probabilistic grammars to music is the work of Bernard Bel and James Kippen at CNRS/GRTC (Marseille) in the 1980s-90s. Their project — the Bol Processor — originated from modeling tabla compositions from the Lucknow gharānā (school) [KippenBel1989]. By collaborating with Indian master percussionists, they showed that tabla bol sequences exhibit a grammatical structure: recurring motifs, variations, and proportions of appearance can be captured by probabilistic grammars [BelKippen1992a]. This pioneering approach gave birth to BP3’s weighting mechanism, which we will detail in subsequent articles.

More broadly, PCFGs have been used to:

  • Analyze the style of composers (what chord progressions are typical of Bach?)
  • Generate jazz improvisations (Keller & Morrison, 2007)
  • Create automatic harmonizations

Sidebar: The Indian Origin of BP3 Grammars

In the 1980s, ethnomusicologist James Kippen (University of Toronto) was studying the tabla of the Lucknow gharānā — a percussion tradition passed down orally for generations. He met Bernard Bel (CNRS, Marseille), a computer scientist specializing in knowledge representation. Together, they had a foundational insight: the sequences of bols (percussion syllables) that master tabla players compose and improvise are not random — they follow implicit grammatical rules with observable frequencies of appearance [KippenBel1989].

Their “bol processor” — the Bol Processor — was the first system to apply probabilistic grammars to non-Western music. The foundational article [BelKippen1992a] describes how derivation modes and probabilistic weights allow modeling the richness of tabla compositions. Thirty years later, this approach has influenced major tools like TidalCycles (Alex McLean cites [Bel2001] in 8+ publications) and remains a reference in surveys on algorithmic composition [Fernández & Vico, 2013].

4. Bioinformatics

The analysis of RNA (Ribonucleic Acid) sequences uses SCFGs (Stochastic Context-Free Grammars, the name under which PCFGs are known in bioinformatics) to model secondary structures.


Key Takeaways

  • A PCFG associates a probability with each production rule of a context-free grammar.
  • Probabilities must be normalized: for each non-terminal, the sum of the probabilities of its alternatives equals 1.
  • The probability of a derivation is the product of the probabilities of all rules used.
  • PCFGs have static (fixed) probabilities, unlike BP3’s dynamic weights which evolve during derivation.
  • PCFGs are used in computational linguistics, procedural generation, and musical modeling.
  • A note on direction: this article describes probabilities in the production (generation) direction. BP3 also has an analytical mode (ANAL) in which weights are learned: each rule used while recognizing a sequence has its weight incremented — a learning mechanism built directly into the grammar (B8).

To go further

  • Foundational article: Booth & Thompson (1973), “Applying Probability Measures to Abstract Languages”
  • Reference book: Jurafsky & Martin, Speech and Language Processing, chapter on PCFGs
  • Practical tutorial: Stanford NLP – PCFG Parsing
  • BP3 and weights: Bol Processor – Grammar Control
  • Grammars and tabla: Bel, B. & Kippen, J. (1992), “Modelling Music with Grammars: Formal Language Representation in the Bol Processor” — the main article on PCFGs applied to tabla (46 citations)
  • The genesis: Kippen, J. & Bel, B. (1989), “The Identification and Modelling of a Percussion ‘Language’” — how bols were formalized as a language
  • The three directions of BP3: B8 — production, analysis, and grammatical inference, including weight learning in analytical mode

Glossary

  • Ambiguity (grammar): A property of a grammar where the same string can be derived in several different ways (multiple possible parse trees).
  • Derivation tree (parse tree): A visual representation of a derivation, with the start symbol at the root and terminals at the leaves.
  • Bol: A mnemonic syllable of the Indian tabla (e.g., dha, dhin, tin, ta). The “Bol” in “Bol Processor”.
  • CFG (Context-Free Grammar): A grammar where each rule replaces a single non-terminal symbol, independently of the surrounding context.
  • Context-free: A property of a rule that applies regardless of the symbol’s environment. The replacement of A by xyz does not depend on what precedes or follows A.
  • Derivation: A sequence of rule applications starting from the initial symbol (S) down to a string containing only terminals.
  • Non-terminal: An intermediate symbol (variable) that will be replaced during derivation. By convention, written in uppercase (S, NP, VP).
  • Normalization: A mathematical constraint that the probabilities of a set of mutually exclusive alternatives sum to 1 (100%).
  • PCFG (Probabilistic CFG): A context-free grammar where each production rule is associated with a probability.
  • Dynamic weight: In BP3, a numerical weight associated with a rule that changes (usually decreases) after each use of that rule.
  • Markovian process (memoryless): A stochastic process where the probability of a future event depends only on the present state, not on the history.
  • Non-Markovian process (with memory): A stochastic process where the probability of an event depends on the history of past events.
  • Production rule: An instruction of the form A → x indicating that A can be replaced by x.
  • Khali: The “empty” (non-resonant) section of an Indian tāla cycle, where dry bols (tin, ta) replace resonant bols (dha, dhin).
  • Stochastic: Involving a random element. A stochastic process is a process whose evolution depends on chance.
  • Terminal: A final symbol that appears in the output and cannot be rewritten further. By convention, written in lowercase.
  • Theka: The basic pattern of an Indian tāla, played as an ostinato by the tabla. A fixed sequence of bols that defines the character of the rhythmic cycle.
  • Tintāl: The most common tāla (rhythmic cycle) in North Indian classical music: 16 beats in 4 groups of 4.

Prerequisites: L2, I2
Reading time: 8 min
Tags: #PCFG #probabilities #grammars #BP3 #generation