B1) PCFG

When grammars play dice

What if a grammar could choose its rules at random, but not just any which way?

Where does this article fit in?

This article opens the BP3 series (Bol Processor, version 3 — see I2). It extends context-free grammars (L2) by adding probabilities — the fundamental mechanism that allows the Bol Processor to generate varied music while remaining controlled.

Box: What is a PCFG?

PCFG stands for Probabilistic Context-Free Grammar. It is an extension of classic context-free grammars (CFG) where each production rule is associated with a probability. The term “context-free” means that replacing a symbol does not depend on what surrounds it: you can replace A with x y z whether A is at the beginning, middle, or end of the string.

Why is it important?

Imagine you are writing a grammar to generate sentences in English. You have several ways to form a noun phrase: “the cat”, “a small black cat”, “that old alley cat”. Technically, all these forms are correct. But in real life, “the cat” is much more frequent than “that old alley cat”.

A classic context-free grammar (CFG) treats all alternatives equally — it doesn’t know that one phrasing is more likely than another. Probabilistic grammars (PCFG) correct this shortcoming by associating a probability with each rule.

Result: we can not only recognize if a sentence is grammatical, but also:

Calculate what the probability is that it was produced
Choose the most likely analysis when several are possible
Generate sequences with realistic frequencies

In music, this idea makes perfect sense. A composer who improvises does not choose notes at pure random: certain progressions are more “natural” than others depending on the style. PCFGs allow these statistical trends to be modeled.

The idea in one sentence

A PCFG is a context-free grammar where each production rule has a probability, and where the probabilities of all alternatives for the same symbol sum to 1.

Explaining step by step

The structure of a PCFG

A PCFG uses exactly the same structure as a CFG, with four components:

Non-terminal symbols (N): intermediate variables that will be replaced. Examples: S (sentence), NP (Noun Phrase), VP (Verb Phrase). By convention, they are written in uppercase.

Terminal symbols (T): the final elements that appear in the result. Examples: the words “the”, “cat”, “eats”. By convention, they are written in lowercase or in quotes.

Start symbol: the initial non-terminal from which all generation begins (usually S for Sentence).

Production rules: instructions of the form $A \to x$ that say “A can be replaced by x”.

The difference with a classic CFG? In a PCFG, each rule is annotated with a probability in brackets $$[p]$$ , where $$p$$ is a number between 0 and 1.

Example 1: a grammar of simple sentences

S  → NP VP     [0.8]    # A sentence = noun phrase + verb phrase (80%)
S  → VP        [0.2]    # A sentence = just a verb phrase (20%)

NP → Det N     [0.6]    # Noun phrase = determiner + noun (60%)
NP → N         [0.4]    # Noun phrase = just a noun (40%)

VP → V NP      [0.7]    # Verb phrase = verb + object (70%)
VP → V         [0.3]    # Verb phrase = just a verb (30%)

Det → the      [0.5]    # The determiner is "the" (50%)
Det → a        [0.5]    # The determiner is "a" (50%)

N  → cat       [0.6]    # The noun is "cat" (60%)
N  → dog       [0.4]    # The noun is "dog" (40%)

V  → eats      [0.5]    # The verb is "eats" (50%)
V  → sleeps    [0.5]    # The verb is "sleeps" (50%)

Box: Reading a PCFG rule

The notation $S \to NP\ VP\ [0.8]$ is read as:

“S can be replaced by NP followed by VP”

“This rule has a probability of 0.8 (80%)”

It’s like a loaded die: if you have two ways to expand S, with probabilities 0.8 and 0.2, imagine a 10-sided die where 8 sides give the first rule and 2 sides the second.

Normalization check:

For each non-terminal, the sum of the probabilities of its rules must equal 1:

$$S : 0.8 + 0.2 = 1.0$$
$$NP : 0.6 + 0.4 = 1.0$$
$$VP : 0.7 + 0.3 = 1.0$$
$$Det : 0.5 + 0.5 = 1.0$$
$$N : 0.6 + 0.4 = 1.0$$
$$V : 0.5 + 0.5 = 1.0$$

Calculating the probability of a sentence:

Let’s take “the cat sleeps”. A derivation is the sequence of rule applications that transforms the start symbol S into the final sentence. Here is how to calculate its probability step by step:

Step 1: S → NP VP                  Probability of this rule: 0.8
Step 2: NP → Det N (within NP VP)  Probability of this rule: 0.6
Step 3: Det → the                  Probability of this rule: 0.5
Step 4: N → cat                    Probability of this rule: 0.6
Step 5: VP → V                     Probability of this rule: 0.3
Step 6: V → sleeps                 Probability of this rule: 0.5

The total probability is the product of all probabilities (as for independent events in probability):

P("the cat sleeps") = 0.8 × 0.6 × 0.5 × 0.6 × 0.3 × 0.5 = 0.0216

Box: Why multiply?

Imagine flipping a coin several times. The probability of getting heads then tails then heads is: $P(\text{heads}) \times P(\text{tails}) \times P(\text{heads}) = 0.5 \times 0.5 \times 0.5 = 0.125$

It’s the same principle: each rule choice is independent, so we multiply the probabilities. The longer the derivation (the more rules applied), the smaller the final probability.

That’s about a 2.16% chance of generating exactly this sentence with this grammar. This might seem low, but it’s consistent: this grammar can generate dozens of different sentences.

Example 2: a probabilistic tabla grammar

The Bol Processor was created to model tabla compositions — the two-headed drum of Indian classical music [KippenBel1989]. Percussion syllables (bols) like dha, dhin, tin, ta are the natural terminals of these grammars. Here is a simplified PCFG inspired by the work of Bel and Kippen [BelKippen1992a]:

Sentence    → Motif_res Motif_res  [0.6]    # Two resonant motifs (open section)
Sentence    → Motif_res Motif_sec  [0.3]    # Resonant then dry (contrast)
Sentence    → Motif_sec            [0.1]    # Dry motif alone (rare, khali section)

Motif_res → dha dhin dhin dha    [0.5]    # Common resonant motif (tintāl theka)
Motif_res → dha tirakita dhin    [0.3]    # Ornamented motif (tirakita = fast sequence)
Motif_res → ghe dha dhin         [0.2]    # Motif with bāyāṅ attack

Motif_sec → tin ta tin            [0.6]    # Simple dry motif
Motif_sec → ta tirakita ta        [0.4]    # Ornamented dry motif

This grammar expresses regularities that Kippen and Bel observed among Lucknow tabla masters:

60% of sentences link two resonant motifs (bols dha, dhin, struck with both hands — rich and open sound)
Dry motifs (bols tin, ta, struck with one hand — high and clear sound) are less frequent and typically appear in the khali section (section without resonance) of the rhythmic cycle
The motif dha dhin dhin dha is the most likely (50%) because it is the basic motif of the theka (ostinato) of tintāl

A possible generation:

Sentence → Motif_res Motif_sec          [0.3]
       → dha dhin dhin dha Motif_sec  [0.5]
       → dha dhin dhin dha tin ta tin  [0.6]

P = 0.3 × 0.5 × 0.6 = 0.09

This sequence — a resonant motif followed by a dry motif — has a 9% chance of being generated. It reproduces the typical contrast between the thali (resonant) and khali (dry) sections of a tabla cycle.

Why normalize to 1?

The constraint that probabilities sum to 1 for each non-terminal is not arbitrary. It is a fundamental mathematical requirement.

Box: The urn analogy

Imagine an urn containing balls of different colors. If you say “there is a 70% chance of drawing a red ball and a 50% chance of drawing a blue ball”, it makes no sense: 70% + 50% = 120%! The probabilities of mutually exclusive events (you draw ONE ball, not two) must sum to 100%.

It’s exactly the same for the rules of a non-terminal: when you expand S, you choose ONE rule among the alternatives. These choices are mutually exclusive, so their probabilities must sum to 1.

Normalization guarantees that:

The distribution is mathematically valid: the values can be interpreted as true probabilities.

The sum of the probabilities of all possible sentences equals 1 (for grammars that always terminate). Each generated sentence “consumes” a part of this total probability.

We can compare analyses: if a sentence has two possible derivation trees (ambiguous grammar), the one with the highest probability is “preferred.”

Counter-example — what happens without normalization?

If we had:

S → A    [0.7]
S → B    [0.5]

The sum is 1.2 — this is not a valid probability distribution. Concretely, three things can happen depending on the system used:

Pure rejection: the parser rejects the grammar as malformed (this is the strictest choice).

Implicit renormalization: the system divides each weight by the total sum, giving $P(A) = 0.7/1.2 \approx 0.583$ and $P(B) = 0.5/1.2 \approx 0.417$ . This works in practice, but the probabilities “declared” by the grammar author no longer correspond to what is actually applied — a source of confusion.

Naive usage: the system draws a random number between 0 and 1 and iterates through the rules. If the sum exceeds 1 (as here, 1.2), the last rule “overflows” the $$[0, 1]$$ segment: B should cover the interval $$[0.7, 1.2]$$ , but the draws never exceed 1, so its effective probability falls to 0.3 instead of 0.5. Conversely, if the sum is less than 1 (for example $$0.3 + 0.4 = 0.7$$ ), there is a “dead” zone of 30% where no rule is selected — the derivation fails silently. In both cases, the actual behavior diverges from the composer’s intention.

In formal theory, only option 1 is correct: a PCFG requires normalization. BP3, as we will see, chooses option 2 — but assumes it explicitly.

BP3 vs PCFG: two philosophies of randomness

BP3 (Bol Processor) also uses weights to choose between rules, but with fundamental differences.

Static weights in BP3 (RND mode)

Note: BP3 does not use probabilities between 0 and 1 like PCFGs. It uses weights — arbitrary positive integers that express relative importance. BP3 takes care of automatically normalizing them into probabilities at runtime (option 2 of the counter-example above, but assumed by design).

gram#1[1] S → <3> A
gram#1[2] S → <2> B
gram#1[3] S → <1> C

The weights 3, 2, 1 do not sum to 1 — and that’s normal. BP3 calculates probabilities on the fly by renormalization:

$P(A) = \frac{3}{3+2+1} = \frac{3}{6} = 0.5$
$P(B) = \frac{2}{3+2+1} = \frac{2}{6} \approx 0.333$
$P(C) = \frac{1}{3+2+1} = \frac{1}{6} \approx 0.167$

The result after normalization is indeed a valid distribution (sum = 1). The advantage of this approach: the composer reasons in relative proportions (“A is 3× more frequent than C”) without worrying about normalization. This is more intuitive than manually calculating fractions that sum to 1.

So far, the result is mathematically equivalent to a PCFG — only the interface changes (relative weights vs explicit probabilities).

Dynamic weights in BP3: the big difference

gram#1[1] S → <50-12> A S
gram#1[2] S → <1> end

The notation <50-12> means: “initial weight 50, decremented by 12 at each use.”

Evolution over derivations:

Use	Weight of A	Weight of end	$$P(A)$$
1	50	1	98%
2	38	1	97%
3	26	1	96%
4	14	1	93%
5	2	1	67%
6	0	1	0%

After 5 uses, rule A becomes impossible and end is forced.

What this allows:

Modeling the exhaustion of a musical idea
Forcing recursion termination
Creating structures of controlled length

Why it’s no longer a PCFG:

A PCFG has fixed probabilities. The same non-terminal always has the same distribution, regardless of the derivation history. This is called the Markov property (or “memoryless”): the probability of a choice depends only on the current state, not on how we got there.

Box: Markovian vs non-Markovian process

Markovian (memoryless): Imagine a card game where, after each draw, you put the card back in the deck and shuffle. It doesn’t matter if you drew the ace of spades in the previous round — the probability of drawing it again remains exactly the same ( $$1/52$$ ). The deck “does not remember” past draws. A PCFG works like this: S always has the same probabilities.

Non-Markovian (with memory): Now imagine the same game, but without putting the cards back. You draw the ace of spades: 51 cards remain. You draw the king of hearts: 50 cards remain. With each draw, the composition of the deck changes and the probabilities of all remaining cards are modified. The deck “remembers” the full history. BP3’s dynamic weights work like this: each use of a rule modifies its weight, and therefore the probability distribution for the next choice.

BP3’s dynamic weights introduce a memory that violates this assumption. Technically, BP3 with dynamic weights is an extended-state Markovian process: if we include the weight counters in the state, the process becomes Markovian again. But from the perspective of observable probabilities (those a listener perceives), each choice depends on the number of times each rule has already been used — which creates an effectively memory-based behavior, like the card game without replacement above.

Comparison Table

Aspect	PCFG	BP3 (dynamic weights)
Probabilities	Fixed	Variable over time
Memory	None (Markov)	With (counters)
Normalization	Explicit ( $\sum = 1$ )	Implicit (calculated on the fly)
Termination	Not guaranteed	Can be forced
Statistical analysis	Well understood	More complex

Applications of PCFGs

1. Natural Language Processing

PCFGs are at the heart of statistical syntactic parsers. When Google Translate analyzes a sentence, it uses (among other things) probabilities to choose the correct grammatical structure.

Example of resolved ambiguity:
“I saw the man with the telescope”

Interpretation 1: I was using a telescope to see the man
Interpretation 2: The man was holding a telescope

A PCFG trained on real text will statistically prefer the most frequent interpretation.

2. Procedural text generation

Video games use PCFGs to generate descriptions, dialogues, and place names. Probabilities ensure variety while maintaining a consistent “style.”

3. Musical modeling

The historical application of probabilistic grammars to music is the work of Bernard Bel and James Kippen at CNRS/GRTC (Marseille) in the 1980s-90s. Their project — the Bol Processor — was born from modeling tabla compositions from the Lucknow gharānā (school) [KippenBel1989]. By collaborating with Indian percussion masters, they showed that tabla bol sequences exhibit a grammatical structure: recurring motifs, variations, and appearance proportions can be captured by probabilistic grammars [BelKippen1992a]. It is this pioneering approach that gave birth to the BP3 weight mechanism, which we will detail in the following articles.

More broadly, PCFGs have been used to:

Analyze composers’ styles (which chord progressions are typical of Bach?)
Generate jazz improvisations (Keller & Morrison, 2007)
Create automatic harmonizations

Box: The Indian origin of BP3 grammars

In the 1980s, ethnomusicologist James Kippen (University of Toronto) was studying the tabla of the Lucknow gharānā — an oral percussion tradition passed down for generations. He met Bernard Bel (CNRS, Marseille), a computer scientist specializing in knowledge representation. Together, they had a founding intuition: the sequences of bols (percussion syllables) that tabla masters compose and improvise are not random — they follow implicit grammatical rules with observable frequencies of appearance [KippenBel1989].

Their “bol processor” — the Bol Processor — was the first system to apply probabilistic grammars to non-Western music. The founding article [BelKippen1992a] describes how derivation modes and probabilistic weights allow for modeling the richness of tabla compositions. Thirty years later, this approach has influenced major tools like TidalCycles (Alex McLean cites [Bel2001] in 8+ publications) and remains a reference in surveys on algorithmic composition [Fernández & Vico, 2013].

4. Bioinformatics

The analysis of RNA (Ribonucleic Acid) sequences uses SCFGs (Stochastic Context-Free Grammar, a variant of PCFGs adapted to bioinformatics) to model secondary structures.

Key takeaways

A PCFG associates a probability with each production rule of a context-free grammar.
Probabilities must be normalized: for each non-terminal, the sum of the probabilities of its alternatives equals 1.
The probability of a derivation is the product of the probabilities of all rules used.
PCFGs have static (fixed) probabilities, unlike BP3’s dynamic weights which evolve over the course of the derivation.
PCFGs are used in computational linguistics, procedural generation, and musical modeling.
Direction: this article describes probabilities in the sense of production (generation). The Bol Processor also has an analytical mode (ANAL) which was fully functional in BP2 — weights were learned by incrementation during recognition. This mode is not yet implemented in BP3 (Bernard Bel is working on its restoration — see B8).

To go further

Founding article: Booth & Thompson (1973), “Applying Probability Measures to Abstract Languages”
Reference book: Jurafsky & Martin, Speech and Language Processing, chapter on PCFGs
Practical tutorial: Stanford NLP – PCFG Parsing
BP3 and weights: Bol Processor – Grammar Control
Grammars and tabla: Bel, B. & Kippen, J. (1992), “Modelling Music with Grammars: Formal Language Representation in the Bol Processor” — the main article on PCFGs applied to tabla (46 citations)
The genesis: Kippen, J. & Bel, B. (1989), “The Identification and Modelling of a Percussion ‘Language'” — how bols were formalized as a language
The three directions of BP3: B8 — production (implemented), analysis (being restored), and grammatical inference (QAVAID, historical)

Glossary

Ambiguity (grammar): Property of a grammar where the same string can be derived in several different ways (multiple possible syntax trees).
Derivation tree (syntax tree): Visual representation of a derivation, with the start symbol at the root and terminals at the leaves.
Bol: Mnemonic syllable of the Indian tabla (e.g., dha, dhin, tin, ta). The “Bol” in “Bol Processor.”
CFG (Context-Free Grammar): A grammar where each rule replaces a single non-terminal symbol, regardless of the surrounding context.
Context-free: Property of a rule that applies regardless of the symbol’s environment. Replacing A with xyz does not depend on what precedes or follows A.
Derivation: Sequence of rule applications starting from the initial symbol (S) to a string containing only terminals.
Non-terminal: Intermediate symbol (variable) that will be replaced during derivation. By convention, written in uppercase (S, NP, VP).
Normalization: Mathematical constraint that the probabilities of a set of mutually exclusive alternatives sum to 1 (100%).
PCFG (Probabilistic CFG): Context-free grammar where each production rule is associated with a probability.
Dynamic weight: In BP3, a numerical weight associated with a rule that changes (usually decreases) after each use of that rule.
Markovian process (memoryless): Stochastic process where the probability of a future event depends only on the present state, not on the history.
Process with memory: Stochastic process where the probability of an event depends on the history. BP3 with dynamic weights is technically extended-state Markovian (counters are part of the state), but the observable behavior is memory-based.
Production rule: Instruction of the form $A \to x$ indicating that A can be replaced by x.
Khali: “Empty” section (without resonance) of an Indian tāla cycle, where dry bols (tin, ta) replace resonant bols (dha, dhin).
Stochastic: Involving a random element. A stochastic process is one whose evolution depends on chance.
Terminal: Final symbol that appears in the output and cannot be rewritten. By convention, written in lowercase.
Theka: Basic pattern of an Indian tāla, played as an ostinato by the tabla. A fixed sequence of bols that defines the character of the rhythmic cycle.
Tintāl: The most common tāla (rhythmic cycle) of North Indian classical music: 16 beats in 4 groups of 4.

Prerequisites: L2, I2
Reading time: 8 min
Tags: #PCFG #probabilities #grammars #BP3 #generation

← Back to index