L2) Context-Free Grammars

The Basis of Everything

How can a computer understand that 2 + 3 * 4 equals 14 and not 20? The answer lies in three letters: CFG.

Where Does This Article Fit In?

This article is part of the Formal Languages (L) series. It delves into Type 2 of the Chomsky hierarchy (see L1): context-free grammars, the foundation of most parsers (syntax analyzers) and compilers (programs that translate a source language into a target language).


Why It’s Important

Every time you write Python code, JSON, or even a formula in Excel, a program must understand the structure of what you’ve written. This understanding doesn’t happen by magic: it relies on precise rules that describe how elements fit together.

Context-Free Grammars (CFG) are the fundamental tool for describing these rules. They are at the heart of all compilers, all parsers, all tools that transform structured text into something usable. Without them, no Python, no JavaScript, no HTML… and no BP3 (Bol Processor 3, a musical grammar language — see I2), the musical notation we transpile (translate from a source language into another language at the same level of abstraction) to SuperCollider (an audio programming environment — see I3).

Understanding CFGs means understanding how languages are defined and how machines interpret them.

The Idea in One Sentence

A context-free grammar is a set of rewrite rules that describe how to construct all valid sentences of a language, without taking context into account.


Let’s Explain Step by Step

The Structure of a Grammar: G = (V, Σ, R, S)

A context-free grammar is defined by four elements, traditionally denoted G = (V, Σ, R, S):

Symbol | Name                      | Role                                    | Example
V      | Variables (non-terminals) | Abstract symbols that will be replaced  | Expr, Term, Factor
Σ      | Alphabet (terminals)      | Concrete symbols of the final language  | +, *, (, ), 0-9
R      | Production Rules          | How to replace variables                | Expr → Expr + Term
S      | Start Symbol              | Where the derivation begins             | Expr

Why this notation?

  • Σ (uppercase sigma) represents the alphabet, by mathematical tradition
  • The arrow → (also written -->) reads “rewrites to” or “can be replaced by”
  • The vertical bar | means “or” (alternative between several possibilities)

Thus, Expr → Expr + Term | Term reads: “An expression can be rewritten either as ‘expression + term’, or as ‘term’ alone.”

Why “context-free”?

The term “context-free” means that each rule can be applied independently of what surrounds the symbol. If we have the rule A → xy, we can always replace A with xy, regardless of what comes before or after A. This property is what makes CFGs efficiently parsable.

Example 1: Arithmetic Expressions

Let’s build a grammar for expressions like 2 + 3 * 4 or (1 + 2) * 3.

The four components:

 

V = { Expr, Term, Factor }             -- Variables
Σ = { +, *, (, ), 0, 1, ..., 9 }       -- Terminals
S = Expr                               -- Start Symbol
R = {                                  -- Rules
    Expr    → Expr + Term | Term
    Term    → Term * Factor | Factor
    Factor  → ( Expr ) | number
}
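
Before deriving anything by hand, it can help to see these same four components written down as plain data. Here is a minimal sketch in Python; the names GRAMMAR, SIGMA and START are illustrative choices made for this sketch, not part of any standard library.

# R: each variable maps to its list of alternatives;
# "number" stands for any digit 0-9, as in the grammar above.
GRAMMAR = {
    "Expr":   [["Expr", "+", "Term"], ["Term"]],
    "Term":   [["Term", "*", "Factor"], ["Factor"]],
    "Factor": [["(", "Expr", ")"], ["number"]],
}

V = set(GRAMMAR)                                    # the variables (non-terminals)
SIGMA = {sym for alts in GRAMMAR.values()
         for alt in alts for sym in alt} - V        # the terminals
START = "Expr"                                      # the start symbol

print(V)       # the three variables: Expr, Term, Factor
print(SIGMA)   # the five terminal symbols: +, *, (, ), number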

 

Derivation of 2 + 3 * 4:

Let’s start from the start symbol and apply the rules:

 

Expr
→ Expr + Term            (rule 1)
→ Term + Term            (rule 1, "Term" case)
→ Factor + Term          (rule 2, "Factor" case)
→ 2 + Term               (rule 3, "number" case)
→ 2 + Term * Factor      (rule 2)
→ 2 + Factor * Factor    (rule 2, "Factor" case)
→ 2 + 3 * Factor         (rule 3)
→ 2 + 3 * 4              (rule 3)

 

Why does it result in 14 and not 20?

The grammar’s structure encodes operator precedence. Multiplication is “deeper” in the tree (it appears in Term), so it is evaluated first. The grammar forces the interpretation 2 + (3 * 4) and not (2 + 3) * 4.

This is the magic of CFGs: the grammatical structure captures the semantics.
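
This claim can be checked in a few lines of code. Below is a minimal recursive-descent evaluator, a sketch in Python with one function per non-terminal; the tokenizer and the function names are illustrative choices, and the left-recursive rules are rewritten as loops (a standard adaptation for top-down parsing that keeps the precedence they encode).

def tokenize(text):
    # split "2 + 3 * 4" or "(1 + 2) * 3" into a list of tokens
    return text.replace("(", " ( ").replace(")", " ) ").split()

def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():                  # Expr → Expr + Term | Term      (as a loop)
        value = term()
        while peek() == "+":
            eat()
            value += term()
        return value

    def term():                  # Term → Term * Factor | Factor  (as a loop)
        value = factor()
        while peek() == "*":
            eat()
            value *= factor()
        return value

    def factor():                # Factor → ( Expr ) | number
        if peek() == "(":
            eat()                # consume "("
            value = expr()
            eat()                # consume ")"
            return value
        return int(eat())

    return expr()

print(evaluate("2 + 3 * 4"))     # 14, not 20
print(evaluate("(1 + 2) * 3"))   # 9

Because factor() is called from term(), and term() from expr(), multiplication always sits deeper in the call tree than addition, exactly as it sits deeper in the derivation tree.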

The Derivation Process

A derivation is a sequence of replacements starting from the start symbol and resulting in a string of terminals (the final text). It is the core of how a grammar works.

Analogy: Derivation as a Cooking Recipe

Imagine a recipe: “Dessert → Pie | Cake”, “Pie → Crust + Filling”, etc.

Derivation is following the steps: starting from “Dessert”, replacing with “Pie”, then “Pie” with “Crust + Filling”, until you get the concrete ingredients (terminals).

Types of Derivation:

  • Leftmost derivation: at each step, the leftmost non-terminal is replaced
  • Rightmost derivation: at each step, the rightmost non-terminal is replaced

Both strategies produce the same final string (and, for an unambiguous grammar, the same derivation tree); only the order of the intermediate steps differs. The distinction matters because different types of parsers use one strategy or the other:

Parser Type    | Derivation Used        | Example Tools
LL (top-down)  | Leftmost               | ANTLR (ANother Tool for Language Recognition), recursive descent parsers
LR (bottom-up) | Rightmost (in reverse) | Bison, Yacc (Yet Another Compiler-Compiler)
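
As a rough sketch (illustrative, not tied to any particular parser generator), the following function applies one rule to either the leftmost or the rightmost non-terminal of a sentential form, using the same dictionary encoding of the arithmetic rules as earlier:

RULES = {
    "Expr":   [["Expr", "+", "Term"], ["Term"]],
    "Term":   [["Term", "*", "Factor"], ["Factor"]],
    "Factor": [["(", "Expr", ")"], ["number"]],
}

def step(form, alternative, leftmost=True):
    # positions of all non-terminals still present in the sentential form
    positions = [i for i, sym in enumerate(form) if sym in RULES]
    i = positions[0] if leftmost else positions[-1]
    # replace that non-terminal by the chosen alternative of its rule
    return form[:i] + RULES[form[i]][alternative] + form[i + 1:]

form = ["Expr", "+", "Term"]
print(step(form, 1, leftmost=True))    # ['Term', '+', 'Term']                  (Expr → Term)
print(step(form, 0, leftmost=False))   # ['Expr', '+', 'Term', '*', 'Factor']   (Term → Term * Factor)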

Derivation Tree:

Each derivation can be represented as a tree, where:

  • The root is the start symbol
  • Internal nodes are non-terminals
  • Leaves are terminals
  • Each internal node, together with its children, represents the application of one rule

 

           Expr
          /  |  \
      Expr   +   Term
        |       /  |  \
      Term   Term  *  Factor
        |      |         |
     Factor  Factor      4
        |      |
        2      3

 

Example 2: A Simplified Musical Grammar

Let’s imagine a minimalist musical notation:

 

V = { Phrase, Motif, Note }
Σ = { do, ré, mi, fa, sol, -, (, ) }
S = Phrase
R = {
    Phrase → Motif | Phrase Motif
    Motif  → Note | ( Phrase )
    Note   → do | ré | mi | fa | sol | -
}

 

This grammar allows generating:

  • do ré mi (three notes)
  • do - ré (note, rest, note)
  • ( do ré ) ( mi fa ) (two grouped motifs)
  • ( do ( ré mi ) fa ) (nesting)

Derivation of do ( ré mi ) fa:

 

Phrase
→ Phrase Motif                       -- Phrase → Phrase Motif (add a motif)
→ Phrase Motif Motif                 -- Phrase → Phrase Motif (add another)
→ Phrase Motif fa                    -- Motif → Note → fa
→ Phrase ( Phrase ) fa               -- Motif → ( Phrase )
→ Phrase ( Phrase Motif ) fa         -- Phrase → Phrase Motif
→ Phrase ( Motif Motif ) fa          -- Phrase → Motif
→ Phrase ( ré Motif ) fa             -- Motif → Note → ré
→ Phrase ( ré mi ) fa                -- Motif → Note → mi
→ Motif ( ré mi ) fa                 -- Phrase → Motif
→ do ( ré mi ) fa                    -- Motif → Note → do
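
To check that these rules really do generate such phrases, here is a small sketch (in Python; the rule encoding, the depth cutoff and the function name are illustrative choices) that expands the start symbol by picking rule alternatives at random:

import random

MUSIC_RULES = {
    "Phrase": [["Motif"], ["Phrase", "Motif"]],
    "Motif":  [["Note"], ["(", "Phrase", ")"]],
    "Note":   [["do"], ["ré"], ["mi"], ["fa"], ["sol"], ["-"]],
}

def generate(symbol="Phrase", depth=0):
    if symbol not in MUSIC_RULES:              # terminal: emit it as-is
        return [symbol]
    alternatives = MUSIC_RULES[symbol]
    # past a small depth, take the shortest alternative to force termination
    alternative = (min(alternatives, key=len) if depth > 6
                   else random.choice(alternatives))
    result = []
    for sym in alternative:
        result.extend(generate(sym, depth + 1))
    return result

print(" ".join(generate()))    # e.g. "do ( ré mi ) fa" or "sol - do"

Every phrase printed this way is, by construction, derivable from the grammar.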

 

The Problem of Ambiguity

A grammar is ambiguous if the same string can be derived in several different ways (multiple derivation trees).

Example of an ambiguous grammar:

 

Expr → Expr + Expr | Expr * Expr | number

 

For 2 + 3 * 4, two trees are possible:

 

    Expr                    Expr
   / | \                   / | \
Expr + Expr            Expr * Expr
 |    / | \           / | \    |
 2  Expr * Expr     Expr + Expr 4
      |     |         |     |
      3     4         2     3

 

The first gives 2 + (3 * 4) = 14, the second gives (2 + 3) * 4 = 20.
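
The difference between the two trees can be made concrete by writing each one as nested data and evaluating it bottom-up (a small sketch; the tuple encoding is an illustrative choice):

# each internal node is (operator, left subtree, right subtree); leaves are numbers
tree_a = ("+", 2, ("*", 3, 4))     # first tree:  2 + (3 * 4)
tree_b = ("*", ("+", 2, 3), 4)     # second tree: (2 + 3) * 4

def evaluate(node):
    if isinstance(node, int):      # a leaf: just a number
        return node
    op, left, right = node
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r

print(evaluate(tree_a))   # 14
print(evaluate(tree_b))   # 20

Same string, two trees, two results.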

Why is this serious?

Ambiguity means that the parser doesn’t know which interpretation to choose. For a programming language, this is unacceptable: the code must have a unique meaning.

Solutions:

  1. Rewrite the grammar to eliminate ambiguity (this is what we did in our first example with Term and Factor)
  2. Disambiguation rules: operator precedence and associativity
  3. Accept ambiguity and resolve it semantically (rare)

Properties of CFGs

What CFGs can describe:

  • Nested structures (parentheses, HTML/XML tags)
  • Recursion (definitions that refer to themselves)
  • The syntax of most programming languages

What is recursion?

A definition is recursive when it refers to itself. For example:

Expr → Expr + Term

An Expr contains another Expr! It’s like Russian nesting dolls, or mirrors facing each other.

Recursion allows expressing structures of arbitrary depth: 1+2+3+4+... with a single rule.
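
For instance, the single rule Expr → Expr + Term | Term is enough to derive 1 + 2 + 3, and, by repeating the first step, a sum of any length:

Expr
→ Expr + Term            (rule 1)
→ Expr + Term + Term     (rule 1, applied to the inner Expr)
→ Term + Term + Term     (rule 1, "Term" case)
→ 1 + 2 + 3              (Term → Factor → number, three times)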

What CFGs CANNOT describe:

Limitation               | Formal Notation | Concrete Example
Triple counting equality | aⁿbⁿcⁿ          | “as many a’s, then b’s, then c’s”
Declaration before use   | -               | Checking that a variable is declared before being used
Long-distance agreements | -               | Subject-verb agreement with intervening words

Reading aⁿbⁿcⁿ

This mathematical notation means: “n repetitions of ‘a’, followed by n repetitions of ‘b’, followed by n repetitions of ‘c’”.

Valid examples: abc (n=1), aabbcc (n=2), aaabbbccc (n=3)

A CFG cannot guarantee that the three groups have exactly the same number of elements.

These limitations motivate the existence of context-sensitive grammars (Type 1 of the Chomsky hierarchy — see L1) and additional semantic analyses (concerning meaning) performed after parsing (syntactic analysis).

How CFGs Apply to BP3

The syntax of BP3, the musical grammar language, is itself described by a CFG. Here is a simplified excerpt:

 

bp_file          → header* grammar_section+
grammar_section  → mode_line rule_line+
rule_line        → weight? lhs "-->" rhs
lhs              → nonterminal+
rhs              → element*
element          → note | rest | nonterminal | polymetric | ...

 

A BP3 rule like:

 

<50> S --> A B C

 

Breaks down into:

  • weight: <50>
  • lhs: S
  • rhs: A B C

The parser of the BP2SC transpiler (BP3-to-SuperCollider, see B7) uses this structure to build an AST (Abstract Syntax Tree — see L4), which will then be translated into SuperCollider code.
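
As an illustration only (this is not the actual BP2SC parser, which handles many more constructs), a rule line following the simplified excerpt above can be split with a few lines of Python; the regular expression and the field names are assumptions made for this sketch:

import re

RULE_LINE = re.compile(r"^\s*(?:<(?P<weight>\d+)>\s*)?"       # optional <weight>
                       r"(?P<lhs>.+?)\s*-->\s*(?P<rhs>.*)$")  # lhs --> rhs

def split_rule(line):
    m = RULE_LINE.match(line)
    if m is None:
        raise ValueError(f"not a rule line: {line!r}")
    return {
        "weight": int(m.group("weight")) if m.group("weight") else None,
        "lhs": m.group("lhs").split(),
        "rhs": m.group("rhs").split(),
    }

print(split_rule("<50> S --> A B C"))
# {'weight': 50, 'lhs': ['S'], 'rhs': ['A', 'B', 'C']}

A real parser would go further and turn these pieces into AST nodes (see L4) rather than plain lists.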


Key Takeaways

  1. G = (V, Σ, R, S): Variables, Terminals, Rules, Start Symbol. These four elements completely define a grammar.
  2. “Context-free” means that rules apply independently of the surrounding context. This is what makes parsing efficient.
  3. Derivation is the process of successively replacing variables until a string of terminals is obtained.
  4. The derivation tree captures the hierarchical structure of the sentence, which determines its interpretation.
  5. Ambiguity is the enemy: an ambiguous grammar produces multiple possible interpretations. It is resolved by careful rule design.
  6. CFGs are at the heart of compilation: the syntax of Python, Java, JSON, HTML, and BP3 is described by CFGs.

To Go Further

  • Reference Book: Aho, Lam, Sethi & Ullman, Compilers: Principles, Techniques, and Tools (the “Dragon Book”)
  • Online Course: Coursera, Automata Theory by Jeffrey Ullman (Stanford)
  • Practical Tool: JFLAP for visualizing CFGs and their derivations
  • Formal Standard: ISO 14977 for EBNF notation

Glossary

Term              | Definition
Ambiguity         | A property of a grammar that allows multiple derivation trees for the same string. An ambiguous grammar is problematic because the parser does not know which interpretation to choose.
Derivation Tree   | A graphical representation of a derivation in the form of a tree. The root is the start symbol, the leaves are the terminals, and each internal node represents the application of a rule.
CFG               | Context-Free Grammar. A grammar where each rule replaces a single non-terminal, independently of its context. Type 2 in the Chomsky hierarchy.
Derivation        | A sequence of rule applications transforming the start symbol into a string of terminals. The central process of how a grammar works.
Chomsky Hierarchy | A classification of formal grammars into 4 types (0 to 3) according to their expressive power, proposed by Noam Chomsky in 1956. CFGs are Type 2.
Non-terminal      | An abstract symbol (variable) that will be replaced by other symbols according to the rules. Generally denoted in uppercase or within brackets.
Parser            | A program that analyzes a string of characters according to a grammar and builds its structure (derivation tree or AST). Synonym: syntax analyzer.
Production        | Synonym for “rewrite rule”. Defines how a non-terminal can be replaced.
Recursion         | A property of a rule where a non-terminal appears in its own definition (e.g., Expr → Expr + Term). Allows expressing structures of arbitrary depth.
Σ (Sigma)         | Greek letter traditionally used to denote the alphabet (set of terminal symbols) of a grammar.
Terminal          | A concrete symbol of the final language, which cannot be replaced. These are the “words” of the vocabulary: operators, keywords, literal values.

Prerequisite: L1
Reading time: 12 min
Tags: #grammars #cfg #context-free #parsing #compilation


Next article: L3 — EBNF: The Language for Describing Languages