B2) Alphabets, Terminals, and Non-Terminals
The Vocabulary of Grammars
When we talk about “formal grammar,” what exactly are we referring to? It all starts with three fundamental sets.
Where This Article Fits In
This article details the basic vocabulary of context-free grammars L2 applied to BP3. What are the terminals (notes, rests) and non-terminals (structures) of the I2? Understanding this vocabulary is essential for reading the B3.
Why Is This Important?
Before writing a single grammar rule, you must define what you are working with. It’s like a board game: before playing, you need to know the available pieces and their roles.
In a formal grammar, three types of objects interact:
- Terminal symbols: the basic building blocks that appear in the final result
- Non-terminal symbols: intermediate variables that structure the generation
- The alphabet: the set of all available symbols
Understanding this distinction is essential for reading, writing, and debugging grammars. Without it, one confuses the means (non-terminals) and the ends (terminals).
The Idea in One Sentence
The alphabet is the set of basic symbols; terminals are what is ultimately produced; non-terminals are intermediate labels that are replaced until only terminals remain.
Let’s Explain Step by Step
Terminal Symbols: The Final Product
Terminals are the symbols that appear in the final output of the grammar. They cannot be “rewritten” anymore: they are there to stay.
Examples by Domain:
| Domain | Terminals |
|---|---|
| Natural language | words: “cat”, “eats”, “the” |
| Arithmetic | digits and operators: 0, 1, +, *, (, ) |
| Programming | tokens: if, while, =, identifiers, numbers |
| Music (Western) | notes and rests: C, D, E, - |
| Music (tabla) | bols (percussion syllables): dha, dhin, tin, ta |
Conventional Notation:
- Lowercase letters:
a,b,c - Or strings in quotes:
"if","while" - Or special symbols:
+,*,(
Non-Terminal Symbols: Working Variables
Non-terminals are intermediate symbols. They represent “categories” or “concepts” that will eventually be replaced by terminals (or other non-terminals, which themselves will be replaced).
Examples:
| Domain | Non-terminals |
|---|---|
| Natural language | S (sentence), NP (noun phrase), VP (verb phrase) |
| Arithmetic | E (expression), T (term), F (factor) |
| Programming | Statement, Expression, Block |
| Music | Phrase, Motif, Cadence |
Conventional Notation:
- Uppercase letters:
S,A,B,NP - Or whole words with initial capital:
Phrase,Expression - In BP3: between vertical bars
|S|,|A|
The Alphabet: All Available Vocabulary
The alphabet is the set of all symbols that the grammar can manipulate.
Sidebar: Mathematical Notations
In formal language theory, standard mathematical notations are used:
- Σ (uppercase sigma): often denotes the alphabet of terminals
- V: denotes the complete alphabet (terminals + non-terminals)
- ∪ (union): the operation that combines two sets. If A = {1, 2} and B = {3, 4}, then A ∪ B = {1, 2, 3, 4}
- ∩ (intersection): the elements common to two sets
- ∅ (empty set): the set that contains no elements
- Two sets are disjoint if their intersection is empty: T ∩ N = ∅
The complete alphabet V is the union of terminals (T) and non-terminals (N):
V = T ∪ N
Concrete example:
- T = {a, b, c} (the terminals)
- N = {S, A, B} (the non-terminals)
- V = {a, b, c, S, A, B} (the complete alphabet)
Important constraint:
- T ∩ N = ∅ : no symbol is both terminal AND non-terminal
- This allows one to always know if a symbol still needs to be replaced (non-terminal) or if it is final (terminal)
Example 1: A Simple Arithmetic Grammar
Let’s define a grammar for simple arithmetic expressions like 2+3*4.
Terminals (T) — the symbols that appear in the final result:
T = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, *, (, ) }
|___________________________| |__| |____|
the digits operators parentheses
Non-terminals (N) — the intermediate syntactic categories:
N = { E, T, F }
Meaning:
- E = Expression (a complete expression, e.g.: "2+3*4")
- T = Term (a product, e.g.: "3*4")
- F = Factor (a basic element, e.g.: "4" or "(2+3)")
Complete alphabet:
V = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, *, (, ), E, T, F }
|_________________________________________| |_____|
terminals non-terminals
Rules:
E → E + T | T # An expression is either E+T, or just T
T → T * F | F # A term is either T*F, or just F
F → ( E ) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 # A factor is (E) or a digit
Sidebar: Reading the
|NotationThe vertical bar
|means “or”. ThusE → E + T | Tis a shortcut for two rules:
E → E + TE → T
Derivation of 2+3*4:
E
→ E + T (E → E + T)
→ T + T (E → T)
→ F + T (T → F)
→ 2 + T (F → 2)
→ 2 + T * F (T → T * F)
→ 2 + F * F (T → F)
→ 2 + 3 * F (F → 3)
→ 2 + 3 * 4 (F → 4)
The final result 2+3*4 contains only terminals. All non-terminals (E, T, F) have been replaced.
Example 2: BP3 Notation
BP3 uses a clear visual convention to distinguish terminals and non-terminals.
BP3 Non-terminals: between vertical bars
|S| -- start symbol
|A| -- any non-terminal
|Phrase| -- descriptive name
BP3 Terminals:
- Musical notes:
C4,D4,C4,D#5 - Tabla bols:
dha,dhin,tin,ta,tirakita(symbols defined in the alphabet file) - Determined rests:
-(fixed duration, one unit of time) - Undetermined rests:
_(unspecified duration, calculated by context) - Ellipses:
...(prolonged rest of undetermined duration,UndeterminedRestnode) - Symbols defined in the external alphabet (
-al.file)
Sidebar: Three Types of Rest in BP3
BP3 distinguishes three ways of “playing nothing”:
Notation Type Duration Usage -Determined Rest Fixed (1 unit) Explicit pause between notes _Undetermined Rest Calculated Flexible temporal filling ...UndeterminedRest Indefinite Prolonged rest, waiting >
The determined rest (-) is the most common: it creates a pause of a fixed duration, like a quarter rest in musical notation. The undetermined rest (_) lets the system calculate the duration based on the temporal context (useful in polymetric expressions). The ellipsis (...) represents a rest of truly indefinite duration — think of it as a fermata (hold) that suspends time.
BP3 Grammar Example:
-gr.MiniMelody
gram#1[1] |S| → |A| |B|
gram#1[2] |A| → C4 D4 E4
gram#1[3] |A| → E4 D4 C4
gram#1[4] |B| → - F4 G4
Identification:
- Non-terminals:
|S|,|A|,|B| - Terminals:
C4,D4,E4,F4,G4,-
A Possible Derivation:
|S|
→ |A| |B| (rule 1)
→ C4 D4 E4 |B| (rule 2)
→ C4 D4 E4 - F4 G4 (rule 4)
Result: a sequence of six elements, all terminals.
The Original Alphabet — a Tabla Bols File:
Historically, BP’s first alphabet was not C4, D4, E4 but Indian percussion syllables. Here’s what a tabla alphabet file (-al.Tabla) looks like:
-al.Tabla
dha dhin tin ta tirakita ga ghe ki na tun ra
Each symbol is a bol — a specific tabla stroke with its own timbre:
- Resonant bols (struck with the palm on the bāyāṅ, the bass drum):
dha,dhin,ghe,ga - Dry bols (struck with fingertips on the dāyāṅ, the treble drum):
tin,ta,ki,na,tun,ra - Combined bols (fast sequences):
tirakita(= ti-ra-ki-ta, four accelerated strokes)
A tabla grammar would use these bols exactly as our melodic grammar uses C4 and D4:
-gr.Tintal_Theka
gram#1[1] |S| → |Vibhag1| |Vibhag2| |Vibhag3| |Vibhag4|
gram#1[2] |Vibhag1| → dha dhin dhin dha
gram#1[3] |Vibhag2| → dha dhin dhin dha
gram#1[4] |Vibhag3| → dha tin tin ta
gram#1[5] |Vibhag4| → ta dhin dhin dha
This grammar generates the theka (basic pattern) of tintāl, the most common Indian rhythmic cycle (16 beats in 4 groups of 4). The non-terminals |Vibhag1|…|Vibhag4| structure the four sections of the cycle; the terminals dha, dhin, tin, ta are the bols played.
Three Types of Musical Terminals:
| System | Notation | BP3 Terminals | Alphabet Example (-al.*) |
|---|---|---|---|
| French | do, ré, mi, fa, sol, la, si | C4, D4, E4… |
-al.Mohanam : C4 D4 E4 G4 A4 |
| Anglo-Saxon | C, D, E, F, G, A, B | C4, D#5, Bb3… |
-al.Jazz : C4 E4 G4 Bb4 |
| Indian (tabla) | dha, dhin, tin, ta… | dha, dhin, tin… |
-al.Tabla : dha dhin tin ta tirakita |
| Indian (sargam) | Sa, Re, Ga, Ma, Pa, Dha, Ni | Sa, Re, Ga… |
-al.Raga : Sa Re Ga Ma Pa Dha Ni |
The three notation systems are strictly equivalent from a formal point of view: they are all finite alphabets of terminal symbols. BP3 is cross-cultural by design — it was conceived from the outset to handle tabla bols as well as Western notes [Bel1998].
The |x| Notation for Length
In formal language theory, the notation |x| (with a single lowercase symbol) denotes the length of a string.
|abc| = 3 (three symbols)
|ε| = 0 (ε = empty string, zero symbols)
|aabb| = 4
Caution: Do not confuse with BP3!
|S|in BP3 = the non-terminal S|w|in theory = the length of string w
Context generally helps resolve ambiguity.
Notation Conventions Recapitulated
Sidebar: Greek Letters in Language Theory
Theoretical computer scientists often use Greek letters by convention:
- ε (epsilon — empty string): the string of zero length (zero characters)
- α, β, γ (alpha, beta, gamma — arbitrary strings): sequences mixing terminals and non-terminals
- Σ (uppercase sigma — terminal alphabet): the set of terminal symbols
- λ (lambda — empty string): sometimes used as an alternative to ε
| Element | Standard Notation | BP3 Notation | Explanation |
|---|---|---|---|
| Terminal | a, b, +, "word" |
C4, -, alphabet symbols |
The final basic building blocks |
| Non-terminal | S, A, NP, Expression |
|S|, |A|, |Phrase| | The intermediate variables |
| String of terminals | w, x, y |
— | A sequence of final symbols |
| Mixed string | α, β, γ |
— | Can contain terminals AND non-terminals |
| Empty string | ε or λ |
_epsilon() or lambda |
The string of zero length |
| Length | |w| | — | Number of symbols in w |
| Terminal alphabet | Σ or T |
-al.* file |
The set of terminals |
| Non-terminal alphabet | N or V_N |
|…| symbols in rules | The set of non-terminals |
Why is the Distinction Important?
1. To Know When You Are Done
A derivation is complete when no non-terminals remain. If you still have an |A| in your output, it is not a valid result.
2. To Define the Generated Language
The language of a grammar G is the set of all strings of terminals that can be derived from the start symbol.
Sidebar: Decoding Mathematical Notation
The formula
L(G) = { w in T* | S =>* w }reads:
L(G): the language (L) of grammar GT*: the set of ALL possible strings formed with terminals, including the empty string. The asterisk (*) means “zero or more repetitions”.- If T = {a, b}, then T* = {ε, a, b, aa, ab, ba, bb, aaa, …}
S =>* w: S derives into w in zero or more steps. The*after => means “zero or more rule applications”.{ ... | ... }: set notation.{ x | condition }means “the set of all x that satisfy the condition”.So in English: “L(G) is the set of all strings w made solely of terminals, such that one can go from S to w by applying rules.”
3. To Implement a Parser
A parser (syntactic analyzer) is a program that takes a string of symbols as input and determines its grammatical structure. For example, a parser for arithmetic expressions takes “2 + 3 * 4” and builds a tree showing that multiplication has precedence over addition.
A token is a basic lexical unit recognized by the parser: a word, a number, an operator, etc. Tokens are the terminals of the grammar.
The parser must distinguish input tokens (terminals) from the syntactic categories it constructs (non-terminals). Confusing the two leads to subtle bugs.
4. For BP3: Distinguishing What Plays from What Structures
In BP3, terminals produce sound (notes, rests). Non-terminals organize the structure but produce nothing directly. This distinction is fundamental for the BP2SC transpiler (source-to-source converter) B7, which must generate playable I3 code.
Key Takeaways
- Terminals are the final symbols that appear in the output. In music: notes. In programming: tokens.
- Non-terminals are intermediate variables that structure the generation. They are all replaced before the end.
- The alphabet is the set of all symbols (terminals + non-terminals).
- BP3 uses the
|symbol|convention for non-terminals, which makes them visually distinct. - The
|x|notation in formal theory denotes length, not a non-terminal.
Further Reading
- Reference Book: Hopcroft, Motwani & Ullman, Introduction to Automata Theory, Languages, and Computation, Chapter 5
- BP3 Documentation: Bol Processor – Pattern Grammars
- Interactive Tutorial: CFG Developer
- The Genesis of the Bol Processor: Kippen, J. & Bel, B. (1989), “The Identification and Modelling of a Percussion ‘Language'” — how bols became a formal alphabet
- The Tabla of Lucknow: Kippen, J. (1988), The Tabla of Lucknow: A Cultural Analysis of a Musical Tradition, Cambridge University Press — the ethnomusicological context in which BP was created
- BP Overview: Bel, B. (1998), “Migrating Musical Concepts — An Overview of the Bol Processor”, Computer Music Journal 22(2) — multi-notation and cross-cultural migration
Glossary
- Alphabet (V or Σ): Finite set of symbols usable in a grammar. Σ often denotes terminals alone, V the complete set.
- Bol (बोल): Mnemonic syllable of the tabla (e.g.,
dha,dhin,tin,ta). The “Bol” in “Bol Processor”. Each bol corresponds to a specific stroke producing a distinct timbre. - String (or word): Finite sequence of symbols. Example: “abc” is a string of 3 symbols.
- Empty string (ε): The string of zero length, containing no symbols. Noted ε (epsilon) or λ (lambda).
- Derivation: Sequence of non-terminal replacements until only terminals are obtained.
- Disjoint: Two sets are disjoint if they have no elements in common.
- Kleene star (): Operation on a set that produces all possible strings formed with its elements, including the empty string. If T = {a}, then T = {ε, a, aa, aaa, …}.
- Language: Set of all terminal strings derivable from the start symbol.
- Non-terminal: Intermediate symbol (variable) that will be replaced during derivation. By convention, in uppercase.
- Parser (syntactic analyzer): Program that analyzes a string to determine its grammatical structure.
- Determined Rest: Rest of fixed duration, noted
-in BP3. Corresponds to an explicit pause of one unit of time. - Undetermined Rest: Rest of duration calculated by context, noted
_in BP3. Used in polymetric expressions. - Prolonged Rest (UndeterminedRest): Rest of indefinite duration, noted
...in BP3. Equivalent to a fermata (hold). - Start Symbol: The initial non-terminal from which all derivations begin (often S for “Sentence” or “Start”).
- Theka: Basic pattern of a tāla (Indian rhythmic cycle), played by the tabla in ostinato. The tintāl theka, for example, is a fixed sequence of 16 bols.
- Tāla: Rhythmic cycle structuring Indian classical music (e.g., tintāl = 16 beats, jhaptāl = 10 beats).
- Terminal: Symbol that appears in the final output and cannot be rewritten. By convention, in lowercase.
- Token: Basic lexical unit (word, number, operator) recognized by a parser. Corresponds to the grammar’s terminals.
- Union (∪): Set operation that combines two sets. A ∪ B contains all elements of A and all elements of B.
- Vibhāg: Section of a tāla. The tintāl has 4 vibhāg of 4 beats each.
Prerequisites: L2 (Context-Free Grammars), I2 (Bol Processor)
Reading time: 6 min
Tags: #grammars #alphabets #terminals #non-terminals #BP3 #formalism