B2) Alphabets, Terminals, and Non-terminals
The Vocabulary of Grammars
When we talk about “formal grammar,” what exactly are we talking about? It all starts with three fundamental sets.
Where Does This Article Fit In?
This article details the basic vocabulary of context-free grammars L2 applied to BP3. What are the terminals (notes, rests) and non-terminals (structures) of I2? Understanding this vocabulary is essential for reading B3.
Why Is It Important?
Before writing a single grammar rule, you must define what you are working with. It is like a board game: before playing, you must know the available pieces and their roles.
In a formal grammar, three types of objects interact:
- Terminal symbols: the basic building blocks that appear in the final result
- Non-terminal symbols: the intermediate variables that structure the generation
- The alphabet: the set of all available symbols
Understanding this distinction is essential for reading, writing, and debugging grammars. Without it, we confuse the means (non-terminals) and the ends (terminals).
The Idea in One Sentence
The alphabet is the set of basic symbols; terminals are what we produce in the end; non-terminals are the intermediate labels that we replace until only terminals remain.
Let’s Explain Step by Step
Terminal Symbols: The Final Product
Terminals are the symbols that appear in the final output of the grammar. They can no longer be “rewritten”: they are there to stay.
Examples by domain:
| Domain | Terminals |
|---|---|
| Natural language | words: “cat”, “eats”, “the” |
| Arithmetic | digits and operators: 0, 1, +, *, (, ) |
| Programming | tokens: if, while, =, identifiers, numbers |
| Music (Western) | notes and rests: do, re, mi, - |
| Music (tabla) | bols (percussion syllables): dha, dhin, tin, ta |
Conventional notation:
- Lowercase letters:
a,b,c - Or strings in quotes:
"if","while" - Or special symbols:
+,*,(
Non-terminal Symbols: Working Variables
Non-terminals are intermediate symbols. They represent “categories” or “concepts” that will eventually be replaced by terminals (or other non-terminals, which themselves will be replaced).
Examples:
| Domain | Non-terminals |
|---|---|
| Natural language | S (sentence), NP (noun phrase), VP (verb phrase) |
| Arithmetic | E (expression), T (term), F (factor) |
| Programming | Statement, Expression, Block |
| Music | Phrase, Motif, Cadence |
Conventional notation:
- Uppercase letters:
S,A,B,NP - Or whole words with a capital letter:
Phrase,Expression - In BP3: between vertical bars
|S|,|A|
The Alphabet: All Available Vocabulary
The alphabet is the set of all symbols that the grammar can manipulate.
Sidebar: Mathematical Notations
In formal language theory, standard mathematical notations are used:
- Σ (capital sigma): often denotes the terminal alphabet
- V: denotes the complete alphabet (terminals + non-terminals)
- ∪ (union): the operation that combines two sets. If A = {1, 2} and B = {3, 4}, then A ∪ B = {1, 2, 3, 4}
- ∩ (intersection): the common elements of two sets
- ∅ (empty set): the set that contains no elements
- Two sets are disjoint if their intersection is empty: T ∩ N = ∅
The complete alphabet V is the union of terminals (T) and non-terminals (N):
V = T ∪ N
Concrete example:
- T = {a, b, c} (the terminals)
- N = {S, A, B} (the non-terminals)
- V = {a, b, c, S, A, B} (the complete alphabet)
Important constraint:
- T ∩ N = ∅: no symbol is both terminal AND non-terminal
- This allows one to always know if a symbol still needs to be replaced (non-terminal) or if it is final (terminal)
Example 1: A Simple Arithmetic Grammar
Let’s define a grammar for simple arithmetic expressions like 2+3*4.
Terminals (T) — the symbols that appear in the final result:
T = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, *, (, ) }
|___________________________| |__| |____|
digits operators parentheses
Non-terminals (N) — intermediate syntactic categories:
N = { E, T, F }
Meaning:
- E = Expression (a complete expression, e.g.: "2+3*4")
- T = Term (a product, e.g.: "3*4")
- F = Factor (a basic element, e.g.: "4" or "(2+3)")
Complete alphabet:
V = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, *, (, ), E, T, F }
|_________________________________________| |_____|
terminals non-terminals
Rules:
E → E + T | T # An expression is either E+T, or just T
T → T * F | F # A term is either T*F, or just F
F → ( E ) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 # A factor is (E) or a digit
Sidebar: Reading the
|NotationThe vertical bar
|means “or”. ThusE → E + T | Tis a shortcut for two rules:
E → E + TE → T
Derivation of 2+3*4:
E
→ E + T (E → E + T)
→ T + T (E → T)
→ F + T (T → F)
→ 2 + T (F → 2)
→ 2 + T * F (T → T * F)
→ 2 + F * F (T → F)
→ 2 + 3 * F (F → 3)
→ 2 + 3 * 4 (F → 4)
The final result 2+3*4 contains only terminals. All non-terminals (E, T, F) have been replaced.
Example 2: BP3 Notation
BP3 uses a clear visual convention to distinguish terminals and non-terminals.
BP3 Non-terminals: between vertical bars
|S| -- start symbol
|A| -- any non-terminal
|Phrase| -- descriptive name
BP3 Terminals:
- Musical notes:
do3,re3,C4,D#5 - Tabla bols:
dha,dhin,tin,ta,tirakita(symbols defined in the alphabet file) - Rests:
-(absence of event, occupies time) - Prolongations:
_(extends the previous event — no new attack, the sound continues) - Indeterminate rests:
...(duration calculated by the PolyMake engine to balance polymetric voices) - Symbols defined in the external alphabet (
-al.file)
Sidebar: Three Types of Silence in BP3
BP3 distinguishes three ways of “playing nothing”:
Notation Type Duration Usage -Rest Fixed (1 unit) Absence of event, occupies time _Prolongation Extends the previous one Sound continues, no new attack ...Indeterminate rest Calculated by PolyMake Optimal duration to balance polymetric voices >
The rest (-) creates a pause of a fixed duration, like a quarter rest in musical notation. The prolongation (_) extends the previous event without a new attack — the sound continues. The indeterminate rest (...) lets the PolyMake engine (B13) calculate the duration so that polymetric voices are balanced.
Example of a BP3 grammar:
-gr.MiniMelody
gram#1[1] |S| → |A| |B|
gram#1[2] |A| → do3 re3 mi3
gram#1[3] |A| → mi3 re3 do3
gram#1[4] |B| → - fa3 sol3
Identification:
- Non-terminals:
|S|,|A|,|B| - Terminals:
do3,re3,mi3,fa3,sol3,-
A possible derivation:
|S|
→ |A| |B| (rule 1)
→ do3 re3 mi3 |B| (rule 2)
→ do3 re3 mi3 - fa3 sol3 (rule 4)
Result: a sequence of six elements, all terminals.
The original alphabet — a tabla bol file:
Historically, the first BP alphabet was not do3, re3, mi3 but Indian percussion syllables. Here is what a tabla alphabet file (-al.Tabla) looks like:
-al.Tabla
dha dhin tin ta tirakita ga ghe ki na tun ra
Each symbol is a bol — a specific stroke on the tabla with its own timbre:
- Resonant bols (struck with the palm on the bāyāṅ, the bass drum):
dha,dhin,ghe,ga - Dry bols (struck with the fingertips on the dāyāṅ, the treble drum):
tin,ta,ki,na,tun,ra - Combined bols (fast sequences):
tirakita(= ti-ra-ki-ta, four strokes in acceleration)
A tabla grammar would use these bols exactly as our melodic grammar uses do3 and re3:
-gr.Tintal_Theka
gram#1[1] |S| → |Vibhag1| |Vibhag2| |Vibhag3| |Vibhag4|
gram#1[2] |Vibhag1| → dha dhin dhin dha
gram#1[3] |Vibhag2| → dha dhin dhin dha
gram#1[4] |Vibhag3| → dha tin tin ta
gram#1[5] |Vibhag4| → ta dhin dhin dha
This grammar generates the theka (basic pattern) of tintāl, the most common Indian rhythmic cycle (16 beats in 4 groups of 4). The non-terminals |Vibhag1|…|Vibhag4| structure the four sections of the cycle; the terminals dha, dhin, tin, ta are the bols played.
Three types of musical terminals:
| System | Notation | BP3 Terminals | Alphabet Example (-al.*) |
|---|---|---|---|
| French | do, re, mi, fa, sol, la, si | do3, re3, mi3… |
Display convention — -al. files use internal Anglo-Saxon notation |
| Anglo-Saxon | C, D, E, F, G, A, B | C4, D#5, Bb3… |
-al.Jazz: C4 E4 G4 Bb4 |
| Indian (tabla) | dha, dhin, tin, ta… | dha, dhin, tin… |
-al.Tabla: dha dhin tin ta tirakita |
| Indian (sargam) | Sa, Re, Ga, Ma, Pa, Dha, Ni | Sa, Re, Ga… |
-al.Raga: Sa Re Ga Ma Pa Dha Ni |
The three notation systems are strictly equivalent from a formal point of view: they are all finite alphabets of terminal symbols. BP3 is cross-cultural by design — it was designed from the outset to handle tabla bols as well as Western notes [Bel1998].
The |x| Notation for Length
In formal language theory, the notation |x| (with a single lowercase symbol) denotes the length of a string.
|abc| = 3 (three symbols)
|ε| = 0 (ε = empty string, zero symbols)
|aabb| = 4
Be careful not to confuse this with BP3!
|S|in BP3 = the non-terminal S|w|in theory = the length of the string w
Context usually allows for resolving the ambiguity.
Summary of Notation Conventions
Sidebar: Greek Letters in Language Theory
Theoretical computer scientists often use Greek letters by convention:
- ε (epsilon — empty string): the string of length zero (zero characters)
- α, β, γ (alpha, beta, gamma — arbitrary strings): sequences mixing terminals and non-terminals
- Σ (capital sigma — terminal alphabet): the set of terminal symbols
- λ (lambda — empty string): sometimes used as an alternative to ε
| Element | Standard Notation | BP3 Notation | Explanation |
|---|---|---|---|
| Terminal | a, b, +, "word" |
do3, -, alphabet symbols |
The final building blocks |
| Non-terminal | S, A, NP, Expression |
\ | S\ |
| String of terminals | w, x, y |
— | A sequence of final symbols |
| Mixed string | α, β, γ |
— | Can contain terminals AND non-terminals |
| Empty string | ε or λ |
_epsilon() or lambda |
The string of length zero |
| Length | \ | w\ | — |
| Terminal alphabet | Σ or T |
-al.* file |
The set of terminals |
| Non-terminal alphabet | N or V_N |
\ | …\ |
Why Is the Distinction Important?
1. To know when you are finished
A derivation is complete when no non-terminals remain. If you still have an |A| in your output, it is not a valid result.
2. To define the generated language
The language of a grammar G is the set of all terminal strings that can be derived from the start symbol.
Sidebar: Decoding Mathematical Notation
The formula
L(G) = { w in T* | S =>* w }reads as:
L(G): the language (L) of grammar GT*: the set of ALL possible strings formed with terminals, including the empty string. The asterisk (*) means “zero or more repetitions”.- If T = {a, b}, then T* = {ε, a, b, aa, ab, ba, bb, aaa, …}
S =>* w: S derives w in zero or more steps. The*after => means “zero or more rule applications”.{ ... | ... }: set notation.{ x | condition }means “the set of all x that satisfy the condition”.So in English: “L(G) is the set of all strings w made only of terminals, such that one can go from S to w through rule applications.”
3. To implement a parser
A parser (syntax analyzer) is a program that takes a string of symbols as input and determines its grammatical structure. For example, a parser for arithmetic expressions takes “2 + 3 * 4” and builds a tree showing that multiplication has priority over addition.
A token is a basic lexical unit recognized by the parser: a word, a number, an operator, etc. Tokens are the terminals of the grammar.
The parser must distinguish input tokens (terminals) from the syntactic categories it builds (non-terminals). Confusing the two leads to subtle bugs.
4. For BP3: distinguishing what plays from what structures
In BP3, terminals produce sound (notes, rests). Non-terminals organize the structure but do not produce anything directly. This distinction is fundamental for the BP2SC transpiler (source-to-source code converter) B7 which must generate playable I3 code.
Key Takeaways
- Terminals are the final symbols that appear in the output. In music: notes. In programming: tokens.
- Non-terminals are intermediate variables that structure the generation. They are all replaced before the end.
- The alphabet is the set of all symbols (terminals + non-terminals).
- BP3 uses the
|symbol|convention for non-terminals, making them visually distinct. - The
|x|notation in formal theory denotes length, not a non-terminal.
To Go Further
- Reference book: Hopcroft, Motwani & Ullman, Introduction to Automata Theory, Languages, and Computation, Chapter 5
- BP3 Documentation: Bol Processor – Pattern Grammars
- Interactive tutorial: CFG Developer
- The genesis of the Bol Processor: Kippen, J. & Bel, B. (1989), “The Identification and Modelling of a Percussion ‘Language'” — how bols became a formal alphabet
- The Lucknow Tabla: Kippen, J. (1988), The Tabla of Lucknow: A Cultural Analysis of a Musical Tradition, Cambridge University Press — the ethnomusicological context in which BP was created
- BP Overview: Bel, B. (1998), “Migrating Musical Concepts — An Overview of the Bol Processor”, Computer Music Journal 22(2) — multi-notation and cross-cultural migration
Glossary
- Alphabet (V or Σ): Finite set of symbols usable in a grammar. Σ often denotes terminals alone, V the complete set.
- Bol (बोल): Mnemonic syllable of the tabla (e.g.,
dha,dhin,tin,ta). The “Bol” in “Bol Processor”. Each bol corresponds to a specific stroke producing a distinct timbre. - String (or word): Finite sequence of symbols. Example: “abc” is a string of 3 symbols.
- Empty string (ε): The string of length zero, containing no symbols. Noted as ε (epsilon) or λ (lambda).
- Derivation: Sequence of non-terminal replacements until only terminals are obtained.
- Disjoint: Two sets are disjoint if they have no elements in common.
- Kleene star (): Operation on a set that produces all possible strings formed with its elements, including the empty string. If T = {a}, then T = {ε, a, b, aa, ab, ba, bb, aaa, …}.
- Language: Set of all terminal strings derivable from the start symbol.
- Non-terminal: Intermediate symbol (variable) that will be replaced during derivation. By convention, in uppercase.
- Parser (syntax analyzer): Program that analyzes a string to determine its grammatical structure.
- Determined rest: Fixed-duration rest, noted as
-in BP3. Corresponds to an explicit pause of one time unit. - Indeterminate rest: Rest with a duration calculated by the context, noted as
_in BP3. Used in polymetric expressions. - Prolonged rest (UndeterminedRest): Silence of indefinite duration, noted as
...in BP3. Equivalent to a fermata. - Start symbol: The initial non-terminal from which all derivation begins (often S for “Sentence” or “Start”).
- Theka: Basic pattern of a tāla (Indian rhythmic cycle), played by the tabla as an ostinato. The theka of tintāl, for example, is a fixed sequence of 16 bols.
- Tāla: Rhythmic cycle structuring Indian classical music (e.g., tintāl = 16 beats, jhaptāl = 10 beats).
- Terminal: Symbol that appears in the final output and can no longer be rewritten. By convention, in lowercase.
- Token: Basic lexical unit (word, number, operator) recognized by a parser. Corresponds to the terminals of the grammar.
- Union (∪): Set operation that combines two sets. A ∪ B contains all elements of A and all elements of B.
- Vibhāg: Section of a tāla. Tintāl has 4 vibhāgs of 4 beats each.
Prerequisites: L2 (Context-free Grammars), I2 (Bol Processor)
Reading time: 6 min
Tags: #grammars #alphabets #terminals #non-terminals #BP3 #formalism