L3) EBNF
The Language for Describing Languages
How did the creators of Python define Python’s syntax? With a meta-language, of course. Welcome to the world of EBNF.
Where does this article fit in?
This article is part of the Formal Languages (L) series. It introduces the EBNF notation used to write Context-Free Grammars (CFG — see L2). This notation is directly used in the parser (syntactic analyzer) of the BP2SC project (BP3-to-SuperCollider, the transpiler that translates BP3 into SuperCollider — see B7), which uses Lark, a Python parser generator presented later in this article.
Why is it important?
You now know what a context-free grammar is. But how do you write these grammars in a clear, unambiguous way that can be read by automated tools?
The answer is EBNF (Extended Backus-Naur Form). It is the standard notation used in the official specifications of practically all programming languages. When you read “the Python specification” or “the JSON grammar,” you are reading EBNF (or a close variant).
What is a meta-language?
A meta-language is a language used to describe other languages. It is a higher level of abstraction.
- French can describe the rules of French: it is a meta-language for itself.
- EBNF describes the syntax of Python, JSON, SQL…: it is a meta-language for these languages.
Analogy: If an architect’s plan describes a building, then the technical drawing conventions (scale, symbols, legend) are the “meta-language” that allows one to read that plan.
For the BP2SC project, we wrote the complete specification of BP3 (Bol Processor 3, a musical grammar language — see I2) in EBNF. This 575-line file (bp3_ebnf.xml) is the normative reference: everything the transpiler (a program that translates a source language to another language of the same level) accepts must be covered by this grammar.
The idea in one sentence
EBNF is a standardized notation for writing context-free grammars, with practical operators to express repetitions, options, and alternatives.
Let’s explain step by step
From BNF to EBNF: a practical evolution
BNF (Backus-Naur Form, named after John Backus and Peter Naur) was invented in 1959 to describe the syntax of ALGOL 60 (ALGOrithmic Language, one of the first structured programming languages). It was revolutionary — for the first time, there was a formal notation to define a language — but the notation was verbose.
Why “Extended”?
The “E” in EBNF stands for Extended because EBNF adds operators to BNF without removing anything. Everything expressible in BNF is expressible in EBNF, but the reverse is not directly true.
Analogy: BNF is like a basic calculator (+, -, ×, ÷). EBNF adds advanced functions (square root, power…). You can do everything with the basic one, but it takes longer.
Example in pure BNF:
<liste> ::= <element>
<liste> ::= <liste> "," <element>
This grammar describes a comma-separated list of elements. But we need two rules to express “one or more elements.”
EBNF adds operators that make writing more concise:
liste = element { "," element } ;
Just one line! The curly braces { } mean “zero or more repetitions.”
EBNF operators
EBNF introduces three fundamental operators:
| Operator | Syntax | Meaning | Example |
|---|---|---|---|
| Option | [ x ] |
Zero or one time | [sign] number |
| Repetition | { x } |
Zero or more times | { digit } |
| Alternative | x \| y |
One or the other | "+" \| "-" |
Derived operators (common variants):
Some EBNF variants borrow more compact notations from regular expressions (regex, a formalism for describing text patterns):
| Operator | Meaning | Standard EBNF Equivalent | Origin |
|---|---|---|---|
x+ |
One or more times | x { x } |
Regex |
x? |
Zero or one time (optional) | [ x ] |
Regex |
x* |
Zero or more times | { x } |
Regex |
Reading these symbols
These symbols come from regular expressions (regex), another formalism for describing text patterns:
+suggests “at least one” (like a plus sign that adds)?suggests “maybe” (like a question)*suggests “any number” (like a star that shines multiple times)
Example 1: Integers with optional sign
In BNF (5 rules):
<nombre> ::= <signe> <entier>
<nombre> ::= <entier>
<entier> ::= <chiffre>
<entier> ::= <entier> <chiffre>
<signe> ::= "+" | "-"
In EBNF (2 rules):
nombre = [ "+" | "-" ] chiffre { chiffre } ;
chiffre = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
The EBNF version is more readable and closer to intuition: “a number is an optional sign followed by one or more digits.”
Notation conventions
Several EBNF variants exist. Here are the most common conventions:
ISO 14977 Notation (official standard):
rule = definition ; (* final semicolon *)
"terminal" (* quotes for terminals *)
non_terminal (* no angle brackets *)
(* comment *) (* between parenthesis-asterisks *)
W3C Notation (World Wide Web Consortium, used for XML, HTML):
rule ::= definition (* double colon *)
'terminal' (* single quotes also *)
[A-Z] (* character classes *)
What is a character class?
The notation
[A-Z]means “any character from A to Z” (the 26 uppercase letters).Common examples:
[a-z]: lowercase letters[0-9]: digits[a-zA-Z]: lowercase or uppercase letters[a-zA-Z0-9_]: alphanumeric characters and underscoreThis is a compact notation borrowed from regular expressions.
Informal Notation (common in documentation):
rule → definition (* single arrow *)
<non_terminal> (* angle brackets *)
Example 2: Excerpts from the BP3 grammar
Here are real examples taken from our EBNF specification for BP3:
Structure of a BP3 file:
<production name="bp_file">
<rule>header_line* grammar_section+</rule>
</production>
Translation: “A BP3 file consists of zero or more header lines, followed by one or more grammar sections.”
A BP3 production rule:
<production name="rule_line">
<rule>rule_prefix? weight? flag* lhs "-->" rhs trailing_comment?</rule>
</production>
Translation: “A rule line optionally contains a prefix (gram#1[1]), optionally a weight (<50>), zero or more flags (/Ideas/), a left-hand side, the arrow -->, a right-hand side, and optionally a comment.”
Notes in French notation:
<production name="note_french">
<rule>note_name_fr accidental? digit</rule>
</production>
<production name="note_name_fr">
<rule>"do" | "ré" | "mi" | "fa" | "sol" | "la" | "si"</rule>
</production>
<production name="accidental">
<rule>"#" | "b"</rule>
</production>
Translation: “A French note is a note name (do, ré, etc.), an optional accidental (# or b), and an octave digit.”
Valid examples: do4, ré#5, sib3, fa4
Polymetric expressions:
<production name="polymetric">
<rule>"{" polymetric_body "}"</rule>
</production>
<production name="poly_ratio">
<rule>integer "," rhs_element+</rule>
</production>
Translation: “A polymetric expression is enclosed in curly braces. The poly_ratio form is an integer, a comma, then one or more elements.”
Example: {2, do ré mi} — syntactically: an integer followed by a comma and notes.
Reading an EBNF specification
When consulting a language’s documentation, you will often encounter its EBNF grammar. Here’s how to read it:
Example: The JSON grammar (simplified)
json = value ;
value = object | array | string | number | "true" | "false" | "null" ;
object = "{" [ member { "," member } ] "}" ;
member = string ":" value ;
array = "[" [ value { "," value } ] "]" ;
string = '"' { caractère } '"' ;
number = [ "-" ] entier [ fraction ] [ exposant ] ;
Step-by-step reading of object:
"{": starts with an opening curly brace[...]: optionally…member: a member (key:value){ "," member }: followed by zero or more (comma + member)"}": ends with a closing curly brace
Result: {}, {"a": 1}, {"a": 1, "b": 2} are all valid.
EBNF and tools
EBNF is not just a theoretical notation. Many tools called parser generators can read an EBNF grammar and automatically produce the code for a functional parser:
What is a parser generator?
A parser generator is a program that takes a grammar as input and outputs source code capable of analyzing that grammar.
Analogy: It’s like a “grammar compiler.” You give it the rules of the language, and it gives you the program that understands that language.
| Tool | Target Language | Feature |
|---|---|---|
| ANTLR (ANother Tool for Language Recognition) | Java, Python, C#, etc. | Very popular, integrated IDE (ANTLRWorks) |
| Lark | Python | EBNF-like syntax, simple to use |
| PEG.js | JavaScript | PEG grammars (Parsing Expression Grammar, similar to CFGs but deterministic) |
| Bison/Yacc (Yet Another Compiler-Compiler) | C | Unix classics, very performant |
Example with Lark (Python):
from lark import Lark
grammaire = """
start: expression
expression: terme (("+" | "-") terme)*
terme: facteur (("*" | "/") facteur)*
facteur: NOMBRE | "(" expression ")"
NOMBRE: /[0-9]+/
"""
parser = Lark(grammaire)
arbre = parser.parse("2 + 3 * 4")
print(arbre.pretty())
Output:
start
expression
terme
facteur 2
terme
facteur 3
facteur 4
The limits of EBNF
EBNF describes syntax (form, structure), not semantics (meaning, significance — see L5). It can express:
- “A number is one or more digits” (form)
- “A function has parameters in parentheses” (structure)
It CANNOT express:
- “A number must be less than 2³¹” (value constraint)
- “A variable must be declared before use” (contextual rules)
- “The return type must match the declaration” (semantic consistency)
These checks require additional analysis, performed after parsing (syntactic analysis).
Key takeaways
- EBNF = BNF + practical operators: square brackets
[ ]for option, curly braces{ }for repetition, and the bar|for alternative. - More readable than BNF: an EBNF grammar is often 2 to 3 times shorter than its BNF equivalent.
- Industry standard: official specifications for Python, JSON, XML, SQL, and most languages use EBNF.
- Usable by tools: parser generators like ANTLR or Lark can read EBNF and automatically produce code.
- Syntax only: EBNF describes the form of the language, not its meaning. Semantics require complementary analyses.
To go further
- Official standard: ISO/IEC 14977:1996 – Syntactic metalanguage (EBNF)
- ANTLR tutorial: https://www.antlr.org/
- Lark documentation: https://lark-parser.readthedocs.io/
- Official Python grammar: https://docs.python.org/3/reference/grammar.html
Glossary
| Term | Definition |
|---|---|
| BNF | Backus-Naur Form. Original notation for formal grammars, invented in 1959 by John Backus and Peter Naur to describe ALGOL 60. |
| Character class | Notation [...] denoting a set of characters. [a-z] means “any lowercase letter.” Borrowed from regular expressions. |
| EBNF | Extended Backus-Naur Form. An extension of BNF adding operators for option [ ], repetition { }, and other syntactic conveniences. |
| Regular expression | A formalism for describing text patterns, using operators like *, +, ?. More limited than CFGs but more efficient to analyze. |
| Parser generator | A tool that takes a grammar as input and automatically produces the code for a parser. Examples: ANTLR, Lark, Bison. |
| ISO 14977 | International standard defining the official syntax of EBNF, published in 1996. |
| Meta-language | A language used to describe other languages. EBNF is a meta-language for Python, JSON, etc. |
| Non-terminal | An abstract symbol that will be replaced according to the grammar rules. |
| Parser | A program that analyzes a string of characters according to a grammar and extracts its structure. Synonym: syntactic analyzer. |
| Production | Synonym for “grammar rule.” Defines how a symbol breaks down into other symbols. |
| Terminal | A literal symbol of the final language (keywords, operators, values). Cannot be replaced. |
Prerequisite: L2
Reading time: 8 min
Tags: #ebnf #bnf #grammars #specification #parsing
Next article: L4