L3) EBNF

The Language for Describing Languages

How did the creators of Python define Python’s syntax? With a meta-language, of course. Welcome to the world of EBNF.

Where This Article Fits In

This article is part of the Formal Languages (L) series. It introduces the EBNF notation used to write context-free grammars (CFG — see L2). This notation is directly used in the parser (syntactic analyzer) of the BP2SC project (BP3-to-SuperCollider, the transpiler that translates BP3 to SuperCollider — see B7), which uses Lark, a Python parser generator presented later in this article.


Why It’s Important

You now know what a context-free grammar is. But how do you write these grammars in a clear, unambiguous way that can be read by automated tools?

The answer is EBNF (Extended Backus-Naur Form). It is the standard notation used in the official specifications of practically all programming languages. When you read “the Python specification” or “the JSON grammar,” you are reading EBNF (or a close variant).

What is a meta-language?

A meta-language is a language used to describe other languages. It is a higher level of abstraction.

  • French can describe the rules of French: it is a meta-language for itself
  • EBNF describes the syntax of Python, JSON, SQL…: it is a meta-language for these languages

Analogy: If an architect’s plan describes a building, then the technical drawing conventions (scale, symbols, legend) are the “meta-language” that allows one to read that plan.

For the BP2SC project, we wrote the complete specification of BP3 (Bol Processor 3, a language for musical grammars — see I2) in EBNF. This 575-line file (bp3_ebnf.xml) is the normative reference: everything the transpiler (a program that translates a source language to another language of the same level) accepts must be covered by this grammar.

The Idea in One Sentence

EBNF is a standardized notation for writing context-free grammars, with practical operators for expressing repetitions, options, and alternatives.


Let’s Explain Step by Step

From BNF to EBNF: A Practical Evolution

BNF (Backus-Naur Form, named after John Backus and Peter Naur) was invented in 1959 to describe the syntax of ALGOL 60 (ALGOrithmic Language, one of the first structured programming languages). It was revolutionary — for the first time, there was a formal notation to define a language — but the notation was verbose.

Why “Extended”?

The “E” in EBNF stands for Extended because EBNF adds operators to BNF without removing anything. Everything expressible in BNF is expressible in EBNF, but the reverse is not directly true.

Analogy: BNF is like a basic calculator (+, -, ×, ÷). EBNF adds advanced functions (square root, power…). You can do everything with the basic one, but it takes longer.

Example in pure BNF:

 

<liste> ::= <element>
<liste> ::= <liste> "," <element>

 

This grammar describes a list of comma-separated elements. But two rules are needed to express “one or more elements.”

EBNF adds operators that make writing more concise:

 

liste = element { "," element } ;

 

Just one line! The curly braces { } mean “zero or more repetitions.”

EBNF Operators

EBNF introduces three fundamental operators:

Operator Syntax Meaning Example
Option [ x ] Zero or one time [sign] number
Repetition { x } Zero or more times { digit }
Alternative x &#124; y One or the other "+" &#124; "-"

Derived Operators (common variants):

Some EBNF variants borrow more compact notations from regular expressions (regex, a formalism for describing text patterns) for more compact notations:

Operator Meaning Standard EBNF Equivalent Origin
x+ One or more times x { x } Regex
x? Zero or one time (optional) [ x ] Regex
x* Zero or more times { x } Regex

Reading these symbols

These symbols come from regular expressions (regex), another formalism for describing text patterns:

  • + suggests “at least one” (like a plus sign that adds)
  • ? suggests “maybe” (like a question)
  • * suggests “any number” (like a star that shines multiple times)

Example 1: Integers with Optional Sign

In BNF (5 rules):

 

<nombre>   ::= <signe> <entier>
<nombre>   ::= <entier>
<entier>   ::= <chiffre>
<entier>   ::= <entier> <chiffre>
<signe>    ::= "+" | "-"

 

In EBNF (2 rules):

 

nombre = [ "+" | "-" ] chiffre { chiffre } ;
chiffre = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

 

The EBNF version is more readable and closer to intuition: “a number is an optional sign followed by one or more digits.”

Notation Conventions

There are several variants of EBNF. Here are the most common conventions:

ISO 14977 Notation (official standard):

 

règle = définition ;              (* final semicolon *)
"terminal"                        (* quotes for terminals *)
non_terminal                      (* no angle brackets *)
(* commentaire *)                 (* between parenthesis-asterisks *)

 

W3C Notation (World Wide Web Consortium, used for XML, HTML):

 

règle ::= définition              (* double colon *)
'terminal'                        (* single quotes also *)
[A-Z]                             (* character classes *)

 

What is a character class?

The notation [A-Z] means “any character from A to Z” (the 26 uppercase letters).

Common examples:

  • [a-z] : lowercase letters
  • [0-9] : digits
  • [a-zA-Z] : lowercase or uppercase letters
  • [a-zA-Z0-9_] : alphanumeric characters and underscore

This is a compact notation borrowed from regular expressions.

Informal Notation (common in documentation):

 

règle → définition               (* single arrow *)
<non_terminal>                    (* angle brackets *)

 

Example 2: Excerpts from the BP3 Grammar

Here are real examples taken from our EBNF specification for BP3:

Structure of a BP3 file:

 

<production name="bp_file">
    <rule>header_line* grammar_section+</rule>
</production>

 

Translation: “A BP3 file is zero or more header lines, followed by one or more grammar sections.”

A BP3 production rule:

 

<production name="rule_line">
    <rule>rule_prefix? weight? flag* lhs "-->" rhs trailing_comment?</rule>
</production>

 

Translation: “A rule line optionally contains a prefix (gram#1[1]), optionally a weight (<50>), zero or more flags (/Ideas/), a left-hand side, the arrow -->, a right-hand side, and optionally a comment.”

Notes in French notation:

 

<production name="note_french">
    <rule>note_name_fr accidental? digit</rule>
</production>

<production name="note_name_fr">
    <rule>"do" | "ré" | "mi" | "fa" | "sol" | "la" | "si"</rule>
</production>

<production name="accidental">
    <rule>"#" | "b"</rule>
</production>

 

Translation: “A French note is a note name (do, , etc.), an optional accidental (# or b), and an octave digit.”

Valid examples: do4, ré#5, sib3, fa4

Polymetric expressions:

 

<production name="polymetric">
    <rule>"{" polymetric_body "}"</rule>
</production>

<production name="poly_ratio">
    <rule>integer "," rhs_element+</rule>
</production>

 

Translation: “A polymetric expression is enclosed in curly braces. The poly_ratio form is an integer, a comma, then one or more elements.”

Example: {2, do ré mi} — syntactically: an integer followed by a comma and notes.

Reading an EBNF Specification

When you consult a language’s documentation, you will often encounter its EBNF grammar. Here’s how to read it:

Example: The JSON grammar (simplified)

 

json      = value ;
value     = object | array | string | number | "true" | "false" | "null" ;
object    = "{" [ member { "," member } ] "}" ;
member    = string ":" value ;
array     = "[" [ value { "," value } ] "]" ;
string    = '"' { caractère } '"' ;
number    = [ "-" ] entier [ fraction ] [ exposant ] ;

 

Step-by-step reading of object:

  1. "{" : starts with an opening curly brace
  2. [...] : optionally…
  3. member : a member (key:value)
  4. { "," member } : followed by zero or more (comma + member)
  5. "}" : ends with a closing curly brace

Result: {}, {"a": 1}, {"a": 1, "b": 2} are all valid.

EBNF and Tools

EBNF is not just a theoretical notation. Many tools called parser generators can read an EBNF grammar and automatically produce the code for a functional parser:

What is a parser generator?

A parser generator is a program that takes a grammar as input and outputs source code capable of analyzing that grammar.

Analogy: It’s like a “grammar compiler.” You give it the rules of the language, and it gives you the program that understands that language.

Tool Target Language Feature
ANTLR (ANother Tool for Language Recognition) Java, Python, C#, etc. Very popular, integrated IDE (ANTLRWorks)
Lark Python EBNF-like syntax, simple to use
PEG.js JavaScript PEG grammars (Parsing Expression Grammar, similar to CFGs but deterministic)
Bison/Yacc (Yet Another Compiler-Compiler) C Unix classics, very performant

Example with Lark (Python):

 

from lark import Lark

grammaire = """
    start: expression
    expression: terme (("+" | "-") terme)*
    terme: facteur (("*" | "/") facteur)*
    facteur: NOMBRE | "(" expression ")"
    NOMBRE: /[0-9]+/
"""

parser = Lark(grammaire)
arbre = parser.parse("2 + 3 * 4")
print(arbre.pretty())

 

Output:

start
  expression
    terme
      facteur       2
    terme
      facteur       3
      facteur       4

 

The Limitations of EBNF

EBNF describes syntax (form, structure), not semantics (meaning, significance — see L5). It can express:

  • “A number is one or more digits” (form)
  • “A function has parameters in parentheses” (structure)

It CANNOT express:

  • “A number must be less than 2³¹” (value constraint)
  • “A variable must be declared before use” (contextual rules)
  • “The return type must match the declaration” (semantic consistency)

These checks require additional analysis, performed after parsing (syntactic analysis).


Key Takeaways

  • EBNF = BNF + practical operators: square brackets [ ] for option, curly braces { } for repetition, and the bar | for alternative.
  • More readable than BNF: an EBNF grammar is often 2 to 3 times shorter than its BNF equivalent.
  • Industry standard: official specifications for Python, JSON, XML, SQL, and most languages use EBNF.
  • Usable by tools: parser generators like ANTLR or Lark can read EBNF and automatically produce code.
  • Syntax only: EBNF describes the form of the language, not its meaning. Semantics require complementary analyses.

To Go Further

  • Official Standard: ISO/IEC 14977:1996 – Syntactic metalanguage (EBNF)
  • ANTLR Tutorial: https://www.antlr.org/
  • Lark Documentation: https://lark-parser.readthedocs.io/
  • Official Python Grammar: https://docs.python.org/3/reference/grammar.html

Glossary

Term Definition
BNF Backus-Naur Form. Original notation for formal grammars, invented in 1959 by John Backus and Peter Naur to describe ALGOL 60.
Character Class Notation [...] designating a set of characters. [a-z] means “any lowercase letter.” Borrowed from regular expressions.
EBNF Extended Backus-Naur Form. An extension of BNF adding operators for option [ ], repetition { }, and other syntactic conveniences.
Regular Expression A formalism for describing text patterns, using operators like *, +, ?. More limited than CFGs but more efficient to analyze.
Parser Generator A tool that takes a grammar as input and automatically produces the code for a parser. Examples: ANTLR, Lark, Bison.
ISO 14977 International standard defining the official syntax of EBNF, published in 1996.
Meta-language A language used to describe other languages. EBNF is a meta-language for Python, JSON, etc.
Non-terminal An abstract symbol that will be replaced according to the grammar rules.
Parser A program that analyzes a string of characters according to a grammar and extracts its structure. Synonym: syntactic analyzer.
Production Synonym for “grammar rule.” Defines how a symbol breaks down into other symbols.
Terminal A literal symbol of the final language (keywords, operators, values). Cannot be replaced.

Prerequisite: L2
Reading time: 8 min
Tags: #ebnf #bnf #grammars #specification #parsing


Next article: L4