M2) MusicXML Under the Formal Microscope

When XML Meets Musical Notation

MusicXML has a complete formal grammar — its XSD schema. Yet, it cannot generate music any more than MIDI can. How is this possible?

Where Does This Article Fit In?

In I5, we discovered how MusicXML works — tags, measures, notes, interoperability. In M1, we put MIDI under the formal languages microscope and found that it sits at the bottom of Chomsky’s hierarchy (Type 3, regular language). Here, we apply the same microscope to MusicXML.

The result is surprising: MusicXML is one level above MIDI in the hierarchy (Type 2, context-free). And most importantly, unlike MIDI, MusicXML possesses a complete formal grammar — its XSD (XML Schema Definition) schema. However, this formalization gives it no generative power musically. Understanding this paradox is fundamental to grasping why systems like BP3 are necessary.

MusicXML from the Perspective of Formal Languages

An XML Format = A Language → A Tree → A Tree Grammar

Any XML format can be analyzed as a formal language. It has:

An alphabet: the set of Unicode characters, structured into tags, attributes, and textual content.

A syntax: the rules that determine which documents are well-formed. For XML, the fundamental rule is the correct nesting of tags — every <note> must have its </note>.

A schema: additional constraints on the structure. For MusicXML, the XSD defines which elements can contain which other elements, with which attributes, and in what order.

The question is: what type of language is MusicXML? And its formal grammar (the XSD) — what exactly is it, in Chomsky’s classification (L1)?

The Alphabet: Tags, Attributes, Textual Content

Unlike MIDI, which operates on a binary alphabet of 256 possible byte values (M1), MusicXML operates on a structured textual alphabet:

Tags: <note>, <pitch>, <measure>, etc. — approximately 700 elements defined in the XSD

Attributes: id="P1", number="1", type="start" — qualify the elements

Textual content: C, 4, quarter — the values between tags

This rich alphabet is possible because XML is a text format, whereas MIDI is constrained by a binary format.
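The three layers of this alphabet can be pulled apart with a few lines of Python. The sketch below uses the standard library's `xml.etree.ElementTree`; the fragment and the function name `alphabet_layers` are illustrative assumptions, not part of MusicXML or of any tool discussed in this series.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment, heavily simplified compared with a real score.
FRAGMENT = """
<part id="P1">
  <measure number="1">
    <note>
      <pitch><step>C</step><octave>4</octave></pitch>
      <type>quarter</type>
    </note>
  </measure>
</part>
"""

def alphabet_layers(xml_text):
    """Return the tags, attributes, and text values found in a fragment."""
    root = ET.fromstring(xml_text)
    tags, attributes, texts = [], [], []
    for element in root.iter():  # document-order traversal, root included
        tags.append(element.tag)
        attributes.extend(sorted(element.attrib.items()))
        text = (element.text or "").strip()
        if text:  # ignore whitespace-only text used for indentation
            texts.append(text)
    return tags, attributes, texts

tags, attributes, texts = alphabet_layers(FRAGMENT)
print("tags:      ", tags)
print("attributes:", attributes)
print("text:      ", texts)
```

Nothing comparable exists at the byte level of MIDI: there, the meaning of a byte depends on its position in the message rather than on a readable name.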

Level 1: XML = Type 2 (Context-Free) by Nature

Nested Tags = Recursion = Type 2

Over a fixed vocabulary of tag names, well-formed XML documents form a context-free language (Type 2 in Chomsky’s hierarchy, L1). The reason is structural and fundamental:

<score-partwise>
  <part>
    <measure>
      <note>
        <pitch>
          <step>C</step>
        </pitch>
      </note>
    </measure>
  </part>
</score-partwise>

Opening and closing tags form a system of nested parentheses. Every <note> must have its </note>, and elements nest within each other to arbitrary depths. This mechanism — nesting — is exactly what distinguishes context-free languages from regular languages.

Recall (L1): A regular language (Type 3) cannot handle arbitrarily deep nesting: it would require “counting” open parentheses to verify that they are all closed, and a finite automaton has no memory for this. A pushdown automaton (the model for Type 2) pushes opening tags onto a stack and pops them off with each closing tag.

An XML Parser is a Pushdown Automaton

Parsing an XML document works exactly like recognizing a context-free language. On a fragment such as <note><pitch>…</pitch></note>, the parser:

Encounters <note> → pushes “note”

Encounters <pitch> → pushes “pitch”

Encounters </pitch> → pops, verifies it is “pitch”

Encounters </note> → pops, verifies it is “note”

If the stack is empty at the end and no errors have occurred, the document is well-formed.
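The push and pop steps above can be sketched in Python. This is a deliberately minimal checker, not a real XML parser: it assumes the input contains only plain tags (no comments, CDATA sections, or attribute values containing angle brackets), and the names `TAG` and `is_well_nested` are made up for the illustration.

```python
import re

# Matches <name ...>, </name>, and <name .../>; groups: closing slash, name, self-closing slash.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")

def is_well_nested(xml_text):
    """Check tag nesting with an explicit stack, as a pushdown automaton would."""
    stack = []
    for match in TAG.finditer(xml_text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue                      # <rest/> opens and closes at once
        if closing:
            if not stack or stack.pop() != name:
                return False              # closing tag does not match the top of the stack
        else:
            stack.append(name)            # opening tag: push its name
    return not stack                      # well nested only if nothing is left open

print(is_well_nested("<note><pitch><step>C</step></pitch></note>"))
print(is_well_nested("<note><pitch></note></pitch>"))
```

The stack is the only memory the checker needs, and it is exactly the memory a finite automaton lacks.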

Contrast with MIDI

MIDI is a stream of flat events — like reading a list. MusicXML is a tree — like navigating an organizational chart. This structural difference translates directly into Chomsky’s hierarchy: MIDI at Type 3, MusicXML at Type 2.

Level 2: XSD = A Complete Formal Grammar

The MusicXML XSD IS a Formal Specification

Here is the most striking fact when comparing MIDI and MusicXML from a formal perspective:

MIDI: after 40 years of existence, there is no complete formal grammar for the protocol (M1). Only semi-formal fragments exist (pseudo-BNF for SMF, an ABNF grammar in RFC 6295).

MusicXML: the XSD is a complete formal grammar. It defines exactly which documents are valid — which elements can contain which others, in what order, with what constraints.

This is a striking contrast. Thanks to the choice of XML as the underlying format, MusicXML automatically inherits a rigorous formal specification framework — the XML Schema.

XSD = A Regular Tree Grammar

The foundational work by Murata, Lee, Mani & Kawaguchi (2005) established a formal classification of XML schema languages:

DTD (the older validation format) ≈ local tree language: the type of an element depends only on its name, not its position in the tree

XSD (the current format) ≈ single-type tree language: the type of an element depends on its name and the type of its parent

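RELAX NG (used by MEI) ≈ regular tree language: the type of an element can depend on its wider context, not only on its parent (for example on its own content or on its siblings), which makes it the most expressive of the three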

The MusicXML XSD is therefore a regular tree grammar in the sense of formal language theory, more precisely one restricted to the single-type subclass. Each complex type in the XSD (like note, pitch, measure) defines a production: a parent node and the sequence of allowed children.

partwise vs timewise: Two Views of the Same Tree

MusicXML offers two possible root elements — score-partwise and score-timewise (I5). Formally, these are two different tree grammars that describe the same set of musical scores. The XSLT transformation between the two is a tree-to-tree transformation — a well-defined operation in tree language theory.

It’s like having two grammars that generate the same language: the words are the same, only the derivation structure changes.
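To make the tree-to-tree idea concrete, here is a minimal Python sketch of the partwise-to-timewise direction. It is not the XSLT stylesheet mentioned above and ignores many details of real scores; the function name `partwise_to_timewise` and the sample document are assumptions made for the illustration.

```python
import copy
import xml.etree.ElementTree as ET

PARTWISE = """
<score-partwise version="4.0">
  <part id="P1">
    <measure number="1"><note><pitch><step>C</step></pitch></note></measure>
    <measure number="2"><note><pitch><step>D</step></pitch></note></measure>
  </part>
  <part id="P2">
    <measure number="1"><note><pitch><step>E</step></pitch></note></measure>
    <measure number="2"><note><pitch><step>F</step></pitch></note></measure>
  </part>
</score-partwise>
"""

def partwise_to_timewise(partwise_root):
    """Swap the part/measure nesting: parts-of-measures become measures-of-parts."""
    timewise_root = ET.Element("score-timewise", dict(partwise_root.attrib))
    measures = {}  # measure number -> <measure> element in the new tree
    for child in partwise_root:
        if child.tag != "part":
            timewise_root.append(copy.deepcopy(child))  # header elements are kept as-is
            continue
        for measure in child.findall("measure"):
            number = measure.get("number")
            if number not in measures:
                measures[number] = ET.SubElement(
                    timewise_root, "measure", dict(measure.attrib))
            new_part = ET.SubElement(measures[number], "part", dict(child.attrib))
            for item in measure:  # notes and other measure content move under the part
                new_part.append(copy.deepcopy(item))
    return timewise_root

timewise = partwise_to_timewise(ET.fromstring(PARTWISE))
print(ET.tostring(timewise, encoding="unicode"))
```

The musical content (the notes) is untouched; only the two levels of nesting that hold it are exchanged.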

XSD Validation = Recognition by Tree Automaton

Validating a MusicXML document against its XSD is asking: “Does this document belong to the language defined by the tree grammar?” The answer is given by a deterministic tree automaton — a machine that traverses the tree node by node and verifies that each node respects the constraints of its type.

The XSD imposes a constraint called UPA (Unique Particle Attribution): each child element must be assignable to exactly one particle of the content model without lookahead. This guarantees deterministic parsing — the tree automaton never needs to backtrack.
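Validation as recognition can be sketched with a toy tree grammar. Everything below is a hypothetical miniature, far smaller than the real MusicXML XSD: the type names, the content models, and the `validate` function are assumptions for illustration. Content models are written as regular expressions over child names, and the toy relies on Python's `re` module rather than on an explicit deterministic automaton.

```python
import re
import xml.etree.ElementTree as ET

# Content model of each type: a regular expression over child element names,
# where every name is followed by one space.
CONTENT = {
    "measure": r"(note )+",
    "note": r"(pitch |rest )duration ",
    "pitch": r"step (alter )?octave ",
    "leaf": r"",                      # no child elements allowed
}

# Single-type rule: (parent type, child name) determines the child's type.
CHILD_TYPE = {
    ("measure", "note"): "note",
    ("note", "pitch"): "pitch",
    ("note", "rest"): "leaf",
    ("note", "duration"): "leaf",
    ("pitch", "step"): "leaf",
    ("pitch", "alter"): "leaf",
    ("pitch", "octave"): "leaf",
}

def validate(element, type_name):
    """Top-down check: children must match the content model, then recurse."""
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(CONTENT[type_name], children) is None:
        return False
    # Every child name was accepted by the content model, so its type is defined.
    return all(validate(child, CHILD_TYPE[(type_name, child.tag)])
               for child in element)

VALID = ("<measure><note><pitch><step>C</step><octave>4</octave></pitch>"
         "<duration>1</duration></note>"
         "<note><rest/><duration>1</duration></note></measure>")
INVALID = ("<measure><note><duration>1</duration>"
           "<pitch><step>C</step><octave>4</octave></pitch></note></measure>")

print(validate(ET.fromstring(VALID), "measure"))
print(validate(ET.fromstring(INVALID), "measure"))
```

The second document is well-formed XML but places duration before pitch, so it falls outside the language defined by the toy grammar.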

Level 3: Musical Content = Zero Generative Power

The Same Conclusion as for MIDI

Crucial point — and identical result to M1: MusicXML as a musical representation has no generative power.

A MusicXML file describes one specific score — this sonata, this measure, this note. It cannot say:

“Play this motif 3 times with variation” (the 3 occurrences must be written out)

“Randomly choose between two harmonizations” (no stochasticity)

“Superimpose these two streams in a 3:4 ratio” (no native polymetry)

“Define a phrase as a sequence of recombinable motifs” (no musical variables)

In terms of formal languages, a MusicXML file is a word of the language, not a grammar. It describes a result, not a process of generation.

The “Word vs. Grammar” Parallel (as for MIDI)
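The difference between a word and a grammar can be shown with a toy rewriting system. The rules below are a hypothetical illustration and are not BP3 syntax: upper-case names are variables, lower-case names are terminal notes, and `expand` is a made-up helper.

```python
import random

# A tiny grammar: "play the motif 3 times, then choose one of two cadences".
RULES = {
    "PHRASE": [["MOTIF", "MOTIF", "MOTIF", "CADENCE"]],
    "MOTIF": [["c4", "e4", "g4"]],
    "CADENCE": [["g4", "c4"], ["f4", "c4"]],   # two alternatives
}

def expand(symbol, rng):
    """Rewrite a symbol until only terminal notes remain."""
    if symbol not in RULES:
        return [symbol]                         # terminal: nothing to rewrite
    production = rng.choice(RULES[symbol])      # pick one alternative
    notes = []
    for part in production:
        notes.extend(expand(part, rng))
    return notes

# One derivation yields one word: a flat, fully written-out note list.
word = expand("PHRASE", random.Random(0))
print(word)
```

The grammar states the repetition and the choice once; the resulting word has to spell out all three occurrences of the motif and exactly one cadence. A MusicXML file stores the equivalent of `word`, never the equivalent of `RULES`.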

The MusicXML Paradox: Formalized but Not Generative

Two Formats, Two Gaps, Same Result

Let’s compare the situations:

MIDI: no formal grammar for the protocol + no musical generative power

MusicXML: complete formal grammar (XSD) + still no musical generative power

The syntactic formalization of the format confers no expressive power to the musical content. The XSD can say “a <note> element must contain a <pitch> or a <rest/>”, but it cannot say “a dominant chord resolves to the tonic” or “repeat this motif in augmentation”.

Why?

Because the XSD formalizes the container (the file structure), not the content (the musical relationships). It’s the difference between:

Defining the rules of a language’s syntax (French grammar)

Writing a text in that language (a novel)

The XSD is a grammar of the MusicXML format. A MusicXML file is a text written in that format. But neither the XSD nor the file is a grammar of music itself.

Syntactic Formalization Does Not Grant Musical Expressive Power

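This point is fundamental for the project. Formally, XSD validity is a property of the document tree (which elements appear, in what order, with which attributes), not of the musical relationships that the tree encodes.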

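Comparison: The Three Formats Studied

MIDI: Type 3 (regular); no complete formal grammar; no musical generative power

MusicXML: Type 2 (context-free); complete formal grammar (XSD); no musical generative power

BP3: starts at Type 2 (see B1); formalizes the generation process itself; generative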

The lesson: formalizing an exchange format is not enough — a musical specification language is needed to achieve generative power. MusicXML has a complete formal grammar and yet is no more generative than MIDI.

Other musical formats (MEI, ABC, LilyPond, **kern…) deserve the same formal analysis. This is the subject of M8.

Contrast with BP3

Format vs. Language

The MusicXML → BP3 import (documented on the Bol Processor website) is a transformation from a word into a grammar: score fragments become manipulable, reorganizable, combinable variables. It is the transition from a static description to a dynamic specification.
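The general idea of turning a word into a grammar can be sketched as follows. This is not the Bol Processor's import algorithm, whose details are not described here, and the output is not BP3 syntax; `factor_into_rules` and the rule names `M1`, `M2`, `S` are assumptions made for the illustration.

```python
def factor_into_rules(measures):
    """Name each distinct measure once and rewrite the score as a sequence of names."""
    names = {}      # measure content -> variable name
    rules = {}      # variable name -> list of notes (or of variables, for S)
    sequence = []
    for measure in measures:
        key = tuple(measure)
        if key not in names:
            names[key] = f"M{len(names) + 1}"   # first occurrence defines a variable
            rules[names[key]] = list(key)
        sequence.append(names[key])             # later occurrences reuse it
    rules["S"] = sequence                        # start rule: the score as variables
    return rules

# A written-out "word": three measures, the first and last identical.
score = [["c4", "e4"], ["g4"], ["c4", "e4"]]
rules = factor_into_rules(score)
for name, body in rules.items():
    print(name, "->", " ".join(body))
```

Once the fragments are variables, reordering or recombining them means editing the start rule `S` rather than copying notes around.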

Key Takeaways

MusicXML is a context-free language (Type 2): nested tags require a pushdown automaton, unlike MIDI (Type 3, finite automaton).

The MusicXML XSD is a complete formal grammar: unlike MIDI, which has never had a complete formal grammar, MusicXML has a rigorous specification thanks to XML Schema.

XSD = regular tree grammar: the work of Murata, Lee, Mani & Kawaguchi (2005) classifies XML schemas within tree language theory.

MusicXML has no musical generative power: despite its syntactic formalization, it describes a fixed score, not a generation process. It is a word, not a grammar.

The paradox: the formalization of the format (container) does not confer expressive power to the content (music). MIDI without a grammar and MusicXML with a grammar are equally non-generative.

BP3 operates at a fundamentally different level: where MusicXML formalizes the exchange format, BP3 formalizes the musical generation process.

Glossary

Pushdown automaton: a machine that extends the finite automaton with a stack (LIFO memory: Last In, First Out). Pushdown automata recognize exactly the context-free languages (Type 2). An XML parser is a pushdown automaton. See L1.

Tree automaton: a machine that recognizes trees (rather than strings). Used for validating XML documents against a schema.

Context-free: a type of grammar (Type 2) where each rule replaces a single symbol, independently of the context. It allows recursion and nesting — exactly what XML does. See L2.

DTD (Document Type Definition): an older XML validation format, corresponding to a local tree language. MusicXML was originally defined by DTDs; an XSD has been available since MusicXML 2.0.

Tree grammar: a grammar that operates on trees rather than character strings. XML schemas are tree grammars.

Nesting: a structural property of context-free languages — an element can contain other elements of the same type, to arbitrary depth. This is the key to Type 2.

RELAX NG: an XML validation format more expressive than XSD, corresponding to the full class of regular tree languages. Used by MEI.

UPA (Unique Particle Attribution): an XSD constraint that enforces deterministic parsing — each child element corresponds to exactly one particle of the content model.

XSD (XML Schema Definition): the W3C’s XML validation format. For MusicXML, the XSD constitutes a complete formal specification of the format.

Links in the Series

Introduction Series (prerequisites):

I5 — Introduction to MusicXML: The Universal Digital Score

Formal Languages Series (theoretical tools):

L1 — Chomsky Hierarchy: The Classification Framework Used Here

L2 — Context-Free Grammars: The Level at Which MusicXML Sits

L3 — EBNF: The Formal Notation for Grammars

Music Series:

M1 — MIDI Under the Formal Microscope (Type 3 vs. Type 2 Contrast)

M3 (upcoming) — The Three Paradigms of Musical Representation

M8 — Symbolic Formats Beyond MIDI and MusicXML

BP3 Series:

B1 — Probabilistic Grammars: BP3 Starts at Type 2

Glossary:

Glossary — General Glossary for the Series

Prerequisites: I5, L1
Reading time: 15 min
Tags: #musicxml #formal-languages #chomsky #xml-schema #tree-grammar #formalization

Next article: M3 (coming soon) — The Three Paradigms of Musical Representation