L9) Beyond Chomsky
Mildly Context-Sensitive Languages
Chomsky’s hierarchy presents four well-defined levels. But between Type 2 (context-free) and Type 1 (context-sensitive), there is an immense intermediate zone — and it is precisely there that the most useful formalisms for describing natural language and music reside. Welcome to the world of mildly context-sensitive languages.
Where does this article fit in?
In L1, we saw the four classic levels: regular, context-free, context-sensitive, recursively enumerable. In L2, we explored CFGs (Context-Free Grammars) in detail — their power, but also their limitations.
This article explores the territory that lies between CFGs and context-sensitive grammars. It’s not a no man’s land: it’s a rich space, populated by elegant formalisms that have transformed computational linguistics, natural language processing, and — as we will see — musical analysis.
Classic Hierarchy:
Type 3 (regular)
⊂ Type 2 (context-free)
⊂ ??? ← WE ARE HERE
⊂ Type 1 (context-sensitive)
⊂ Type 0 (recursively enumerable)
Why is this important?
Natural language goes beyond CFGs
Consider this sentence in Dutch:
Ik zag Jan Piet Marie helpen leren zwemmen.
(I saw Jan help Piet teach Marie to swim.)
The structure of this sentence contains cross-serial dependencies. The subjects Jan, Piet, Marie are linked respectively to the verbs helpen, leren, zwemmen, in the same order:
Jan ──────────────── helpen
Piet ──────────── leren
Marie ──────── zwemmen
This pattern — where two sequences of arbitrary length must correspond element by element — cannot be captured by a context-free grammar. This was demonstrated by Stuart Shieber in 1985 for Swiss German (a dialect exhibiting the same structures). CFGs are not sufficient for natural language.
But context-sensitive grammars (Type 1) are far too powerful: they can generate languages whose parsing (syntactic analysis — determining if a sequence belongs to the language and reconstructing its structure) is PSPACE-complete (a complexity class for which no efficient algorithm is generally known) — well beyond what is useful in practice.
In music
The situation is analogous. Certain musical structures go beyond CFGs:
- Harmonic progressions with long-distance dependencies (resolution of tensions separated by digressions)
- Polymetry: superposition of rhythmic patterns whose lengths must globally correspond
- Canons and fugues: the copying structure (the subject is reproduced in different voices) resembles the copy language {ww}
We need something more expressive than context-free, but reasonably tractable. This is exactly what mildly context-sensitive languages provide.
In NLP (Natural Language Processing)
Today, most serious parsers in computational linguistics do not use pure CFGs. They use TAGs (Tree Adjoining Grammars), CCGs (Combinatory Categorial Grammars), or variants of MCFGs (Multiple Context-Free Grammars). All these formalisms live in the mildly context-sensitive zone.
The idea in one sentence
Mildly context-sensitive languages form a class of languages strictly more powerful than context-free but which remains parsable in polynomial time, capturing exactly the degree of complexity necessary for natural language and music.
Let’s explain step by step
The problem: what CFGs cannot do
Recall that context-free grammars can generate the language {aⁿbⁿ} — sequences of a's followed by the same number of b's. The stack of a pushdown automaton (a machine that has a “last-in, first-out” memory in addition to its states — see L1) allows “counting” one thing at a time.
But what happens if we need to count three things simultaneously?
The language {aⁿbⁿcⁿ} — for example abc, aabbcc, aaabbbccc — is not context-free. This is a classic result demonstrable by the pumping lemma (a mathematical tool that proves a language is not context-free by showing it lacks certain repetition properties that every context-free language must have).
Deciphering:
A context-free grammar can “stack” symbols then “unstack” them, which allows checking that one quantity corresponds to another (like aⁿbⁿ). But it only has one stack. To simultaneously check that three quantities are equal, one would somehow need “two independent stacks” — and this is exactly what mildly context-sensitive formalisms provide (via the auxiliary trees of TAGs or the composite categories of CCGs).
Analogy: Imagine a parking lot with a single counter at the entrance (CFGs). Each car entering adds 1, each car leaving subtracts 1. At the end of the day, we check that the counter is at zero: as many entries as exits (aⁿbⁿ). That works. But now, the parking lot also requires each car to pay at the cashier before leaving — we need to check that the number of entries, cashier visits, and exits all correspond (aⁿbⁿcⁿ). A single counter is no longer enough: a second, independent one is needed. Mildly context-sensitive formalisms add this second counter — but not an unlimited number (that would be full context-sensitive grammars, far too powerful).
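To make the two-counter idea concrete, here is a minimal Python sketch (an illustration of the analogy, not an actual parser; the function name is ours) that recognizes {aⁿbⁿcⁿ} with two independent counters:

```python
def is_anbncn(word: str) -> bool:
    """Recognize {a^n b^n c^n} with two independent counters."""
    phases = "abc"
    phase = 0        # enforces the shape a* b* c*
    count_ab = 0     # entry counter: +1 per 'a', -1 per 'b'
    count_bc = 0     # cashier counter: +1 per 'b', -1 per 'c'
    for ch in word:
        while phase < 3 and ch != phases[phase]:
            phase += 1           # move on to the next letter block
        if phase == 3:
            return False         # letters out of order, e.g. "abca"
        if ch == "a":
            count_ab += 1
        elif ch == "b":
            count_ab -= 1
            count_bc += 1
        else:                    # ch == "c"
            count_bc -= 1
    return count_ab == 0 and count_bc == 0

assert is_anbncn("aaabbbccc")
assert not is_anbncn("aabbbcc")   # counts disagree
assert not is_anbncn("abcabc")    # wrong shape
```

A pushdown automaton can keep the first counter on its single stack, but has nowhere to keep the second — which is precisely why {aⁿbⁿcⁿ} escapes CFGs.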
Joshi’s 4 criteria (1985)
In 1985, Aravind Joshi proposed an informal but influential characterization of what a “just powerful enough” grammatical formalism for natural language should be. He identified four criteria that a mildly context-sensitive formalism should satisfy:
1. Contain all context-free languages (CFL ⊂ MCS)
(CFL = Context-Free Languages, languages generatable by CFGs; MCS = Mildly Context-Sensitive)
The formalism must at least generate everything a CFG can generate. This is a floor: we want more, not less.
2. Capture cross-serial dependencies
The formalism must be able to generate languages like {aⁿbⁿcⁿ} and the cross-serial structures of Dutch and Swiss German.
3. Have the semilinearity property (constant growth)
Deciphering:
A language is semilinear if the length of its words grows in a “regular” manner — more precisely, if the set of Parikh vectors (counting the occurrences of each symbol) forms a semilinear set. In practice, this excludes languages whose growth is exponential or erratic.
In simple terms: the words of the language cannot have arbitrarily “capricious” lengths. If a word of length 10 exists, it’s not possible for the next valid word to have a length of 1000 without any in between. (A concrete illustration in code follows the list of criteria below.)
4. Be parsable in polynomial time
The parsing problem must be solvable in polynomial time — typically O(n⁶) or better. The notation O(n⁶) means that the computation time grows at most as the sixth power of the input length n: doubling the length multiplies the time by approximately 64. This is crucial for practical applications.
Deciphering:
Full context-sensitive grammars (Type 1) are PSPACE-complete for parsing — essentially intractable for reasonably sized inputs. Mildly context-sensitive formalisms keep parsing polynomial, making them practically usable for analyzing sentences or scores.
These four criteria do not constitute a formal definition (Joshi himself spoke of an “informal characterization”), but they have guided research for decades and have proven remarkably fruitful.
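Criterion 3 is the least intuitive, so here is a tiny Python illustration (an empirical look at Parikh vectors with names of our own choosing, not a decision procedure):

```python
from collections import Counter

def parikh(word: str, alphabet: str = "abc") -> tuple[int, ...]:
    """Parikh vector: the number of occurrences of each symbol."""
    counts = Counter(word)
    return tuple(counts[s] for s in alphabet)

# Parikh vectors of {a^n b^n c^n} for n = 0..4:
words = ["a" * n + "b" * n + "c" * n for n in range(5)]
print([parikh(w) for w in words])
# [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)]
```

All the vectors lie on the single line n·(1, 1, 1): a linear, hence semilinear, set. So {aⁿbⁿcⁿ} satisfies criterion 3 even though it is not context-free — semilinearity and context-freeness are independent properties.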
TAGs (Tree Adjoining Grammars)
TAGs, introduced by Joshi in 1975, are the emblematic formalism of the mildly context-sensitive zone. The foundational idea: instead of rewriting strings (like CFGs), we directly manipulate trees.
The two types of trees
A TAG is defined by two sets of elementary trees:
Initial trees (alpha): complete trees whose non-terminal leaves (intermediate symbols to be replaced, like S, NP, VP — as opposed to terminals, the concrete words of the final result) are marked for substitution (noted ↓):
       S                  NP
      / \                /  \
   NP↓   VP             D    N
        /  \            |    |
       V    NP↓       "le" "chat"
       |
    "mange"
Auxiliary trees (beta): trees with a special node called a foot node, marked by *, of the same type as the root:
VP
/ \
Adv VP*
|
"souvent"
The two operations
Substitution: a leaf node ↓ is replaced by an initial tree of the corresponding type. This is analogous to derivation in a CFG.
Before:      S           NP tree:    NP
            / \                     /  \
         NP↓   VP                  D    N
              /  \                 |    |
             V    NP↓            "le" "chat"
             |
          "mange"

After substituting NP↓ (left):

              S
            /   \
          NP     VP
         /  \   /  \
        D    N V    NP↓
        |    | |
      "le" "chat" "mange"
Adjunction: an auxiliary tree is inserted at an internal node. The existing subtree is “grafted” to the foot node (*). This operation gives TAGs their additional power.
Before:     VP            Auxiliary tree:     VP
           /  \                              /  \
          V    NP                         Adv    VP*
          |    |                           |
      "mange" "souris"                "souvent"

After adjunction to VP:

              VP
             /  \
          Adv    VP
           |    /  \
    "souvent"  V    NP
               |    |
           "mange" "souris"
Deciphering:
Adjunction is the key to TAGs’ power. It allows inserting material in the middle of an existing structure, not just at the leaves. This is what enables capturing long-distance dependencies and nested recursive structures that CFGs cannot express.
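Here is a minimal Python sketch of the two operations (a toy model with names of our own — `Node`, `substitute`, `adjoin` — not a real TAG implementation): substitution fills a ↓-marked leaf, while adjunction splices an auxiliary tree in at an internal node and re-attaches the old subtree at the foot node *.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                       # "S", "NP", "VP"... or a word
    children: list = field(default_factory=list)
    subst: bool = False              # leaf marked ↓ (substitution site)
    foot: bool = False               # foot node * of an auxiliary tree

def substitute(site: Node, initial: Node) -> None:
    """Substitution: replace a ↓ leaf with an initial tree of the same label."""
    assert site.subst and site.label == initial.label
    site.children, site.subst = initial.children, False

def adjoin(inner: Node, aux: Node) -> None:
    """Adjunction: insert an auxiliary tree at an internal node;
    the existing subtree is grafted onto the foot node *."""
    assert inner.label == aux.label
    old_children = inner.children
    stack = [aux]                    # search aux for its foot node
    while stack:
        n = stack.pop()
        if n.foot:
            n.children, n.foot = old_children, False
            break
        stack.extend(n.children)
    inner.children = aux.children

def leaves(t: Node) -> str:
    return t.label if not t.children else " ".join(leaves(c) for c in t.children)

# The adjunction example above: "souvent" inserted into "mange souris"
vp  = Node("VP", [Node("V", [Node("mange")]), Node("NP", [Node("souris")])])
aux = Node("VP", [Node("Adv", [Node("souvent")]), Node("VP", foot=True)])
adjoin(vp, aux)
print(leaves(vp))   # -> souvent mange souris
```

Note how `adjoin` modifies the middle of the tree while `substitute` can only act at the leaves: that asymmetry is exactly the extra power described above.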
What TAGs can express
- {aⁿbⁿcⁿdⁿ} — coordination of four counters
- {ww} — the copy language (a word followed by its exact copy)
- Cross-serial dependencies of Dutch/Swiss German
What TAGs cannot express
- {www} — triple copy. This exceeds the power of TAGs.
- Languages whose fan-out (number of independent components to coordinate) exceeds 2.
TAGs generate exactly the class TAL (Tree Adjoining Languages), strictly between CFL (context-free languages) and CSL (context-sensitive languages, Type 1). TAL is to TAGs what CFL is to CFGs: the class of languages generated by these grammars.
Parsing complexity: O(n⁶) — polynomial, conforming to Joshi’s criterion 4.
CCGs (Combinatory Categorial Grammars)
CCGs, developed primarily by Mark Steedman from the 1980s, adopt a radically different philosophy: no trees, no rewrite rules. Everything relies on types (categories) and combinators.
Categories as types
Each word in the lexicon is assigned one or more syntactic categories that directly encode how that word combines with others.
Basic categories are atomic types: S (sentence), NP (noun phrase), N (noun), etc.
Complex categories are constructed with the operators / and \:
- X/Y: “I’m looking for a Y to my right to become an X”
- X\Y: “I’m looking for a Y to my left to become an X”
Deciphering:
Think of categories as typed functions. A transitive verb like “mange” (eats) has the category (S\NP)/NP — meaning: “give me an NP on the right, then an NP on the left, and I’ll build you an S”. The slashes indicate direction: / looks right, \ looks left.
Example: “Jean mange une pomme” (John eats an apple)
Jean        mange        une    pomme
 NP       (S\NP)/NP      NP/N     N
                         ──────────────  > (forward application)
                               NP
          ──────────────────────────────  > (forward application)
                     S\NP
────────────────────────────────────────  < (backward application)
                    S
Let’s break down each step:
1. une : NP/N looks for an N on the right. It finds pomme : N. Forward application (>): NP/N + N → NP. Result: “une pomme” (an apple) is an NP.
2. mange : (S\NP)/NP looks for an NP on the right. It finds the NP “une pomme”. Forward application (>): (S\NP)/NP + NP → S\NP. Result: “mange une pomme” (eats an apple) is an S\NP (a predicate looking for a subject).
3. S\NP looks for an NP on the left. It finds Jean : NP. Backward application (<): NP + S\NP → S. Result: “Jean mange une pomme” (John eats an apple) is an S.
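The derivation can be replayed mechanically. Below is a minimal Python sketch (our own encoding of categories as nested tuples — not Steedman’s notation or an existing CCG library):

```python
# Basic categories as strings; complex ones as (result, slash, argument)
NP, N, S = "NP", "N", "S"

def fwd(left, right):
    """Forward application >:  X/Y + Y -> X"""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def bwd(left, right):
    """Backward application <:  Y + X\\Y -> X"""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

lexicon = {
    "Jean":  NP,
    "mange": ((S, "\\", NP), "/", NP),   # (S\NP)/NP
    "une":   (NP, "/", N),               # NP/N
    "pomme": N,
}

# The three steps of the derivation above, one call each:
np_obj = fwd(lexicon["une"], lexicon["pomme"])   # NP/N + N -> NP
vp     = fwd(lexicon["mange"], np_obj)           # (S\NP)/NP + NP -> S\NP
s      = bwd(lexicon["Jean"], vp)                # NP + S\NP -> S
print(s)  # S
```

Each step of the derivation corresponds to one function call; the parse succeeds exactly when the calls compose to S.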
Combinators: beyond simple application
CCGs have additional combinators that give them mildly context-sensitive power:
| Combinator | Notation | Effect |
|---|---|---|
| Forward Application | X/Y + Y → X | Consumes on the right |
| Backward Application | Y + X\Y → X | Consumes on the left |
| Forward Composition (B) | X/Y + Y/Z → X/Z | Chains two functions |
| Backward Composition (B) | Y\Z + X\Y → X\Z | Chains on the left |
| Forward Type-raising (T) | X → T/(T\X) | Transforms an argument into a function |
Deciphering:
Without the composition and type-raising combinators, CCGs would be exactly as powerful as CFGs. It is these additional operations that allow CCGs to capture cross-serial dependencies and non-context-free structures. The idea comes from combinatory logic, developed by Schönfinkel and Curry in the 1920s-1930s — a formalism that eliminates variables by replacing them with higher-order functions, a precursor to lambda calculus.
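Reusing the encoding from the previous sketch, composition and type-raising are just two more small functions (again a hedged toy model, not a full CCG engine):

```python
def compose(left, right):
    """Forward composition B:  X/Y + Y/Z -> X/Z"""
    if (isinstance(left, tuple) and left[1] == "/" and
            isinstance(right, tuple) and right[1] == "/" and
            left[2] == right[0]):
        return (left[0], "/", right[2])
    return None

def type_raise(x, t=S):
    """Forward type-raising T:  X -> T/(T\\X)"""
    return (t, "/", (t, "\\", x))

# Type-raise the subject, then compose it with the verb:
jean_raised = type_raise(NP)                      # S/(S\NP)
partial = compose(jean_raised, lexicon["mange"])  # S/(S\NP) + (S\NP)/NP -> S/NP
print(partial)  # ('S', '/', 'NP')
```

The composed category S/NP is the constituent “Jean mange” — a unit that simple application can never build, and precisely the kind of “incomplete” constituent that lets CCGs handle extraction and cross-serial dependencies.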
The surprising equivalence
Here is one of the most remarkable results in formal language theory. In 1994, Vijay-Shanker and Weir demonstrated that four independently developed formalisms generate exactly the same class of languages:
TAG ≡ CCG ≡ LIG ≡ HG
| Formalism | Author(s) | Central Idea |
|---|---|---|
| TAG (Tree Adjoining Grammars) | Joshi (1975) | Tree manipulation + adjunction |
| CCG (Combinatory Categorial Grammars) | Steedman (1987) | Typed categories + combinators |
| LIG (Linear Indexed Grammars) | Gazdar (1988) | Stacks of indices on non-terminals |
| HG (Head Grammars) | Pollard (1984) | Wrapping operation (insertion of a string around a “head” element) |
Four radically different approaches — trees, types, stacks, and lexical heads — that converge to the same expressiveness. This is no coincidence: it is a sign that this class of languages is natural, in the mathematical sense of the term, in the same way that regular languages are independently characterized by finite automata, regular expressions, and regular grammars.
Deciphering:
This convergence is analogous to the Church-Turing thesis (the postulate that any reasonable notion of “effective computation” is equivalent to what a Turing machine can do): when very different formalisms converge on the same notion, it is probably because that notion captures something fundamental. Here, the TAL class (Tree Adjoining Languages) seems to capture a “sweet spot” of linguistic complexity.
MCFG: the infinite hierarchy
TAGs and their equivalents are just one point on a continuous scale. MCFGs (Multiple Context-Free Grammars), introduced by Seki et al. (1991), generalize TAGs into a hierarchy parameterized by an integer k called the fan-out.
MCFG(1) = CFG (context-free)
MCFG(2) = TAG / CCG / LIG / HG (mildly context-sensitive)
MCFG(3) ⊃ MCFG(2) (can generate {www})
MCFG(4) ⊃ MCFG(3) (can generate {wwww})
...
MCFG(k) ⊃ MCFG(k-1)
...
Union of all MCFG(k) ⊂ semilinear languages (every MCFL is semilinear)
The fan-out k controls the number of “components” that a rule can manipulate simultaneously. A fan-out of 1 yields classic CFGs. A fan-out of 2 yields the TAG level (strictly speaking, TAGs coincide with the well-nested fragment of MCFG(2)). Each increment allows expressing an additional copy.
Deciphering:
The fan-out somehow measures the “number of stacks” available for computation. CFGs have one stack (fan-out 1), TAGs have two (fan-out 2), and MCFG(k) have k. Parsing complexity is O(n^(3k)), so it remains polynomial for any fixed k — but the degree of the polynomial increases.
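To see fan-out in action, here is a minimal Python sketch of an MCFG(2) for {aⁿbⁿcⁿ} (our own encoding of the rules, not the notation of Seki et al.): the nonterminal A derives pairs of strings — the two “components” a CFG nonterminal lacks.

```python
def A(n: int) -> tuple[str, str]:
    """Fan-out-2 nonterminal deriving the PAIR (a^n b^n, c^n).
    MCFG rules:  A(e, e)   and   A(a·x1·b, x2·c) <- A(x1, x2)."""
    if n == 0:
        return ("", "")
    x1, x2 = A(n - 1)
    return ("a" + x1 + "b", x2 + "c")

def S(n: int) -> str:
    """Start rule S(x1·x2) <- A(x1, x2): concatenate the two components."""
    x1, x2 = A(n)
    return x1 + x2

print([S(n) for n in range(4)])
# ['', 'abc', 'aabbcc', 'aaabbbccc']
```

Because A keeps the aⁿbⁿ part and the cⁿ part in two coordinated components, the three counts stay synchronized — something no single-component (fan-out 1) nonterminal can do. A third component (fan-out 3) would, in the same way, let a nonterminal coordinate the three copies of {www}.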
This creates a refined Chomsky hierarchy between Type 2 and Type 1, with an infinite number of intermediate steps. The practical question then is: what fan-out is necessary for natural language? The current consensus is that fan-out 2 (TAG/CCG) is sufficient for the vast majority of syntactic phenomena, with perhaps fan-out 3 or 4 for a few exotic constructions in certain languages.
Applications to music
The mildly context-sensitive zone is not just a theoretical object for linguists. It has direct and profound applications in musical analysis.
Steedman: jazz grammars with CCG
Mark Steedman (the same one who developed CCGs for linguistics) applied his formalism to jazz chord progressions as early as the 1980s-90s. The idea: chords do not follow each other randomly; they obey combination rules analogous to the syntax of natural languages.
For example, a V7 → I resolution (dominant seventh chord to tonic chord — the fundamental harmonic resolution in tonal music) is analogous to applying an argument to a verb. A ii-V-I (the subdominant → dominant → tonic progression, for example Dm7 → G7 → Cmaj7 in C major — the most common jazz pattern) is a composition of harmonic functions. Recursivity and long-distance dependencies (a tension introduced in one measure and resolved much later) require a formalism more powerful than CFGs.
Sidebar: the same progression in BP3
In CCG, Steedman models a ii-V-I turnaround (a sequence of chords that returns to the tonic to restart a new chorus) as a composition of typed functions: the ii7 (category (I\I)/V) “awaits” a V7 to its right to form a cadential movement, which itself “awaits” an I to resolve. This is the application of harmonic functions.

BP3 (Bol Processor 3, algorithmic composition software based on formal grammars) expresses the same structure with a weighted grammar (where each rule has a numerical weight <n> influencing its selection probability) and a flag (a conditional variable that controls which rules can apply — see B4):

gram#1[1] /ch=4/ |Jazz| --> |Tour| |Jazz| /ch-1/
gram#1[2] /ch<1/ |Jazz| --> |Coda|
gram#1[3] <3> |Tour| --> Dm7 G7 Cmaj7
gram#1[4] <2> |Tour| --> Dm7 Db7 Cmaj7
gram#1[5] <1> |Tour| --> Bb7 A7 Dm7 G7 Cmaj7
gram#1[6] |Coda| --> Dm7 G7 Cmaj7 _

Rule [3] generates the classic ii-V-I turnaround (weight 3, the most frequent). Rule [4] uses tritone substitution (replacement of a dominant chord by another located a tritone — three whole tones — away: Db7 instead of G7, because both share the same tension notes). Rule [5] adds an extended cycle of fifths (a chain of V→I resolutions: Bb7→A7→Dm7→G7→Cmaj7). The ch flag counts the choruses (complete rounds of the harmonic grid): after 4 rounds, the /ch<1/ condition triggers the coda (concluding section of the piece).

Where CCGs compose typed functions (the V7 “consumes” the ii7), BP3 composes weighted sequences with state control. The musical result is the same; the formal mechanisms are different. Both live in the mildly context-sensitive zone — the ch flag plays a role analogous to CCGs’ composition combinators: it coordinates the global structure beyond context-free.
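For readers who want to experiment without BP3, here is a rough Python simulation of the sidebar’s grammar (our interpretation, not BP3 itself: we read /ch=4/ as initializing the flag and /ch-1/ as a decrement, following the description above; the function and variable names are ours):

```python
import random

# |Tour| rules as (weight, expansion) pairs, mirroring rules [3]-[5]
TOUR_RULES = [
    (3, ["Dm7", "G7", "Cmaj7"]),               # [3] classic ii-V-I
    (2, ["Dm7", "Db7", "Cmaj7"]),              # [4] tritone substitution
    (1, ["Bb7", "A7", "Dm7", "G7", "Cmaj7"]),  # [5] extended cycle of fifths
]

def jazz(choruses: int = 4) -> list[str]:
    """|Jazz| -> |Tour| |Jazz| while ch > 0 (rule [1]), then the coda (rules [2], [6])."""
    ch = choruses                     # the 'ch' flag: /ch=4/ ...
    chords: list[str] = []
    while ch > 0:
        weights = [w for w, _ in TOUR_RULES]
        bodies = [b for _, b in TOUR_RULES]
        chords += random.choices(bodies, weights=weights)[0]
        ch -= 1                       # ... then /ch-1/ per chorus
    chords += ["Dm7", "G7", "Cmaj7", "_"]   # rule [6]: the coda
    return chords

random.seed(42)
print(" ".join(jazz()))
# e.g.: Dm7 G7 Cmaj7 Bb7 A7 Dm7 G7 Cmaj7 Dm7 Db7 Cmaj7 Dm7 G7 Cmaj7 Dm7 G7 Cmaj7 _
```

The weighted choice reproduces rules [3]-[5]; the ch counter plays the role of the flag, stopping the recursion of rule [1] and handing over to the coda of rule [6].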
Rohrmeier: tonal harmony as tree structure
Martin Rohrmeier and his collaborators have developed models of tonal harmony using tree structures inspired by TAGs. The idea is that harmonic prolongations and digressions are adjunction operations: material is inserted in the middle of a structural progression, exactly as an adverb is inserted into a sentence.
Basic progression: I ──── V ──── I
(tension → resolution)
With adjunction: I ── IV ── ii ── V ──── I
└──── adjunction ────┘
BP3: positioning in the hierarchy
Where does BP3 fit into this hierarchy? It’s a subtle question:
- Without flags: BP3 grammars are essentially context-free (derivation by rewriting non-terminals)
- With bounded flags: flags allow conditioning rule application based on a finite global state, which gives power comparable to mildly context-sensitive (TAG/CCG)
- With unbounded flags (if allowed): this would reach Turing-complete power (capable of computing anything a Turing machine can compute — the theoretical maximum of computational power)
In practice, BP3 flags are bounded (finite number of possible values), which comfortably places BP3 in the MCS zone.
Refined Chomsky Hierarchy — BP3 positioning:
Type 3 (regular) Finite automata
│
Type 2 (context-free) CFG, pushdown automata
│ BP3 without flags ← here
│
MCFG(2) = TAG/CCG (mildly CS) Parsing O(n⁶)
│ BP3 with bounded flags ← here
│
MCFG(3), MCFG(4) ... Parsing O(n⁹), O(n¹²)...
│
Type 1 (context-sensitive) PSPACE-complete
│
Type 0 (recursively enumerable) Turing-complete
Key takeaways
- Chomsky’s Gap: Between context-free languages (Type 2) and context-sensitive languages (Type 1), there is an immense and structured zone — mildly context-sensitive languages — that captures exactly the complexity needed for natural language and music.
- Joshi’s Criteria: An MCS formalism must (a) contain CFLs, (b) capture cross-serial dependencies, (c) preserve semilinearity, and (d) allow polynomial parsing.
- TAGs and Adjunction: Tree Adjoining Grammars manipulate trees with two operations (substitution and adjunction). Adjunction — inserting a tree in the middle of another — is the key to their expressiveness.
- CCGs and Combinators: Combinatory Categorial Grammars encode syntax in typed categories combined by operators (application, composition, type-raising).
- The TAG/CCG/LIG/HG Equivalence: Four independent formalisms converge to the same class of languages, a sign that it is mathematically natural.
- MCFGs and Fan-out: MCFG(k) create an infinite hierarchy between CFGs and CSGs (context-sensitive grammars), parameterized by the fan-out k. TAGs correspond to k = 2.
- In Music: Steedman and Rohrmeier use CCGs and TAGs to model harmonic progressions. BP3 with bounded flags is located in the mildly context-sensitive zone.
Further Reading
- Joshi, A. (1985). “Tree Adjoining Grammars: How much context-sensitivity is required to provide reasonable structural descriptions?” in Dowty et al. (eds.), Natural Language Parsing — the foundational article that lays out the 4 criteria and introduces the notion of mildly context-sensitive.
- Steedman, M. (2000). The Syntactic Process. MIT Press — the reference on CCGs, from linguistic theory to computational applications, with extensions to musical grammars.
- Kallmeyer, L. (2010). Parsing Beyond Context-Free Grammars. Springer — an accessible pedagogical introduction to TAGs, CCGs, MCFGs, and associated parsing. Ideal as a first technical contact.
- Vijay-Shanker, K. & Weir, D. (1994). “The equivalence of four extensions of context-free grammars.” Mathematical Systems Theory, 27(6) — the formal proof of the TAG = CCG = LIG = HG equivalence.
- Rohrmeier, M. (2011). “Towards a generative syntax of tonal harmony.” Journal of Mathematics and Music — application of tree structures (TAG-inspired) to tonal harmony.
- Shieber, S. (1985). “Evidence against the context-freeness of natural language.” Linguistics and Philosophy — the proof that Swiss German is not context-free, a trigger for all this research.
- In this series: L1 for the foundations, L2 for the starting point.
Glossary
- Adjunction: Central operation of TAGs. Consists of inserting an auxiliary tree at an internal node of an existing tree, by “cutting” the subtree and reattaching it to the foot node. This is what distinguishes TAGs from CFGs.
- CCG (Combinatory Categorial Grammar): Grammatical formalism where each word carries a typed category (like (S\NP)/NP for a transitive verb) and words combine via operators (application, composition, type-raising).
- Combinator: Combination operation in CCGs. The main ones are application (forward/backward), composition (B), and type-raising (T). They originate from Curry’s combinatory logic.
- Cross-serial dependencies: Syntactic structure where two sequences of elements correspond element by element, like subjects and verbs in Dutch (A₁ A₂ A₃ B₁ B₂ B₃ where Aᵢ depends on Bᵢ).
- Fan-out: Parameter of MCFGs that measures the number of components a rule can manipulate simultaneously. Fan-out 1 = CFG, fan-out 2 = TAG.
- Mildly context-sensitive: Class of languages satisfying Joshi’s criteria: contains CFLs, captures cross-serial dependencies, is semilinear, and allows polynomial parsing.
- Polynomial parsing: A syntactic analysis algorithm whose execution time is bounded by a polynomial in the length of the input. For TAGs: O(n⁶). For CFGs (CYK algorithm, Cocke-Younger-Kasami): O(n³).
- Semilinearity: Property of a language whose Parikh vectors (counting the occurrences of each symbol) form a semilinear set. Intuitively, word lengths grow regularly.
- Substitution: TAG operation that replaces a marked leaf node (↓) with an initial tree of the corresponding type. Analogous to derivation in a CFG.
- TAG (Tree Adjoining Grammar): Grammatical formalism that manipulates elementary trees (initial and auxiliary) via two operations: substitution and adjunction. Introduced by Joshi (1975).
Prerequisites: L1, L2
Reading time: 12 min
Tags: #mildly-context-sensitive #tag #ccg #chomsky-hierarchy
Previous article: L8
Next article: L10