Context-free Languages

Context-Free Grammar

A context-free grammar is G = (V,Σ,R,S), where The set V—Σ = the set of non-terminal symbols.

It is standard to write "A ⇾ α" when (A,α) ∈ R.

Notational Conventions

It is common in practice to express a grammar simply in terms of its productions as follows: As an example consider the grammar:
S ⇾ aSb | ε
The implication is that V = {S,a,b}, Σ = {a,b}, the start symbols is S and the rules are:
S ⇾ aSb
S ⇾ ε

Derivation

A derivation is a sequence of steps which begins with the start symbol, uses the rules to do replacements, and ends with a terminal string (from which no further steps can be taken).

Formally, one-step derivation, , is a binary relation on V*. When (α,β) belongs to this relation, we write α ⇒ β. The definition is:
α ⇒ β  if and only if  α = α12 and β = α1γα2 where A ⇾ γ is a rule.
The derives relation, *, is the usual reflexive, transitive closure of .

We write L(G) for the language generated by grammar G as { w ∈ Σ*: S ⇒* w }.

The intermediate forms appearing in derivations are strings in V* containing both terminals and non-terminals. These are called sentential forms.

A simple derivation using the above grammar is:
S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaabbb = a3b3

Leftmost and Rightmost Derivations

Two particularly important derivations are these: These two derivations are important because any parsing program (i.e., one which attempts to construct a derivation) would construct one of these two. A grammar is said to be ambiguous (which we'll defined in another sense in the next section) if a string has more than one leftmost, or more than one rightmost derivation.

It seems intuitively obvious that if a derivation exists, these two forms should exist, but one technically needs to make a sounder argument, which is done so in the next section of the textbook.

Equal a's and b's

The language of equal a's and b's and its variants are our main examples of non-regular languages. As a simple example, we'll prove that the language generated by the grammar G
S ⇾ aSb | ε
is L = { anbn : n ≥ 0 }.

Have to show 2 inclusions:
  1. L ⊆ L(G): Everything in the language can be generated by the grammar
  2. L(G) ⊆ L: Everything generated by the grammar is in the language.
Proofs:
  1. Everything in the language is generate by a derivation in the grammar from S.

    This is usually proved by induction on the length of strings in the language. In this case, take w = anbn and prove by induction that S ⇒* w. For n = 0, obvious. Otherwise, assume true for n, i.e.
    S ⇒* anbn
    
    Then
    S ⇒ aSb ⇒* aanbnb = an+1bn+1
    
    and thus S ⇒* an+1bn+1.
  2. Everything derived by the grammar from S is in the language.

    Usually, for these kinds of proofs, you have to say something about all the sentential forms derivable from all (useful) symbols. Fortunately, there's only one symbol in this simple grammar.

    Claim:
    If S ⇒* α then either
        α = aibi, some i ≥ 0  or  α = aiSbi, some i ≥ 0,
    
    Proof by induction on the length of the derivation. For length 1, obvious. Now for n ≥ 1, take a derivation of length n+1:
    S ⇒n α ⇒ β 
    
    By induction, and to have the ability to take one more step, α = aiSbi, for some i ≥ 0. Thus for either of the replacements S ⇾ aSb or S ⇾ ε, β must be of the correct format.

Equal a's and b's in general

More difficult is generating the language of strings with equal number of a's and b's in any arrangement
L = { w ∈ {a,b}* : a(w) = b(w) }
We're using the notation:
a(w) = number of a's in a string w
b(w) = number of b's in a string w
This language can be generated by the following grammar:
S ⇾ aSbS | bSaS | ε
Of the two proofs (A) and (B), (B) is simpler. We simply argue by induction that:
If S ⇒* α, then a(α) = b(α) (ignore interspersed S's, if any); and so L(G) ⊆ L.
The harder part is showing that any string in L can be derived by this grammar. Consider the derivation of the string aababb:
S ⇒ aSbS ⇒ aSb ⇒ aaSbSb ⇒ aabSb ⇒ aabaSbSb ⇒2 aababb
The proof by induction on the string length is left as an exercise.

It is interesting to compare this grammar to the grammar from problem 3.1.5 which also generates the same language.

More Examples

Arithmetic expressions. The terminals are {+,*,(,),a}. The symbol a represents a token that we identify by other means. It could be a number, variable, etc.
E ⇾ E + T | E - T | T    E is Expression
T ⇾ T * F | F            T is Term
F ⇾ a | ( E )            F is Factor
When we represent grammars related to programming languages, it is assumed that the symbols are taken from a tokenized stream. Thus "x * ( 3 + z)" and "x*(3+z)", though different as strings, are the same expression. These rules dictate two major features of arithmetic expressions:
  1. The +, - and * operations group their operands in a left-associative manner, i.e., x - y - z is effectively (x - y) - z. The left associativity disambiguates a sequence of equal precedence operators from a grammar point of view.
  2. The * operator has higher precedence than + and -, i.e. x + y * z is effectively x + (y * z).
Of course + and * are associative (under normal circumstances) and so it should not matter so much about expressing left associativity. But arithmetic associativity is a semantic issue, not a syntactic one. In fact, + need not actually be associative nor commutative in program expressions with side effects such as this one:
$x + (++$x) + (++$x)
Consider an example of two different derivations of the expression x + y * z:
E ⇒ E + T
  ⇒ T + T
  ⇒ F + T
  ⇒ x + T
  ⇒ x + T * F
  ⇒ x + F * F
  ⇒ x + y * F
  ⇒ x + y * z
E ⇒ E + T
  ⇒ E + T * F
  ⇒ E + F * F
  ⇒ T + F * F
  ⇒ T + F * z
  ⇒ F + F * z
  ⇒ F + y * z
  ⇒ x + y * z
What's the difference? The first one is a leftmost derivation whereas, the second one has no particular replacement strategy.

Ambiguous Arithmetic Expression Grammar

Compare the above expression grammar to a simpler one:
E ⇾ E + E | E - E | E * E | a | (E)
Why not use this simpler one? It generates the expression language, but it is ambiguous. The following is a proof of ambiguity once we know more about what that means:
E ⇒ E + E       E is a sum of expressions
  ⇒ x + E
  ⇒ x + E * E   the second is a product
  ⇒ x + y * E
  ⇒ x + y * z
E ⇒ E * E        E is a product of expressions
  ⇒ E + E * E    the first is a sum
  ⇒ x + E * E
  ⇒ x + y * E
  ⇒ x + y * z

Balanced Parentheses

The grammar of balanced parentheses can be obtained by taking the ambiguous expression grammar and remove the +, -, * and a by setting them to ε. What's left boils down to this:
S ⇾ S S | (S) | ε
A sample derivation is this:
S ⇒ ( S ) ⇒ ( S S ) ⇒ ( ( S ) S ) ⇒ ( ( S ) ( S ) ) ⇒ ( ( ) ( S ) ) ⇒ ( ( ) ( ) )

Regular Expression Grammars

See problem 3.1.4. Regular expressions over Σ are actually defined by a CFG. We use additional terminals {(,),*,∪,∅}. According to the textbook's notion of fully parenthesized regular expressions as defined in section 1.8, the associated grammar is this:
R ⇾ (R∪R) | (RR) | R* | ∅ | σ  (for σ ∈ Σ)
The grammar is:
R ⇾ R∪B | B
B ⇾ BQ | Q
Q ⇾ A* | A
A ⇾ (R) | ∅ | σ  (for σ ∈ Σ)
Recall that B is Branch, Q is Quantified atom and A is Atom. Notice the similarity with the arithmetic expression grammars in how operator precedence is expressed. In this case there are three operators: , juxtaposition, *.

Keep in mind the relationship between regular expressions and regular languages. Every regular expression is effectively a regular language, but the language of all regular expressions is context-free language and not regular

Ambiguous if-else grammar

S ⇾ if (E) S | if (E) S else S
S ⇾ other
E ⇾ expr
Find two left-most derivations for:
if (expr) if (expr) other else other

#'s of a's and b's

More a's:
U ⇾ TaU | TaT
T ⇾ aTbT | bTaT | ε
More b's:
V ⇾ TbV | TbT
T ⇾ aTbT | bTaT | ε
Unequal:
S ⇾ U | V
U ⇾ TaU | TaT
V ⇾ TbV | TbT
T ⇾ aTbT | bTaT | ε

LISP-like S-expression grammar

S ⇾ a | ( L )
L ⇾ S L | ε
Example: derive ( ( a a ) ( a a ) () )

What makes expression languages not regular

Take the ambiguous grammar G for arithmetic expressions:
E ⇾ E+E | E-E | E*E | a | (E)
What about L(G) makes it non-regular? It's the parentheses. Take the subset:
L(G) ∩ (*a)*   =  { (na)n : n ≥ 0 }
Without the rule E ⇾ (E), L(G) would be regular:
(a(+*-))*a

Regular Languages are Context Free

S ⇾ aS | ε
This theorem gives us our first result in the Chomsky Hierarchy that Regular Languages correspond to certain types of Context Free Grammars.

Proof: Take an automaton (NFA or DFA) M = (K,Σ,δ,s,F) for the Language.

The grammar, G, is constructed as follows: Claim: L(M) = L(G).

In general prove by induction the relationship between yields and derives:
(p,w) |-* (q,ε)  ⇔  p ⇒* wq


© Robert M. Kline