## Context-free Languages

### Context-Free Grammar

A context-free grammar is G = (V,Σ,R,S), where
• V = an alphabet containing all grammar symbols
• Σ = a subset of V, called the terminal symbols
• R ⊆ (V-Σ)×V* is the set of substitution rules, or productions
• S ∈ V-Σ is the start symbol
The set V-Σ is the set of non-terminal symbols.

It is standard to write "A ⇾ α" when (A,α) ∈ R.

#### Notational Conventions

It is common in practice to express a grammar simply in terms of its productions as follows:
• The full symbol set V consists of only symbols which appear in the grammar.
• All uppercase letters are non-terminals and everything else is a terminal symbol.
• The start symbol is the non-terminal on the left hand side of the first production rule listed.
• Rules with common left hand sides are combined with right-hand sides separated by "|"
As an example consider the grammar:
```S ⇾ aSb | ε
```
The implication is that V = {S,a,b}, Σ = {a,b}, the start symbol is S, and the rules are:
```S ⇾ aSb
S ⇾ ε
```
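As a side note, such a grammar might be encoded as plain Python data. This is a minimal sketch using our own conventions (not from the text): nonterminals are uppercase letters and "" stands in for ε.

```python
# R: maps each nonterminal to the list of its right-hand sides
grammar = {"S": ["aSb", ""]}

start = "S"                    # S, the start symbol
terminals = {"a", "b"}         # Σ
nonterminals = set(grammar)    # V - Σ
V = terminals | nonterminals   # the full symbol set
print(sorted(V))               # ['S', 'a', 'b']
```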

#### Derivation

A derivation is a sequence of steps which begins with the start symbol, uses the rules to do replacements, and ends with a terminal string (from which no further steps can be taken).

Formally, one-step derivation, ⇒, is a binary relation on V*. When (α,β) belongs to this relation, we write α ⇒ β. The definition is:
```α ⇒ β  if and only if  α = α1Aα2 and β = α1γα2 where A ⇾ γ is a rule.
```
The derives relation, ⇒*, is the usual reflexive, transitive closure of ⇒.

We write L(G) for the language generated by grammar G as { w ∈ Σ*: S ⇒* w }.

The intermediate forms appearing in derivations are strings in V* which may contain both terminals and non-terminals. These are called sentential forms.

A simple derivation using the above grammar is:
```S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaabbb = a^3b^3
```
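The replacement mechanics can be sketched in a few lines of Python. This is our own sketch, using the convention that nonterminals are uppercase letters and "" stands for ε:

```python
def replace_leftmost(form, nonterminal, rhs):
    """One derivation step: rewrite the leftmost occurrence of nonterminal."""
    i = form.index(nonterminal)
    return form[:i] + rhs + form[i + 1:]

rules = {"S": ["aSb", ""]}

# S => aSb => aaSbb => aaaSbbb => aaabbb
form = "S"
for _ in range(3):
    form = replace_leftmost(form, "S", "aSb")
form = replace_leftmost(form, "S", "")
print(form)   # aaabbb
```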

#### Leftmost and Rightmost Derivations

Two particularly important derivations are these:
• leftmost: always replace the leftmost non-terminal
• rightmost: always replace the rightmost non-terminal
These two derivations are important because any parsing program (i.e., one which attempts to construct a derivation) would construct one of these two. A grammar is said to be ambiguous (a notion we'll define more precisely in the next section) if some string has more than one leftmost derivation, or more than one rightmost derivation.

It seems intuitively obvious that if a derivation exists, these two forms should also exist, but technically one needs to make a sounder argument, as is done in the next section of the textbook.

### Equal a's and b's

The language of equal a's and b's and its variants are our main examples of non-regular languages. As a simple example, we'll prove that the language generated by the grammar G
```S ⇾ aSb | ε
```
is L = { a^nb^n : n ≥ 0 }.

We have to show two inclusions:
1. L ⊆ L(G): Everything in the language can be generated by the grammar
2. L(G) ⊆ L: Everything generated by the grammar is in the language.
Proofs:
1. Everything in the language is generated by a derivation in the grammar from S.

This is usually proved by induction on the length of strings in the language. In this case, take w = a^nb^n and prove by induction that S ⇒* w. For n = 0, obvious. Otherwise, assume true for n, i.e.
```S ⇒* a^nb^n
```
Then
```S ⇒ aSb ⇒* a a^nb^n b = a^(n+1)b^(n+1)
```
and thus S ⇒* a^(n+1)b^(n+1).
2. Everything derived by the grammar from S is in the language.

Usually, for these kinds of proofs, you have to say something about all the sentential forms derivable from all (useful) symbols. Fortunately, there's only one symbol in this simple grammar.

Claim:
```If S ⇒* α then either
α = a^ib^i, some i ≥ 0  or  α = a^iSb^i, some i ≥ 0
```
Proof by induction on the length of the derivation. For length 0, α = S = a^0Sb^0, so the claim holds. Now take a derivation of length n+1, n ≥ 0:
```S ⇒^n α ⇒ β
```
By induction, and in order to be able to take one more step, α = a^iSb^i for some i ≥ 0. Thus for either of the replacements S ⇾ aSb or S ⇾ ε, β has the required form.
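Both inclusions can also be checked mechanically for small lengths. Below is a brute-force sketch (our own helper, not from the text) that enumerates every terminal string derivable from the start symbol with at most max_len symbols and compares against { a^nb^n }:

```python
from collections import deque

def terminal_strings(rules, start, max_len):
    """All terminal strings of length <= max_len derivable from start.
    Convention: nonterminals are the uppercase letters."""
    seen, out = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        i = next((j for j, c in enumerate(form) if c.isupper()), None)
        if i is None:                 # no nonterminal left: a terminal string
            out.add(form)
            continue
        for rhs in rules[form[i]]:    # expand the leftmost nonterminal
            new = form[:i] + rhs + form[i + 1:]
            # prune forms whose terminal content already exceeds the bound
            if new not in seen and sum(c.islower() for c in new) <= max_len:
                seen.add(new)
                queue.append(new)
    return out

G = {"S": ["aSb", ""]}
expected = {"a" * n + "b" * n for n in range(4)}   # a^nb^n for n = 0..3
print(terminal_strings(G, "S", 6) == expected)     # True
```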

#### Equal a's and b's in general

More difficult is generating the language of strings with equal number of a's and b's in any arrangement
```L = { w ∈ {a,b}* : a(w) = b(w) }
```
We're using the notation:
```a(w) = number of a's in a string w
b(w) = number of b's in a string w
```
This language can be generated by the following grammar:
```S ⇾ aSbS | bSaS | ε
```
Of the two inclusion proofs, the second (L(G) ⊆ L) is simpler. We simply argue by induction on the length of the derivation that:
If S ⇒* α, then a(α) = b(α) (ignoring interspersed S's, if any); and so L(G) ⊆ L.
The harder part is showing that any string in L can be derived by this grammar. Consider the derivation of the string aababb:
```S ⇒ aSbS ⇒ aSb ⇒ aaSbSb ⇒ aabSb ⇒ aabaSbSb ⇒^2 aababb
```
The proof by induction on the string length is left as an exercise.
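The key idea of that induction, splitting w at the first point where the running count of a's minus b's returns to zero, can be sketched as a membership test (in_L is our own helper name, not from the text):

```python
def in_L(w):
    """Membership test for L = { w : a(w) = b(w) }, mirroring the grammar
    S -> aSbS | bSaS | ε: split w at the first point where the running
    difference #a - #b returns to zero."""
    if w == "":
        return True                      # S -> ε
    bal = 0
    for k, c in enumerate(w, 1):
        if c not in "ab":
            return False
        bal += 1 if c == "a" else -1
        if bal == 0:
            # w = w[0] u w[k-1] v with u and v balanced, which matches
            # S -> aSbS (if w starts with a) or S -> bSaS (if with b)
            return in_L(w[1:k - 1]) and in_L(w[k:])
    return False                         # the counts never balance
```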

It is interesting to compare this grammar to the grammar from problem 3.1.5 which also generates the same language.

### More Examples

Arithmetic expressions. The terminals are {+,*,(,),a}. The symbol a represents a token that we identify by other means. It could be a number, variable, etc.
```E ⇾ E + T | E - T | T    E is Expression
T ⇾ T * F | F            T is Term
F ⇾ a | ( E )            F is Factor
```
When we represent grammars related to programming languages, it is assumed that the symbols are taken from a tokenized stream. Thus "x * ( 3 + z)" and "x*(3+z)", though different as strings, are the same expression. These rules dictate two major features of arithmetic expressions:
1. The +, - and * operations group their operands in a left-associative manner, i.e., x - y - z is effectively (x - y) - z. The left associativity disambiguates a sequence of equal precedence operators from a grammar point of view.
2. The * operator has higher precedence than + and -, i.e. x + y * z is effectively x + (y * z).
Of course + and * are associative (under normal circumstances), so expressing left associativity might seem unimportant. But arithmetic associativity is a semantic issue, not a syntactic one. In fact, + need not actually be associative nor commutative in program expressions with side effects such as this one:
```$x + (++$x) + (++$x)
```
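The same point can be made in Python (our own analogue, not the original expression's language), with the walrus operator := playing the role of ++. Swapping the operands changes the value, so + is not commutative once side effects enter:

```python
x = 1
a = x + (x := x + 1)   # left operand read first: 1 + 2
x = 1
b = (x := x + 1) + x   # increment happens first:  2 + 2
print(a, b)            # 3 4
```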
Consider an example of two different derivations of the expression x + y * z:
```E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ x + T ⇒ x + T * F
  ⇒ x + F * F ⇒ x + y * F ⇒ x + y * z
```
```E ⇒ E + T ⇒ E + T * F ⇒ E + F * F ⇒ T + F * F
  ⇒ T + F * z ⇒ F + F * z ⇒ F + y * z ⇒ x + y * z
```
What's the difference? The first one is a leftmost derivation, whereas the second follows no particular replacement strategy.
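For comparison with these hand derivations, here is a minimal sketch (our own, with integer tokens standing in for the generic token a) of a recursive-descent evaluator for this grammar. The left-recursive rules become loops, which preserves left associativity:

```python
import re

def tokenize(s):
    """Split an expression into tokens; numbers stand in for the token a."""
    return re.findall(r"\d+|[-+*()]", s)

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expr():                       # E -> E + T | E - T | T, as a loop
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = tokens[pos]
            pos += 1
            value = value + term() if op == "+" else value - term()
        return value

    def term():                       # T -> T * F | F, as a loop
        nonlocal pos
        value = factor()
        while peek() == "*":
            pos += 1
            value *= factor()
        return value

    def factor():                     # F -> a | ( E )
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expr()
            pos += 1                  # consume ')'
            return value
        return int(tok)

    return expr()

print(parse(tokenize("1 + 2 * 3")))   # 7  (* binds tighter than +)
print(parse(tokenize("10 - 3 - 4")))  # 3  (left associative)
```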

#### Ambiguous Arithmetic Expression Grammar

Compare the above expression grammar to a simpler one:
```E ⇾ E + E | E - E | E * E | a | (E)
```
Why not use this simpler one? It generates the same expression language, but it is ambiguous. The following two derivations constitute a proof of ambiguity, once we know more precisely what that means:
```E ⇒ E + E        E is a sum of expressions
  ⇒ x + E
  ⇒ x + E * E      the second operand is a product
  ⇒ x + y * E
  ⇒ x + y * z
```
```E ⇒ E * E        E is a product of expressions
  ⇒ E + E * E      the first factor is a sum
  ⇒ x + E * E
  ⇒ x + y * E
  ⇒ x + y * z
```

#### Balanced Parentheses

The grammar of balanced parentheses can be obtained by taking the ambiguous expression grammar and removing +, -, * and a by setting them to ε. What's left boils down to this:
```S ⇾ S S | (S) | ε
```
A sample derivation is this:
```S ⇒ ( S )
⇒ ( S S )
⇒ ( ( S ) S )
⇒ ( ( S ) ( S ) )
⇒ ( ( ) ( S ) )
⇒ ( ( ) ( ) )
```
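Membership in this language has the familiar depth-counter characterization, which a short sketch (our own helper name) makes concrete:

```python
def balanced(w):
    """Membership test for the language of S -> SS | (S) | ε:
    no prefix may close more parentheses than it opens, and the
    totals must match at the end."""
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

print(balanced("(()())"))   # True  (the string derived above)
print(balanced(")("))       # False
```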

#### Regular Expression Grammars

See problem 3.1.4. Regular expressions over Σ are actually defined by a CFG. We use additional terminals {(,),*,∪,∅}. According to the textbook's notion of fully parenthesized regular expressions as defined in section 1.8, the associated grammar is this:
```R ⇾ (R∪R) | (RR) | R* | ∅ | σ  (for σ ∈ Σ)
```
The actual grammar which generates our "usual" regular expressions needs to incorporate operator precedence:
```R ⇾ R∪B | B
B ⇾ BQ | Q
Q ⇾ A* | A
A ⇾ (R) | ∅ | σ  (for σ ∈ Σ)
```
Recall that B is Branch, Q is Quantified atom and A is Atom.

Notice the similarity with the arithmetic expression grammars in how operator precedence is expressed. In this case there are three operators at different precedence levels: ∪, juxtaposition, and *.

Here is a derivation of (a∪b)*ba*. The three levels of operator precedence tend to lengthen these derivations.
```R ⇒ B
⇒ BQ
⇒ BQQ
⇒ QQQ
⇒ A*QQ
⇒ (R)*QQ
⇒ (R∪B)*QQ
⇒ (B∪B)*QQ
⇒ (Q∪B)*QQ
⇒ (A∪B)*QQ
⇒ (a∪B)*QQ
⇒ (a∪Q)*QQ
⇒ (a∪A)*QQ
⇒ (a∪b)*QQ
⇒ (a∪b)*AQ
⇒ (a∪b)*bQ
⇒ (a∪b)*bA*
⇒ (a∪b)*ba*
```

Keep in mind the relationship between regular expressions and regular languages. Every regular expression denotes a regular language, but the language of all regular expressions is itself context-free and not regular.

#### Ambiguous if-else grammar

```S ⇾ if (E) S | if (E) S else S
S ⇾ other
E ⇾ expr
```
Find two leftmost derivations for:
```if (expr) if (expr) other else other
```

#### Unequal #'s of a's and b's

More a's:
```U ⇾ TaU | TaT
T ⇾ aTbT | bTaT | ε
```
More b's:
```V ⇾ TbV | TbT
T ⇾ aTbT | bTaT | ε
```
Unequal:
```S ⇾ U | V
U ⇾ TaU | TaT
V ⇾ TbV | TbT
T ⇾ aTbT | bTaT | ε
```

#### LISP-like S-expression grammar

```S ⇾ a | ( L )
L ⇾ S L | ε
```
Example: derive ( ( a a ) ( a a ) () )

#### What makes expression languages not regular

Take the ambiguous grammar G for arithmetic expressions:
```E ⇾ E+E | E-E | E*E | a | (E)
```
What about L(G) makes it non-regular? It's the parentheses. Take the subset:
```L(G) ∩ (*a)*   =  { (^n a )^n : n ≥ 0 }
```
Without the rule E ⇾ (E), L(G) would be regular:
```(a(+∪*∪-))*a
```

### Regular Languages are Context Free

Theorem: every regular language is context-free. As a first example, the regular language a* is generated by:
```S ⇾ aS | ε
```
This theorem gives us our first result in the Chomsky Hierarchy: the regular languages correspond to a restricted class of context-free grammars.

Proof: Take an automaton (NFA or DFA) M = (K,Σ,δ,s,F) for the language.

The grammar, G, is constructed as follows:
• non-terminals = K
• start symbol = s
• productions:
```p ⇾ σq    if q ∈ δ(p,σ)
f ⇾ ε     if f ∈ F
```
Claim: L(M) = L(G).

In general prove by induction the relationship between yields and derives:
```(p,w) ⊢* (q,ε)  ⇔  p ⇒* wq
```
The grammar constructed from an NFA is called right regular in that a non-terminal may appear only as the rightmost symbol of a right-hand side, as in
```A ⇾ aB
```
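The construction above can be sketched in Python (the function and state names here are our own):

```python
# States become nonterminals, each transition p --σ--> q becomes p -> σq,
# and each final state f gets f -> ε (written ""). States are named with
# uppercase letters so they read as nonterminals.
def nfa_to_grammar(delta, finals):
    """delta maps (state, symbol) to the set of successor states."""
    rules = {}
    for (p, sym), qs in delta.items():
        for q in qs:
            rules.setdefault(p, []).append(sym + q)
    for f in finals:
        rules.setdefault(f, []).append("")
    return rules

# DFA for a*b*: start state P, both states final.
delta = {("P", "a"): {"P"}, ("P", "b"): {"Q"}, ("Q", "b"): {"Q"}}
rules = nfa_to_grammar(delta, {"P", "Q"})
print(rules)   # e.g. {'P': ['aP', 'bQ', ''], 'Q': ['bQ', '']}
```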

#### Right Regular

The general definition of a right-regular grammar is more liberal: all rules are of the form:
```A ⇾ wB
A ⇾ w
```
for w ∈ Σ*. One can easily argue that a right-regular grammar maps to an NFA: let the non-terminals be states and add intermediate states for the symbols within the strings. For example, with the rule
```A ⇾ abB
```
use an intermediate state S1 and create the NFA portion:
```  a    b
A ⇾ S1 ⇾ B
```
For the rule
```A ⇾ baa
```
use intermediate states T1, T2 and a final state T3:
```  b    a    a
A ⇾ T1 ⇾ T2 ⇾ T3
```

#### Left Regular

A grammar is left regular if all the rules are of the form:
```A ⇾ Bw
A ⇾ w
```
for w ∈ Σ*. It can rather easily be argued that left-regular grammars generate regular languages. If G is any grammar, let G^R be the grammar in which every right-hand side is reversed, i.e. if A ⇾ w is in R, then A ⇾ w^R is in G^R. Then argue that
```L(G^R) = L(G)^R
```
If G is left-regular, then G^R is right-regular, hence L(G^R) is a regular language, meaning that
```L(G) = (L(G^R))^R
```
must be regular, since regular languages are closed under reversal.

#### Mixing left- and right- regular rules

If you have some of both kinds of rules, the language need not remain regular. Consider this grammar:
```A ⇾ aB
B ⇾ Cb
C ⇾ A | ε
```
Consider this derivation:
```A ⇒ aB ⇒ aCb ⇒ aAb
```
You can convince yourself that the language of this grammar is our favorite { a^nb^n : n ≥ 1 }.
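To support that claim, here is a brute-force enumeration sketch (our own helper, not from the text) of the terminal strings derivable from A with a bounded number of terminals:

```python
def language(rules, start, max_len):
    """All terminal strings with at most max_len terminal symbols derivable
    from start. Convention: nonterminals are the uppercase letters."""
    seen, out, todo = {start}, set(), [start]
    while todo:
        form = todo.pop()
        i = next((j for j, c in enumerate(form) if c.isupper()), None)
        if i is None:                 # no nonterminal left
            out.add(form)
            continue
        for rhs in rules[form[i]]:    # expand the leftmost nonterminal
            new = form[:i] + rhs + form[i + 1:]
            if new not in seen and sum(c.islower() for c in new) <= max_len:
                seen.add(new)
                todo.append(new)
    return out

rules = {"A": ["aB"], "B": ["Cb"], "C": ["A", ""]}
print(sorted(language(rules, "A", 6)))   # ['aaabbb', 'aabb', 'ab']
```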