Grammar Constructions

This section jumps ahead a bit. We're discussing results from sections 3.5 and 3.6. In particular, the textbook's definition of Chomsky Normal Form in 3.6 is different from what we present here.

Basic Context-free Language closure properties

CFL's satisfy the following closure properties:
  1. If L1 and L2 are CFLs then so is L1 ∪ L2.
  2. If L1 and L2 are CFLs then so is L1 · L2.
  3. If L is a CFL then so is the Kleene-star: L*.
Proof. For union and concatenation, take G1=(V11,R1,S1) and G2=(V22,R2,S2).

First of all, and very importantly, rename the terminals so that the two grammars have no non-terminals in common, i.e.,
(V1 - Σ1) ∩ (V2 - Σ2) = ∅
Choose a "new" S, i.e., S ∉ V1 ∪ V2.

For union and concatenation, the new grammar is
V′ = V1 ∪ V2 ∪ {S}
For union,
R′ = R1 ∪ R2 ∪ {S → S1, S → S2}
For concatenation:
R′ = R1 ∪ R2 ∪ {S → S1S2}

For Kleene*, take any grammar: G=(V,Σ,R,S). Choose a "new" start symbol S′ ∉ V.

The grammar for L(G)* is (V′,Σ,R′,S′) with
V′ = V ∪ {S′}
R′ = R ∪ {S′ ⇾ S′S, S ⇾ ε}

Intersection closure properties

In a subsequent section we will see a pumping lemma equivalent for CFLs. It turns out that this language is not context-free:
{ anbncn : n ≥ 0 }
However we can construct it as the intersection of two context-free languages. Thus CFLs are not closed under intersection. Consequently they cannot be closed under complementation because if they were, then using DeMorgan's Laws plus closure under union, we would have closure under intersection.

There is a partial result which is true: the intersection of a CFL and a regular language is a CFL. In order to prove this we need the alternative machine model for CFLs called PDAs. That is coming up next.

Language Algorithms

Language algorithms are usually framed as so-called decision problems: given a language or languages or Let's consider the situation for regular languages. Suppose we have regular languages L, L1 and L2 over Σ. All of the following questions are decidable in the sense that there is an algorithmic way to obtain the answer:
  1. is ε ∈ L?
  2. is w ∈ L, for a given w ∈ Σ*?
  3. is L = ∅?
  4. is L = Σ*?
  5. is L1 ⊆ L2?
  6. is L1 = L2?
  7. is L1 ∩ L2 = ∅?
All of these can be answered using the DFA model for regular languages. An important point is that there are DFA constructions for every conceivable way of combining regular languages. For example, if we know how to decide if L = ∅, then, using the complement construction, we know of a way to decide if L = Σ*.

When we move to context-free languages L (with underlying grammars) L1 and L2, it turns out that only the first 3 of these questions are decidable:
  1. is ε ∈ L?
  2. is w ∈ L, for a given w ∈ Σ*?
  3. is L = ∅?
To say that the question L = Σ* is undecidable means that there is no algorithm which can take an arbitrary grammar and say yes or no to this question.

CFLs not closed under complementation

It is important to understand what makes it work for regular languages, but not for context-free languages. The crux of the problem is that regular languages are closed under complementation but context-free languages are not.

Eliminate useless symbols

A symbol X (terminal or non-terminal) is useful if it appears in the derivation from the start symbol to a terminal string, i.e.,
S ⇒* αXβ ⇒* w ∈ Σ*
A symbol which is not useful is useless. We can ignore it and all the rules it appears in. Useful means:
  1. connected from the start
  2. connected to the end
However, satisfying each separately does not necessarily make a symbol useful. For example, consider the grammar:
S → aAB | a
B → Bb
C → b
A → a
A is useless even though S ⇒* aAB and A ⇒* a. In fact, the symbols A, B, C and b are all useless; then only useful rule is:
S → a

Step 1: eliminate non-terminals which don't derive anything

We start by algorithmically computing the "end half" of usefulness as follows:
OLD = ∅
NEW = { A : A → α ∈ Σ* }
while (OLD != NEW) {
  OLD = NEW;
  NEW = OLD ∪ { A : A → α ∈ (Σ ∪ OLD)* }
When this terminates, discard all rules which do use non-terminals not in NEW.

For example, consider the grammar ({S,A,B,C,a,b,c},{a,b,c},R,S) with R:
S → Aa | aBb
A → bAb
B → bBb | A | a
C → a | c
B and C would be added initially. The following step S is added because S → aBb. At this point we would eliminate A, getting the reduced grammar ({S,B,C,a,b,c},{a,b,c},R1,S) with R1:
S → aBb
B → bBb | a
C → a | c

Algorithmic determination whether L(G) is not empty

We can algorithmically determine whether L(G) ≠ ∅ as follows:
L(G) ≠ ∅ if and only if S is in the remaining set of non-terminals computed by step 1.

Step 2: eliminate symbols which are not reachable from S

After applying step 1, we have potentially reduced the non-terminals and rule set. If we discover that L(G) = ∅, we do not proceed. Otherwise, we run a similar algorithm, this one going from the start symbol "down":
OLD = ∅
NEW = { S }
while (OLD != NEW) {
  OLD = NEW;
  NEW = OLD ∪ { X ∈ V : A → αXβ, A ∈ OLD }
At this point discard all rules with symbols not in NEW.

Here we are trying to compute all symbols, terminals or non-terminals, which appear in a derivation from the start symbol. This leaves us with the final grammar with all useful symbols and useful rules.

In our example, the initialization sets NEW = {S}, the next step we get NEW = {S,a,B,b}. The next step using B gives nothing new and so we have the final grammar ({S,B,a,b},{a,b},R1,S) with R2:
S → aBb
B → bBb | a

Eliminate epsilon productions

Given G=(V,Σ,R,S), assume it is modified so that all symbols are useful. The idea is as follows:
  1. decide whether ε ∈ L(G)
  2. remove all rules of the form A → ε
  3. if ε ∈ L(G), create a new start symbol S′ and add: S′ → S | ε
The point is that if ε ∈ L(G), then we have a production S′ → ε, but S′ never appears on the right-hand side of a production. So this is what we really mean:

Theorem: Every grammar G can be transformed into an equivalent one in which the only ε-production (if necessary) is of the form S → ε where the start symbol never appears on the right-hand side of a production.

Erasable (nullable) non-terminals

The proof is based on the following computation of the set of erasable, or nullable non-terminals:
Ε = { A ∈ V - Σ : A ⇒* ε }
The computation is this:
OLD = ∅
NEW = { A : A → ε }
while ( NEW ≠ OLD ) {
  OLD = NEW;
  NEW = OLD ∪ { A : A → α ∈ OLD* }
Ε = NEW;
For example, with this grammar (with all useful symbols):
C → AB | c
Q → Qa | b
B → ε | b | abDQ
A → ε | a
A and B would be added on first pass, C added the next pass and D the third pass.

Algorithmic determination whether ε ∈ L(G)

We can algorithmically determine whether ε ∈ L(G) by computing the set, Ε of erasable non-terminals and checking if S ∈ Ε. Implicit in the computation of erasable symbols is the assumption that all symbols are useful, otherwise we could have A → ε for a useless rule and have incorrectly judged that ε ∈ L(G).

Grammar modification

With the above grammar the idea is to eliminate these two rules:
A → ε
B → ε
How should we compensate? For each rule A → α, add new rules gotten by omitting one or more occurrence of the erasable symbols.

Back to Example

In this example, with erasable symbols A, B, C, D, this would mean adding the rules:
B → abQ
C → A | B
D → AB | AC | BC | A | B | C
In our example, ε ∈ L(G), and so we add
S → D | ε
Our final grammar is this:
S → D | ε
D → ABC | AB | AC | BC | A | B | C
Q → Qa | b
C → AB | c | A | B
B → b | abDQ | abQ
A → a

The anbn language

With the grammar
S ⇾ aSb | ε
The start symbol is nullable. For the rule S ⇾ aSb, make the replacement of S by ε to get the added rule S ⇾ ab. The final grammar is:
S′ ⇾ S | ε
S ⇾ aSb | ab

Balanced Parentheses

Let's consider the balanced parentheses grammar:
S → SS | (S) | ε
The only symbol S is nullable and so the grammar would transform into the following:
S′ → S | ε
S  → SS | (S) | S | ()
Note that the rule S → S cannot possibly add anything to L(G), and so we omit this, getting:
S′ → S | ε
S  → SS | (S) | ()

Eliminate unit productions

A unit production is one of the form
A → B
Once we have eliminated ε-productions, we can guarantee that:
A ⇒* B   if and only if   the derivation consists entirely of unit productions
because otherwise, another symbol appearing in the derivation could never disappear. So we can readily determine by graphical means whether A ⇒* B.

Grammar Modifications

Assume that we have already executed the previous two steps: Then complete the process as follows:
  1. Remove all the unit productions.
  2. For each remaining (non-unit) production B → α,   add A → α   if A ⇒* B.
  3. Again, identify and remove useless symbols (step 2 only), because the removal of the unit productions may have caused some non-terminals to be unreachable from S.
For example, if we have:
S → A | Bb
A → C | a
B → aBa | b
C → aSa 
then the unit productions are:
S → A
A → C
and we obviously also have S ⇒* C, and so, when unit productions are eliminated, we get:
S → Bb | a | aSa
A → a | aSa
B → aBa | b
C → aSa 
Now both A and C have become useless because they are unreachable from S, and our final modified grammar becomes:
S → Bb | a | aSa
B → aBa | b

Balanced Parentheses

For the balanced parentheses, we would technically have to replace the S′ → S production from the ε-removed grammar, getting the grammar:
S′ → SS | (S) | () | ε
S  → SS | (S) | ()

Chomsky Normal Form

According to the textbook, a grammar is in Chomsky Normal Form if the RHS of each production has length exactly 2. Not every textbook is quite this stringent. We'll use a more relaxed version of CNF.

A grammar is in CNF if all productions are one of two forms: To give a complete formulation for any Context Free Language, L, we need to say one other thing: We want to show:
Theorem: Every grammar has an equivalent grammar in CNF.
Apply all steps up to this point in sequence:
  1. eliminate useless symbols (and associated productions)
  2. eliminate ε-productions, thereby possibly adding some new productions
  3. eliminate unit-productions and possibly adding other productions
  4. eliminate useless symbols (second time)
If ε ∈ L(G), there is the one ε-production which we have discussed, namely S → ε where S does not appear on the r.h.s. of a derivation. For sake of discussion, we can ignore this one production if it exists.

Therefore, the only productions whose right-hand sides have length less than 2 are the terminal productions, i.e., those of the form:
B → σ ∈ Σ

Make productions which are not terminal productions consist only non-terminals

If a production has length ≥ 1, and contains one or more terminal symbols then we can easily replace this production by one with only non-terminals. Simply create a new non-terminal T, create a new production
T → σ
and replace each occurrence of σ in the right-hand side of any production by T.

Using our running example, which at this stage is:
S → a | Bb | aSa 
A → a | aSa
B → b | aBa
We make these further modifications:
S → a | BY | XSX 
A → a | XSX
B → b | XBX
X → a
Y → b

Make the right-hand-sides have length exactly 2

The final step is to normalize the right hand side of each production which is not a terminal production to have length exactly two. This is very straightforward, simply introduce enough new non-terminals to effect a chaining. For example
A → XZ1
Z1 → SX
The last two steps (especially the last one) are completely artificial. The goal is simply to achieve the normal form result. ■

Example: Convert expression grammar to CNF

Given a simplified version of our expression grammar:
E → E + T | T
T → T * F | F
F → (E) | a
All symbols are useful and there are no ε-productions. So start by eliminating unit productions. In particular eliminate
E → T,  T → F
The additions are based on the facts that:
E ⇒* T,  T ⇒* F,  E ⇒* F
and so we get
E → E + T | T * F | (E) | a
T → T * F | (E) | a
F → (E) | a
Now comes the artificial part. Remove the terminals (, ), *, + using these replacements:
P → +
M → *
L → (
R → )
and getting:
E → EPT | TMF | LER | a
T → TMF | LER | a
F → LER | a
Now to make the right-hand sides have length 2, introduce these replacements:
E → EX,  X → PT
E → TY,  Y → MF
E → LZ,  Z → ER
T → TY
T → LZ
F → LZ
Our final CNF form is this:
E → EX | TY | LZ | a
T → TY | LZ | a
F → LZ | a
X → PT
Y → MF
Z → ER
P → +
M → *
L → (
R → )

The Cocke-Younger-Kasami (CYK) algorithm

With CNF, you can predict how many steps it will take to derive a string based on its length:
length 0: derivable from start symbol only
length 1: 1 step from S
length 2: 1 + 2 = 3 steps, e.g. S ⇒ AB ⇒ aB ⇒ ab
length 3: 2 + 3 = 5 steps: S ⇒ AB ⇒ ABC ⇒3 abc
length n: (n-1) + n = 2n - 1 steps.
Given a grammar G and a string w, we can imagine an algorithm to determine whether w ∈ L(G). We do so by successively deriving all the strings of length |w| in a systematic manner and then see if w is one of those. However, we can do better than this.

The CYK algorithm takes a grammar in Chomsky Normal Form (CNF) and computes the set of all symbols in the grammar which can derive a given string w. After execution of the algorithm we know that w ∈ L if and only if the start symbol S is in this set. This algorithm was the first to provide a polynomial worst-case time, O(n3), for determining whether a given string w ∈ L(G) for a grammar G in CNF.

The idea is to take w = σ1...σn and compute all non-terminals which derive all substrings of w:
wi,j = σi...σi+j-1 = the substring of length j starting at position i
The CYK algorithm computes the following sets of non-terminals:
Ni,j = { A : A ⇒* wi,j }
In particular, w1,n = w, and so N1,n = { A : A ⇒* w }

The algorithm is as follows:
for (i = 1..n)
  Ni,1 = { A : A → σi }
for (j = 2..n)          // j = substring length
  for (i = 1..n-j+1)    // i = starting position
     for (k = 1..j-1)   // k = substring split length
        if A → BC && B ∈ Ni,k && C ∈ Ni+k,j-k
           Ni,j = Ni,j ∪ { A }
Here is an example. Consider the grammar:
S → AB | BC
A → BA | a
B → CC | b
C → AB | a
We want to compute the non-terminals which derive the string w = baaba, and so, as suggested, compute all non-terminals which derive all substrings of w starting from length 1 up to |w|.

The following table illustrates that the non-terminals S,A,C all derive baaba. In particular, baaba ∈ L(G).

position ⇒
5 S,A,C
4 S,A,C
3 BB
length ⇑baaba
Hover the mouse over the cells to see how they are computed!

A cell in this table is identified by (horizontal,vertical) coordinates. The non-terminals at cell (i,j) are those that derive the substring starting at position i of length j. Here are some examples of the computation:
  1. The {S,A} in cell (1,2) is the result of concatenating non-terminals from
    positions (1,1) and (2,1):  BA, BC
    Match these against the right-hand sides of productions and add left-hand sides when a match occurs, getting {S,A}.
  2. The {B} in cell (3,3) is the result of concatenating non-terminals from
    positions (3,1) and (4,2):  AS, AA, CS, CA
    positions (3,2) and (5,1):  SA, SC, CA, CC
    Match these against the right-hand sides of productions and add left-hand sides when a match occurs. The only right-hand side matches is CC which effects the inclusion of left-hand side {B}.
  3. The {S,A,C} in cell (2,4) is the result of concatenating non-terminals from
    positions (2,1) and (3,3):  AB, AC
    positions (2,2) and (4,2):  BS, BA
    positions (2,3) and (5,1):  BA, BC
    Match these against the right-hand sides of productions and add left-hand sides when a match occurs. The right-hand side matches are AB and BA which effect the inclusion of left-hand sides {S,A,C}.

© Robert M. Kline