state  input  stack 
p  aabb$  ε 
q  aabb$  S 
q_{a}  abb$  S 
q_{a}  abb$  aSb 
q  abb$  Sb 
q_{a}  bb$  Sb 
q_{a}  bb$  aSbb 
q  bb$  Sbb 
q_{b}  b$  Sbb 
q_{b}  b$  bb 
q_{b}  $  b 
q  $  ε 
q_{$}  ε  ε 
Description
The material in this handout corresponds roughly to the textbook material in pages 162173. The first part, which we call LL Parsing is also called TopDown Parsing. The second part, which we call LR Parsing is also called BottomUp Parsing. The idea of parsing a language means to create a deterministic procedure (i.e., a DPDA) which creates a derivation for all strings in the language and rejects all others. Removing or dealing with grammar ambiguity is an obvious problem for parsing. Another issue is the actual creation of the parse tree, or abstract syntax tree, which represents the structure of the string according to the grammar, but without the nonterminals.LL and LR parsing
We know that finding a derivation of a string should be done in a systematic way as either a leftmost or rightmost derivation. There are two corresponding common approaches in compiler theory referred to as LL and LR parsing. In both cases the first "L" simply means read the string from the left, which seems pretty obvious. LL parsing means process the tokens from the Left (the way they would come in naturally from a file)
 construct the derivation from top down, starting from the start symbol
 construct a Leftmost derivation
 process the tokens from the Left (the way they would come in naturally from a file)
 construct a Rightmost derivation in reverse
 construct the parse tree from bottom up, using all the symbols and arriving at the start symbol
LL parsing
The determination as to whether a grammar can achieve LL parsing uses the computations of the socalled FIRST and FOLLOW sets.The FIRST sets for all symbols
For α ∈ V*, FIRST(α) = the set of terminals (Σ) that begin the strings derived from α. If α is nullable, i.e., α ⇒* ε, then ε ∈ FIRST(α). To compute FIRST sets, repeat this process until nothing new is added to any FIRST set: for each σ ∈ Σ, FIRST(σ) = {σ}
 if A ⟶ ε, then FIRST(A) = {ε}

Suppose A → β, where β has more than one symbol.
Expand upon this idea:
A → β = X_{1}X_{2}X_{3}, X_{i} ∈ V
then FIRST(X_{1}) ⊆ FIRST(A)
if ε ∈ FIRST(X_{1}), then FIRST(X_{2}) ⊆ FIRST(A)
if ε ∈ FIRST(X_{1}) ∩ FIRST(X_{2}), then FIRST(X_{3}) ⊆ FIRST(A)
if ε ∈ FIRST(X_{1}) ∩ FIRST(X_{2}), ∩ FIRST(X_{3}), then ε ∈ FIRST(A)
A ⟶ BC B ⟶ ε C ⟶ D D ⟶ εAssuming that FIRST(ε) = {ε}, we can use this prescription to compute FIRST(β).
The FOLLOW sets
For A ∈ VΣ, FOLLOW(A) = set of all terminals, possibly including the end marker "$" which can appear after A in a derivation from the start symbol. Once FIRST has been computed, repeat this process until nothing new is added to any FOLLOW set: $ ∈ FOLLOW(S)
 if A ⟶ αBβ, then FIRST{β)  {ε} ⊆ FOLLOW(B)
 if A ⟶ αBβ, and ε ∈ FIRST{β), then FOLLOW(A) ⊆ FOLLOW(B)
 if A ⟶ αB, then FOLLOW(A) ⊆ FOLLOW(B)
LL(1) Grammar
An LL(1) grammar should allow us to disambiguate the choice of two rules by looking at the next token in the input string (the lookahead token). Whenever we haveA ⟶ α  βThen
 FIRST(α) ∩ FIRST(β) = ∅
 ε ∈ FIRST(α), i.e., A is nullable, then FIRST(β) ∩ FOLLOW(A) = ∅
The grammar for the a^{n}b^{n} language is LL(1)
Use the grammar: S ⟶ ε  aSb Start with:FIRST(S) = {ε}Using the rule S → aSb, we get
FIRST(S) = {a, ε}Now compute FOLLOW sets (usually more involved). Start with
FOLLOW(S) = {$}Using the rule S → aSb, we get
FOLLOW(S) = {b,$}The grammar is LL(1). Use α = ε and β = aSb to get
 (FIRST(aSb) = {a}) ∩ (FIRST(ε) = {ε}) = ∅
 ε ∈ FIRST(ε), and (FIRST(aSb) = {a}) ∩ (FOLLOW(S) = {b,$}) = ∅
step  stack  input  rule 
0  S  aabb$  
1  aSb  aabb$  S → aSb 
2  Sb  abb$  
3  aSbb  abb$  S → aSb 
4  Sbb  bb$  
5  bb  bb$  S → ε 
6  b  b$  
7  $ 
 a: replace S → aSb, because a ∈ FIRST(aSb)
 b: replace S → ε, because b ∈ FOLLOW(S)
 use S ⟶ aSB if the lookahead is a ∈ FIRST(aSb)
 use S ⟶ ε if the lookahead token ∈ {b,$} = FOLLOW(S).
state  input  stack 
p  $  ε 
q  $  S 
q_{$}  ε  S 
q_{$}  ε  ε 
The modified expression grammar is LL(1)
Consider our unambiguous expression grammar, omitting subtraction:E → E + T  T T → T * F  F F → a  ( E )With this grammar, LL(1) parsing is impossible from the very start. Think about parsing the following string from top down with 1token lookahead:
4 + 2 * 3The parser is looking for a replacement for the start symbol E, which should be E → E + T. But how could it know? If the string were just 4, the lookahead token would be the same but the replacement should be E → T. Imagine that the parser correctly makes the replacement, perhaps by looking ahead 2 tokens:
E + TThe parse is still looking to make a replacement for E. This issue is called left recursion, which is associated with left associativity of the grammar.
Removing Left Recursion
Left recursion of a grammar appears in the ruleE → E + TIt means that the nonterminal replacement appears as the first symbol in the replacement string. LL(1) parsing cannot handle left recursion, and so we need to rewrite the grammar in order to remove it. The idea is to consider what can be produced by
E → E + T  TIt's simply the regular language
T ( + T )*which we can generate as follows:
E → T E′ E′ → + T E′  εThe full expression grammar replacement for + and * expressions adds the same modification for the T usage. It looks like this:
E → T E′ E′ → + T E′  ε T → F T′ T′ → * F T′  ε F → a  ( E )
FIRST and FOLLOW sets of the modified grammar
Start with FIRST sets. The computation completes in these 3 phases:
Initially we have
these sets:
these sets:
FIRST(E) = {} FIRST(E′) = {ε} FIRST(T) = {} FIRST(T′) = {ε} FIRST(F) = {}
Use the right sides
of E', T', F to get:
of E', T', F to get:
FIRST(E) = {} FIRST(E′) = {+,ε} FIRST(T) = {} FIRST(T′) = {*,ε} FIRST(F) = {(,a}
Use the right sides
of T, then E to get:
of T, then E to get:
FIRST(E) = {(,a} FIRST(E′) = {+,ε} FIRST(T) = {(,a} FIRST(T′) = {*,ε} FIRST(F) = {(,a}
Next compute FOLLOW sets. Start here:
FOLLOW(E) = {$} FOLLOW(E′) = {} FOLLOW(T) = {} FOLLOW(T′) = {} FOLLOW(F) = {}Then consider effects of these rules:
E → TE′: FOLLOW(E) ⊆ FOLLOW(E′)
E' → +TE′: FIRST(E′)  {ε} = {+} ⊆ FOLLOW(T)
T → FT′: FOLLOW(T) ⊆ FOLLOW(T′)
FOLLOW(E) = {$} FOLLOW(E′) = {$} FOLLOW(T) = {+} FOLLOW(T′) = {+} FOLLOW(F) = {}Continuing,
F → (E): FIRST(')') ⊆ FOLLOW(E)
T′ → *FT′: FIRST(T′)  {ε} = {*} ⊆ FOLLOW(F)
FOLLOW(E) = {$,)} FOLLOW(E′) = {$} FOLLOW(T) = {+} FOLLOW(T′) = {+} FOLLOW(F) = {*}Continuing,
E → TE′: FOLLOW(E) ⊆ FOLLOW(E′).
E′ → +TE′: FOLLOW(E′) ⊆ FOLLOW(T) since ε ∈ FIRST(E′)
FOLLOW(E) = {$,)} FOLLOW(E') = {$,)} FOLLOW(T) = {+,$,)} FOLLOW(T') = {+} FOLLOW(F) = {*}Continuing,
T → FT': FOLLOW(T) ⊆ FOLLOW(T′):
T' → *FT': FOLLOW(T') ⊆ FOLLOW(F), since ε ∈ FIRST(T')
FOLLOW(E) = {$,)} FOLLOW(E') = {$,)} FOLLOW(T) = {+,$,)} FOLLOW(T') = {+,$,)} FOLLOW(F) = {*,+,$,)}
The modified expression grammar is LL(1)
There are 3 rules with choices:E′ → + T E′  ε T′ → * F T′  ε F → a  ( E )There are these things to verify:
FIRST(+ T E′) ∩ FIRST(ε) = ∅ FIRST(* F T′) ∩ FIRST(ε) = ∅ FIRST( a ) ∩ FIRST( "( E )" ) = {a} ∩ { ( } = ∅ ε ∈ FIRST(E′) and FIRST(+ T E′) ∩ FOLLOW(E′) = {+} ∩ { $, ) } = ∅ ε ∈ FIRST(T′) and FIRST(* F T′) ∩ FOLLOW(T′) = {*} ∩ { +, $, ) } = ∅
Sample usage
Here is the derivation of the stringa + a * a $
step  stack  input  rule 
0  E  a + a * a $  
1  T E′  a + a * a $  E → T E′ 
2  F T′ E′  a + a * a $  T → F T′ 
3  a T′ E′  a + a * a $  F → a 
4  T′ E′  + a * a $  
5  E′  + a * a $  T′ → ε 
6  + T E′  + a * a $  E′ → + T E′ 
7  T E′  a * a $  
8  F T′ E′  a * a $  T → F T′ 
9  a T′ E′  a * a $  F → a 
10  T′ E′  * a $  
11  * F T′ E′  * a $  T′ → * F T′ 
12  F T′ E′  a $  
13  a T′ E′  a $  F → a 
14  T′ E′  $  
15  E′  $  T′ → ε 
16  $  E′ → ε 

At steps 3, 9, 13 we have done replacements for F.
In each case the lookahead symbol is a.
We need to choose between
F → a and F → ( E )The choice is obvious; technically a ∉ FIRST( "( E )" )

At step 5, the lookahead symbol is + and we chose between
T′ → * F T′ and T′ → εThe winner was T′ → ε because + ∉ FIRST(* F T′), but + ∈ FOLLOW(T′).

Going to step 11, the lookahead symbol is * and chose between
T′ → * F T′ and T′ → εThe winner was T′ → * F T′ because * ∈ FIRST(* F T′), but * ∉ FOLLOW(T′).
 Going to steps 15 and 16, all real tokens have been consumed; only the terminal $ remains. The obvious choices of rules are T′ → ε and E′ → ε, respectively. The confirmations of success is that $ ∈ FOLLOW(T′) and $ ∈ FOLLOW(E′).
LR parsing
If you think about it, going from bottom up actually makes the ambiguity determination much simpler. Parse trees are created from bottom up, and we should be able to tell at each point whether there are two possible parse trees for a given portion of the token string. Let's consider briefly how 4 + 2 * 3 would be parsed via the LR method using our standard expression grammar.E → E + T  E  T  T T → T * F  F F → a  ( E )Here is the (forward) rightmost derivation of 4 + 2 * 3:
E ⟹ E + T ⟹ E + T * F ⟹ E + T * 3 ⟹ E + F * 3 ⟹ E + 2 * 3 ⟹ T + 2 * 3 ⟹ F + 2 * 3 ⟹ 4 + 2 * 3The LR parsing derivation is created in reverse. There are two possible operations:
 shift: a token from the input moves onto a working stack
 reduce: a portion of the stack (tokens and nonterminals) matches the righthand side of a rule and is replaced by the lefthand side (a nonterminal)
stack input action    4 + 2 * 3 1. 4 + 2 * 3 shift 2. F + 2 * 3 reduce by F → 4 3. T + 2 * 3 reduce by T → F 4. E + 2 * 3 reduce by E → T 5. E + 2 * 3 shift 6. E + 2 * 3 shift 7. E + F * 3 reduce by F → 2 8. E + T * 3 reduce by T → F 9. E + T * 3 shift 10. E + T * 3 (empty) shift 11. E + T * F reduce by F → 3 11. E + T reduce by T → T * F 12. E reduce by E → E + TThe tricky part is going from step 8 to step 9. At step 8 we conceivably could do a reduce, getting
9. E * 3 reduce by E → E + THowever, our 1token lookahead sees the "*" and knows that there's "nowhere to go" from here:
10. E * 3 shiftFormalizing this approach is all quite complicated, but once it is done it can be shown that our expression grammar is LALR(1), loosely meaning that all these potential conflicts can be resolved with onetoken lookahead. The final result is that
Theorem: A grammar which is LALR(1) is unambiguous
Dealing with if/else grammar
The if/else grammar above is ambiguous, but one does not rewrite the grammar. A gimmick is used to "pick the right one" when faced with a shift/reduce conflict.Consider the parse of our usual ambiguous string:
stack input   if (expr) if (expr) other else other if (expr ) if (expr) other else other if (E ) if (expr) other else other if (E) if (expr) other else other if (E) if (expr) other else other if (E) if ( expr) other else other if (E) if (expr ) other else other if (E) if (E ) other else other if (E) if (E) other else other if (E) if (E) other else other if (E) if (E) S else otherNow the moment of ambiguity. Should we shift, getting:
if (E) if (E) S else otheror reduce by S → if (E) S, getting:
if (E) S else otherThe strategy used in compilation with a shift/reduce conflict is "prefer shift." This leads us to the correct conclusion:
if (E) if (E) S else other if (E) if (E) S else other (empty) if (E) if (E) S else S if (E) S S