Parsing

Description

The material in this handout corresponds roughly to the textbook material in pages 162-173. The first part, which we call LL Parsing is also called Top-Down Parsing. The second part, which we call LR Parsing is also called Bottom-Up Parsing.

The idea of parsing a language means to create a deterministic procedure (i.e., a DPDA) which creates a derivation for all strings in the language and rejects all others. Removing or dealing with grammar ambiguity is an obvious problem for parsing.

Another issue is the actual creation of the parse tree, or abstract syntax tree, which represents the structure of the string according to the grammar, but without the non-terminals.

LL and LR parsing

We know that finding a derivation of a string should be done in a systematic way as either a left-most or right-most derivation. There are two corresponding common approaches in compiler theory referred to as LL and LR parsing. In both cases the first "L" simply means read the string from the left, which seems pretty obvious.

LL parsing means You need to be able to choose among replacement rules and this choice should be guided by the input tokens. The common technique name is LL(1) which means that you can choose the rule based on 1-token lookahead. The determination of which replacement rule to use is based on grammar computations called the FIRST and FOLLOW sets.

LR parsing means All modern computer language parsers, certainly for any C-like language, use compiler-compilers (yacc, bison) which employ a parsing technique called LALR(1). We need not know the details, but acronym means "1-token LookAhead LR" parsing where the final

LL parsing

The determination as to whether a grammar can achieve LL parsing uses the computations of the so-called FIRST and FOLLOW sets.

The FIRST sets for all symbols

For α ∈ V*, FIRST(α) = the set of terminals (Σ) that begin the strings derived from α.

If α is nullable, i.e., α ⇒* ε, then ε ∈ FIRST(α).

To compute FIRST sets, repeat this process until nothing new is added to any FIRST set:
  1. for each σ ∈ Σ, FIRST(σ) = {σ}
  2. if A ⟶ ε, then FIRST(A) = {ε}
  3. Suppose A → β, where β has more than one symbol. Expand upon this idea:
    A → β = X1X2X3, Xi ∈ V
    
    then FIRST(X1) ⊆ FIRST(A)
    if ε ∈ FIRST(X1), then FIRST(X2) ⊆ FIRST(A)
    if ε ∈ FIRST(X1) ∩ FIRST(X2), then FIRST(X3) ⊆ FIRST(A)
    if ε ∈ FIRST(X1) ∩ FIRST(X2), ∩ FIRST(X3), then ε ∈ FIRST(A)
In particular, ε ∈ FIRST(A) if and only if A is nullable (A ⇒* ε). Remember that nullability does not necessarily appear as the rule A ⟶ ε, e.g., all non-terminals are nullable in this grammar:
A ⟶ BC
B ⟶ ε
C ⟶ D
D ⟶ ε

Assuming that FIRST(ε) = {ε}, we can use this prescription to compute FIRST(β).

The FOLLOW sets

For A ∈ V-Σ, FOLLOW(A) = set of all terminals, possibly including the end marker "$" which can appear after A in a derivation from the start symbol.

Once FIRST has been computed, repeat this process until nothing new is added to any FOLLOW set:
  1. $ ∈ FOLLOW(S)
  2. if A ⟶ αBβ, then FIRST{β) - {ε} ⊆ FOLLOW(B)
  3. if A ⟶ αBβ, and ε ∈ FIRST{β), then FOLLOW(A) ⊆ FOLLOW(B)
  4. if A ⟶ αB, then FOLLOW(A) ⊆ FOLLOW(B)

LL(1) Grammar

An LL(1) grammar should allow us to disambiguate the choice of two rules by looking at the next token in the input string (the lookahead token). Whenever we have
A ⟶ α | β
Then
  1. FIRST(α) ∩ FIRST(β) = ∅
  2. ε ∈ FIRST(α), i.e., A is nullable, then FIRST(β) ∩ FOLLOW(A) = ∅

The grammar for the anbn language is LL(1)

Use the grammar: S ⟶ ε | aSb

Start with:
FIRST(S) = {ε}
Using the rule S → aSb, we get
FIRST(S) = {a, ε}
Now compute FOLLOW sets (usually more involved). Start with
FOLLOW(S) = {$}
Using the rule S → aSb, we get
FOLLOW(S) = {b,$}
The grammar is LL(1). Use α = ε and β = aSb to get
  1. (FIRST(aSb) = {a}) ∩ (FIRST(ε) = {ε}) = ∅
  2. ε ∈ FIRST(ε), and (FIRST(aSb) = {a}) ∩ (FOLLOW(S) = {b,$}) = ∅
And so the parsing algorithm is simple:
stepstackinputrule
0Saabb$
1aSbaabb$S → aSb
2Sbabb$
3aSbbabb$S → aSb
4Sbbbb$
5bbbb$S → ε
6bb$
7$
Observe that when a replacement decision is made, it is determined by the lookahead input character, either The grammar-generated PDA, being non-deterministic, does not express what the parsing algorithm is doing.
We need to achieve the lookahead effect, and to do so, we must use a modified DPDA in which the symbol read takes it to a state dedicated to reading that symbol.
This DPDA makes sense in terms of the grammar analysis. If S is on the stack,
  1. use S ⟶ aSB if the lookahead is a ∈ FIRST(aSb)
  2. use S ⟶ ε if the lookahead token ∈ {b,$} = FOLLOW(S).
This issue is discussed in the textbook on pages 162, 163. We are depicting the same DPDA that is expressed by the chart on the top of page 163 (except the textbook's chart has an error in the omission of the loop transition at the q$ state.

The textbook gives a run for the string ab$; here are the runs for aabb$ and the "empty" string:
stateinputstack
paabb$ε
qaabb$S
qa abb$S
qa abb$aSb
q abb$Sb
qa  bb$Sb
qa  bb$aSbb
q  bb$Sbb
qb   b$Sbb
qb   b$bb
qb    $b
q    $ε
q$    εε
 
stateinputstack
p$ε
q$S
q$εS
q$εε

The modified expression grammar is LL(1)

Consider our unambiguous expression grammar, omitting subtraction:
E → E + T | T 
T → T * F | F  
F → a | ( E )
With this grammar, LL(1) parsing is impossible from the very start. Think about parsing the following string from top down with 1-token lookahead:
4 + 2 * 3
The parser is looking for a replacement for the start symbol E, which should be E → E + T. But how could it know? If the string were just 4, the lookahead token would be the same but the replacement should be E → T. Imagine that the parser correctly makes the replacement, perhaps by looking ahead 2 tokens:
E + T
The parse is still looking to make a replacement for E. This issue is called left recursion, which is associated with left associativity of the grammar.

Removing Left Recursion

Left recursion of a grammar appears in the rule
E → E + T
It means that the non-terminal replacement appears as the first symbol in the replacement string. LL(1) parsing cannot handle left recursion, and so we need to rewrite the grammar in order to remove it. The idea is to consider what can be produced by
E → E + T | T
It's simply the regular language  
T ( + T )*
which we can generate as follows:
E  → T E′
E′ → + T E′ | ε
The full expression grammar replacement for + and * expressions adds the same modification for the T usage. It looks like this:
E  → T E′
E′ → + T E′ | ε
T  → F T′
T′ → * F T′ | ε
F  → a | ( E )

FIRST and FOLLOW sets of the modified grammar

Start with FIRST sets. The computation completes in these 3 phases:

Initially we have
these sets:
FIRST(E) = {}
FIRST(E′) = {ε}
FIRST(T) = {} 
FIRST(T′) = {ε} 
FIRST(F) = {}
Use the right sides
of E', T', F to get:
FIRST(E) = {} 
FIRST(E′) = {+,ε} 
FIRST(T) = {} 
FIRST(T′) = {*,ε} 
FIRST(F) = {(,a}
Use the right sides
of T, then E to get:
FIRST(E) = {(,a} 
FIRST(E′) = {+,ε} 
FIRST(T) = {(,a} 
FIRST(T′) = {*,ε} 
FIRST(F) = {(,a}

Next compute FOLLOW sets. Start here:
FOLLOW(E) = {$}
FOLLOW(E′) = {} 
FOLLOW(T) = {} 
FOLLOW(T′) = {} 
FOLLOW(F) = {}
Then consider effects of these rules:
   E → TE′: FOLLOW(E) ⊆ FOLLOW(E′)
   E' → +TE′: FIRST(E′) - {ε} = {+} ⊆ FOLLOW(T)
   T → FT′: FOLLOW(T) ⊆ FOLLOW(T′)
FOLLOW(E) = {$}
FOLLOW(E′) = {$} 
FOLLOW(T) = {+} 
FOLLOW(T′) = {+} 
FOLLOW(F) = {}
Continuing,
   F → (E): FIRST(')') ⊆ FOLLOW(E)
   T′ → *FT′: FIRST(T′) - {ε} = {*} ⊆ FOLLOW(F)
FOLLOW(E) = {$,)}
FOLLOW(E′) = {$} 
FOLLOW(T) = {+} 
FOLLOW(T′) = {+} 
FOLLOW(F) = {*}
Continuing,
   E → TE′: FOLLOW(E) ⊆ FOLLOW(E′).
   E′ → +TE′: FOLLOW(E′) ⊆ FOLLOW(T) since ε ∈ FIRST(E′)
FOLLOW(E) = {$,)}
FOLLOW(E') = {$,)} 
FOLLOW(T) = {+,$,)} 
FOLLOW(T') = {+} 
FOLLOW(F) = {*}
Continuing,
   T → FT': FOLLOW(T) ⊆ FOLLOW(T′):
   T' → *FT': FOLLOW(T') ⊆ FOLLOW(F), since ε ∈ FIRST(T')
FOLLOW(E) = {$,)}
FOLLOW(E') = {$,)} 
FOLLOW(T) = {+,$,)} 
FOLLOW(T') = {+,$,)} 
FOLLOW(F) = {*,+,$,)}

The modified expression grammar is LL(1)

There are 3 rules with choices:
E′ → + T E′ | ε
T′ → * F T′ | ε
F  → a | ( E )
There are these things to verify:
FIRST(+ T E′) ∩ FIRST(ε) = ∅
FIRST(* F T′) ∩ FIRST(ε) = ∅

FIRST( a ) ∩ FIRST( "( E )" ) = 
{a} ∩ { ( } = ∅

ε ∈ FIRST(E′) and 
FIRST(+ T E′) ∩ FOLLOW(E′) = 
{+} ∩ { $, ) } = ∅ 

ε ∈ FIRST(T′) and
FIRST(* F T′) ∩ FOLLOW(T′) = 
{*} ∩ { +, $, ) } = ∅ 

Sample usage

Here is the derivation of the string
a + a * a $
stepstackinputrule
0Ea + a * a $
1T E′a + a * a $E → T E′
2F T′ E′a + a * a $T → F T′
3a T′ E′a + a * a $F → a
4T′ E′+ a * a $
5E′+ a * a $T′ → ε
6+ T E′+ a * a $E′ → + T E′
7T E′a * a $
8F T′ E′a * a $T → F T′
9a T′ E′a * a $F → a
10T′ E′* a $
11* F T′ E′* a $T′ → * F T′
12F T′ E′a $
13a T′ E′a $F → a
14T′ E′$
15E′$T′ → ε
16$E′ → ε
The marked steps signify replacements made where there is more than one rule choice. Here is the logic for which rule to use: Here is a portion of the DPDA for the lookahead parsing. As you see, we have to get the E′ and T′ rules out of the loop at q and make a decision to use these based on the lookahead.
One issue with LL(1) parsing is that, even though our rewritten grammar can parse the string, what about creating a abstract syntax tree for the original grammar?

LR parsing

If you think about it, going from bottom up actually makes the ambiguity determination much simpler. Parse trees are created from bottom up, and we should be able to tell at each point whether there are two possible parse trees for a given portion of the token string.

Let's consider briefly how 4 + 2 * 3 would be parsed via the LR method using our standard expression grammar.
E → E + T | E - T | T 
T → T * F | F  
F → a | ( E )
Here is the (forward) rightmost derivation of 4 + 2 * 3:
E ⟹ E + T
  ⟹ E + T * F
  ⟹ E + T * 3
  ⟹ E + F * 3
  ⟹ E + 2 * 3
  ⟹ T + 2 * 3
  ⟹ F + 2 * 3
  ⟹ 4 + 2 * 3
The LR parsing derivation is created in reverse. There are two possible operations:
    stack            input             action
    ----             -----             ------
                     4 + 2 * 3
1.  4                + 2 * 3           shift
2.  F                + 2 * 3           reduce by F → 4
3.  T                + 2 * 3           reduce by T → F
4.  E                + 2 * 3           reduce by E → T
5.  E +              2 * 3             shift
6.  E + 2            * 3               shift
7.  E + F            * 3               reduce by F → 2
8.  E + T            * 3               reduce by T → F
9.  E + T *          3                 shift
10. E + T * 3        (empty)           shift
11. E + T * F                          reduce by F → 3
11. E + T                              reduce by T → T * F
12. E                                  reduce by E → E + T
The tricky part is going from step 8 to step 9. At step 8 we conceivably could do a reduce, getting
9.  E                * 3               reduce by E → E + T
However, our 1-token lookahead sees the "*" and knows that there's "nowhere to go" from here:
10. E *               3                shift
Formalizing this approach is all quite complicated, but once it is done it can be shown that our expression grammar is LALR(1), loosely meaning that all these potential conflicts can be resolved with one-token lookahead. The final result is that

Theorem: A grammar which is LALR(1) is unambiguous

Dealing with if/else grammar

The if/else grammar above is ambiguous, but one does not rewrite the grammar. A gimmick is used to "pick the right one" when faced with a shift/reduce conflict.

Consider the parse of our usual ambiguous string:
stack                         input
-----                         -----
                              if (expr) if (expr) other else other
if (expr                      )  if (expr) other else other
if (E                         )  if (expr) other else other
if (E)                        if (expr) other else other
if (E) if                     (expr) other else other
if (E) if (                   expr) other else other
if (E) if (expr               ) other else other
if (E) if (E                  ) other else other
if (E) if (E)                 other else other
if (E) if (E) other           else other
if (E) if (E) S               else other
Now the moment of ambiguity. Should we shift, getting:
if (E) if (E) S else          other
or reduce by S → if (E) S, getting:
if (E) S                      else other
The strategy used in compilation with a shift/reduce conflict is "prefer shift." This leads us to the correct conclusion:
if (E) if (E) S else          other
if (E) if (E) S else other    (empty)
if (E) if (E) S else S        
if (E) S                      
S


© Robert M. Kline