Parse Trees

Parse Trees

A parse tree is an entity which represents the structure of the derivation of a terminal string from some non-terminal (not necessarily the start symbol). The definition is as in the book. Key features to define are the root ∈ V and yield ∈ Σ* of each tree. Observe that parse trees are constructed from bottom up, not top down. The actual construction of "adding children" should be made more precise, but we intuitively know what's going on.

As an example, here are all the parse (sub) trees used to build the parse tree for the arithmetic expression 4 + 3 * 2 using the expression grammar
E → E + T | E - T | T 
T → T * F | F  
F → a | ( E )
where a represents an operand of some type, be it a number or variable. The trees are grouped by height.

Syntactic Expressions for Trees

It's useful to have a non-graphical, syntactic way to express trees as lists. Think of the recursive construction of a tree with its subtrees like this:
(root subtree1 ... subtreen)
We want to create a grammar for trees (potential parse trees) for a given grammar G=(V,ΣR,S). The tree grammar uses these symbols: The rules for creating trees are these:
TS | L            a Tree is a terminal Symbol or a List
L → ( N  R )         the first element of a List is a Non-terminal (the root)
RT R | T | 'ε'    following N, one or more Trees or ε (literal), R serves to unRoll the Trees
N → A ∈ V – Σ       the Non-terminals from V are terminals for this tree grammar.
S → σ ∈ Σ
Not every tree constructed in this way can be a parse tree for the grammar, but this construction gives us all possible parse trees. Let's look at how it would work out for the expression grammar above using the sample expression. These are the list expressions of the above trees:
4    +    2    *    3
(F  4)    (F  2)    (F  3)
(T  (F  4))    (T  (F  2))
(E  (T  (F 4)))
(T  (T  (F 2))  *  (F 3)) 
(E  (E  (T  (F 4)))  +  (T (T (F 2))  *  (F 3)))

Abstract Syntax Tree

A computer language is basically a context-free language. The symbols are tokens without any particular semantic meaning, namely, all numbers are the same, perhaps even all literal things (strings, numbers, etc) are regarded equally. All variables are regarded equally, etc. So the point is that we have a finite symbol set.

The first step of a compiler is to create a parse tree of the program, and the second phase is to assign meaning, or semantics to the entities in the tree. In reality, you create an abstract syntax tree of the the program. For example, considering the parse tree for 4 + 2 * 3 above, an abstract syntax tree for this expression would look like this:

Parse Trees and Derivations

Effectively, a derivation done in reverse represents the construction of a parse tree. An intermediate stage of this reverse construction presents a ordered list of parse trees. Each step backwards creates another ordered list. The construction starts with an ordered list of the singleton trees of the terminal symbols in a target string and ends up being the list with one tree whose root is the a single non-terminal symbol.
[ 4, +, 2, *, 3 ]
[ 4, +, 2, *, (F  3) ]
[ 4, +, (F  2), *, (F  3) ]
[ 4, +, (T (F 2)), *, (F  3) ]
[ 4, +, (T (T (F 2)) * (F 3)) ]
[ (F 4), +, (T (T (F 2)) * (F 3)) ]
[ (T (F 4)), +, (T (T (F 2)) * (F 3)) ]
[ (E (T (F 4))), +, (T (T (F 2)) * (F 3)) ]
[ (E (E (T (F 4)) + (T (T (F 2)) * (F 3))) ]

In this way, a parse tree represents many possible derivations. We want to characterize two derivations "coming from the same parse tree." Technically speaking, we want to characterize a parse tree as an equivalence class of derivations. Although intuitively obvious, as you know, nothing is easy when you try to make it precise.

The first step is to define the relation "more left-oriented at one step". Assume we have two equal length derivations of length n > 2:
D:  x1  ⇒ x2  ⇒ ... ⇒ xn
D′: x1′ ⇒ x2′ ⇒ ... ⇒ xnwhere 
  x1 = x1is a non-terminal
and
  xn = xn′ ∈ Σ*
Namely they start at the same non-terminal and end at the same terminal string and have at least two intermediate steps.

We want to say D < D′ if the two derivations differ in only one step in which there are 2 non-terminals such that D replaces the left one before the right one and D′ does the opposite. Formally:
D < D′ if there exists k, 1 < k < n such that
   xi = xifor all i ≠ k
   xk-1 = xk-1′ = uAvBw, for u, v, w ∈ V*
   xk  = uyvBw, for production A → y  
   xk′ = uAvzw, for production B → z
Say that two derivations are similar if they belong to the reflexive, symmetric, transitive closure of <. Therefore, two derivations are similar if one can be transformed to the other by a sequence of switching of the order in which rules are applied.

Examples

Consider the ambiguous arithmetic expression grammar:
E → E+E | E-E | E*E | (E) | a
Consider these sample derivations:
D1: E ⇒ E+E ⇒ E-E+E ⇒ x-E+E ⇒ x-y+E ⇒ x-y+z
D2: E ⇒ E+E ⇒ E-E+E ⇒ E-y+E ⇒ x-y+E ⇒ x-y+z
D3: E ⇒ E+E ⇒ E-E+E ⇒ E-y+E ⇒ E-y+z ⇒ x-y+z
We see that D1 < D2 < D3 and therefore these three are mutually similar. Then consider:
D4: E ⇒ E-E ⇒ E-E+E ⇒ x-E+E ⇒ x-y+E ⇒ x-y+z
Although D4 is identical to D1 except for the first step, it is not similar to D1. This is because D1 and D4 come from different parse trees.

Similarity and parse trees

We start off by claiming that D < D′, then both derivations construct the same parse tree in reverse. This is pretty obvious since the tree constructions differ only in two steps, one where A is created before B and the other vice-versa. From this, we can deduce that any two similar derivations create the same parse tree.

Going the other way, we want to say that if two derivations construct the same parse tree, they must be similar. This is an inductive argument. Consider the root of the parse tree and its children:
(S t1 t2 ... tn)
Inductively, we assume the two derivations are similar for all subtrees: t1, t2, ..., tn. It follows immediately by the transitivity and symmetry of the similarity relation that the two derivations are similar for the whole tree.

Leftmost and rightmost derivations

The similarity relation, being an equivalence relation, partitions the set of all derivations of a given string. These partitions are finite, and we can argue that, within a partition, there is a unique one DL such that
DL < D, for all D != DL, D is similar to DL
DL is said to be a leftmost derivation. It is easy to convince yourself that any derivation of one or two steps must be leftmost. Similarly we can define and argue the existence of a rightmost derivation.

Summary

Theorem 3.2.1 in the textbook summarizes the various equivalent ways on understanding strings derived from a grammar. For any non-terminal A, and string w ∈ Σ*
  1. There is a derivation from A to w
  2. There is a parse tree with root A and yield w
  3. There is a leftmost derivation from A to w
  4. There is a rightmost derivation from A to w
A grammar G is ambiguous if there is a string in L(G) for which This expression grammar is ambiguous:
E → E + E | E * E | a | ( E )
Of course, it's not the fault of the language, since we can disambiguate it by incorporating the associativity and operator precedence.
E → E + T | T
T → T * F | F
F → a | ( E )
It's actually very hard to prove that this grammar is not ambiguous! We will talk about it later. A context-free language is called inherently ambiguous if all possible grammars which generate it are ambiguous. It turns out that such languages do exist!

The ambiguous if-else grammar

Given the grammar:
S → if ( E ) S | if ( E ) S else S
S → other
E → expr
The string used to illustrate ambiguity is this:
if ( expr ) if ( expr ) other else other
Here are two parse trees for it:

We'll show two different rightmost derivations:
S ⟹  if ( E ) S
  ⟹  if ( E ) if ( E ) S else S
  ⟹2 if ( E ) if ( E ) other else other
  ⟹2 if ( expr ) if ( expr ) other else other
S ⟹  if ( E ) S else S
  ⟹  if ( E ) S else other
  ⟹  if ( E ) if ( E ) S else other
  ⟹  if ( E ) if ( E ) other else other
  ⟹2 if ( expr ) if ( expr ) other else other


© Robert M. Kline