Parse Trees

Parse Trees

A parse tree is an entity which represents the structure of the derivation of a terminal string from some non-terminal (not necessarily the start symbol). The definition is as in the book. Key features to define are the root ∈ V and yield ∈ Σ* of each tree. Observe that parse trees are constructed from bottom up, not top down. The actual construction of "adding children" should be made more precise, but we intuitively know what's going on.

As an example, here are all the parse (sub) trees used to build the parse tree for the arithmetic expression 4 + 2 * 3 using the expression grammar
E → E + T | E - T | T 
T → T * F | F  
F → a | ( E )
where a represents an operand of some type, be it a number or variable. The trees are grouped by height.

Abstract Syntax Tree

A computer language is basically a context-free language. The symbols are tokens without any particular semantic meaning, namely, all numbers are the same, perhaps even all literal things (strings, numbers, etc) are regarded equally. All variables are regarded equally, etc. So the point is that we have a finite symbol set.

The first step of a compiler is to create a parse tree of the program, and the second phase is to assign meaning, or semantics to the entities in the tree. In reality, you create an abstract syntax tree of the the program. For example, considering the parse tree for 4 + 2 * 3 above, an abstract syntax tree for this expression would look like this:

Parse Trees and Derivations

First of all, what is a derivation? It's a sequence of strings in V* which starts with a non-terminal (in V-Σ and ends with a string in Σ*. Let's consider the sample grammar
E → E+E | a
We write:
E ⇒ E+E ⇒ E+E+E ⇒ a+E+E ⇒ a+a+E ⇒ a+a+a
but this is incomplete, because it doesn't tell us where the replacement rules are applied. We actually need "marked" strings which indicate which non-terminal is replaced in all but the first and last step:
E ⇒ Ě+E ⇒ Ě+E+E ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+a
In this case, the marking is only necessary in the second step; however it is crucial, because we want to distinguish between this derivation and the following one:
E ⇒ E+Ě ⇒ Ě+E+E ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+a
We want to characterize two derivations as "coming from the same parse tree." Technically speaking, we want to characterize a parse tree as an equivalence class of derivations.

The first step is to define the relation among derivations as being "more left-oriented at one step". Assume we have two equal length derivations of length n > 2:
D:  x1  ⇒ x2  ⇒ ... ⇒ xn
D′: x1′ ⇒ x2′ ⇒ ... ⇒ xnwhere 
  x1 = x1is a non-terminal
and
  xn = xn′ ∈ Σ*
Namely they start with the same non-terminal and end at the same terminal string and have at least two intermediate steps.

We want to say D < D′ if the two derivations differ in only one step in which there are 2 non-terminals, A and B, such that D replaces the left one before the right one and D′ does the opposite. Formally:
D < D′ if there exists k, 1 < k < n such that
   xi = xifor all i ≠ k (equal strings, same marked position)
   xk-1 = uǍvBw, for u, v, w ∈ V*
   xk-1′ = uAvB̌w, for u, v, w ∈ V*
   xk  = uyvB̌w, for production A → y  
   xk′ = uǍvzw, for production B → z
   xk+1 = xk+1′ = uyvzw (marking not shown)
Two derivations are said to be similar if they belong to the reflexive, symmetric, transitive closure of <. Therefore, two derivations are similar if one can be transformed to the other by a sequence of switching of the order in which rules are applied. Similarity is an equivalence relation and we think of the parse tree of a string as the equivalence class of all similar derivations.

Textbook Issue

The textbook never explicitly says that derivations consist of marked strings, just strings. Unfortunately this would make it so that the initial portion of the derivation
E ⇒ E+E ⇒ E+E+E 
could be interpreted ambiguously as either of these two:
E ⇒ Ě+E ⇒ E+E+E 
E ⇒ E+Ě ⇒ E+E+E 

Basic Expression Grammar

Consider again, the grammar specifying only addition in expression:
E → E+E | a
This we have described as being ambiguous. Here are two parse trees for the string "a+a+a":

For the left parse tree, we have these derivations:
L1: E ⇒ Ě+E ⇒ Ě+E+E ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+a (left-most)
L2: E ⇒ Ě+E ⇒ E+Ě+E ⇒ Ě+a+E ⇒ a+a+Ě ⇒ a+a+a 
L3: E ⇒ Ě+E ⇒ E+Ě+E ⇒ E+a+Ě ⇒ Ě+a+a ⇒ a+a+a 
L4: E ⇒ Ě+E ⇒ E+E+Ě ⇒ E+Ě+a ⇒ Ě+a+a ⇒ a+a+a 
L5: E ⇒ E+Ě ⇒ Ě+a ⇒ E+Ě+a ⇒ Ě+a+a ⇒ a+a+a (right-most)
These are structured so that L1 < L2 < L3 < L4 < L5 and so are all equivalent.

For the right parse tree, we have these derivations:
R1: E ⇒ Ě+E ⇒ a+Ě ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+a (left-most)
R2: E ⇒ E+Ě ⇒ Ě+E+E ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+a 
R3: E ⇒ E+Ě ⇒ E+Ě+E ⇒ Ě+a+E ⇒ a+a+Ě ⇒ a+a+a 
R4: E ⇒ E+Ě ⇒ E+Ě+E ⇒ E+a+Ě ⇒ Ě+a+a ⇒ a+a+a 
R5: E ⇒ E+Ě ⇒ E+E+Ě ⇒ E+Ě+a ⇒ Ě+a+a ⇒ a+a+a (right-most)
Again, R1 < R2 < R3 < R4 < R5 and so are all equivalent. The point is that none of the derivations for the left parse tree are similar to any of the derivations of the right parse tree. For example, these two would be related if it were not for the crucial second step in the derivation:
R3: E ⇒ E+Ě ⇒ E+Ě+E ⇒ Ě+a+E ⇒ a+a+Ě ⇒ a+a+a 
L3: E ⇒ Ě+E ⇒ E+Ě+E ⇒ E+a+Ě ⇒ Ě+a+a ⇒ a+a+a 

Similarity and parse trees

We claim that if D < D′, then both derivations construct the same parse tree in reverse. This is pretty obvious since the tree constructions differ only in two steps, one where A is created before B and the other vice-versa. From this, we can deduce that any two similar derivations create the same parse tree.

Going the other way, we want to say that if two derivations construct the same parse tree, they must be similar. This is an inductive argument. Consider the root of the parse tree and its children:
(A t1 t2 ... tn)
Inductively, we assume the two derivations are similar for all subtrees: t1, t2, ..., tn. It follows immediately by the transitivity and symmetry of the similarity relation that the two derivations are similar for the whole tree.

Leftmost and rightmost derivations

The similarity relation, being an equivalence relation, partitions the set of all derivations of a given string. These partitions are finite, and we can argue that, within a partition, there is a unique one DL such that
DL < D, for all D != DL, D is similar to DL
DL is said to be a leftmost derivation. It is easy to convince yourself that any derivation of one or two steps must be leftmost. Similarly we can define and argue the existence of a rightmost derivation.

Balanced Parentheses

The grammar is:
S ⇾ S S | (S) | ε
Consider the string "()()". It has at several distinct derivations, such as
S ⇒ ŠS ⇒ (Š)S ⇒ ()Š ⇒ ()(Š) ⇒ ()() 
S ⇒ SŠ ⇒ Š(S) ⇒ (S)(Š) ⇒ (Š)() ⇒ ()() 
They are similar because they come from the same parse tree:
Then consider this parse tree for the string "(())()"
Here are several similar derivations:
D1: S ⇒ ŠS ⇒ (Š)S ⇒ ((Š))S ⇒ (())Š ⇒ (())(Š) ⇒ (())()  
D2: S ⇒ ŠS ⇒ (Š)S ⇒ ((S))Š ⇒ ((Š))(S) ⇒ (())(Š) ⇒ (())()  
D3: S ⇒ ŠS ⇒ (Š)S ⇒ ((S))Š ⇒ ((S))(Š) ⇒ ((Š))() ⇒ (())()  
In this case D1 < D2 < D3

This language is ambiguous, and so we also have the parse tree for the string "(())()"
which has a derivation unrelated to the others:
S ⇒ ŠS ⇒ SŠS ⇒ S(Š)S ⇒ S((Š))S ⇒ S(())Š ⇒ S(())(Š) ⇒ Š(())() ⇒ (())()

Summary

Theorem 3.2.1 in the textbook summarizes the various equivalent ways to understand strings derived from a grammar. For any non-terminal A, and string w ∈ Σ*
  1. There is a derivation from A to w
  2. There is a parse tree with root A and yield w
  3. There is a leftmost derivation from A to w
  4. There is a rightmost derivation from A to w
A grammar G is ambiguous if there is a string in L(G) for which This expression grammar is ambiguous:
E → E + E | E * E | a | ( E )
Of course, it's not the fault of the language, since we can disambiguate it by incorporating the associativity and operator precedence.
E → E + T | T
T → T * F | F
F → a | ( E )
It's actually very hard to prove that this grammar is not ambiguous! There are actually context-free languages which are inherently ambiguous, in that all possible grammars which generate it are ambiguous.

The ambiguous if-else grammar

Given the grammar:
S → if ( E ) S | if ( E ) S else S
S → other
E → expr
The string used to illustrate ambiguity is this:
if ( expr ) if ( expr ) other else other
Here are two parse trees for it:

We'll show two different rightmost derivations:
S ⟹  if ( E ) S
  ⟹  if ( E ) if ( E ) S else S
  ⟹2 if ( E ) if ( E ) other else other
  ⟹2 if ( expr ) if ( expr ) other else other
S ⟹  if ( E ) S else S
  ⟹  if ( E ) S else other
  ⟹  if ( E ) if ( E ) S else other
  ⟹  if ( E ) if ( E ) other else other
  ⟹2 if ( expr ) if ( expr ) other else other


© Robert M. Kline