- For each σ ∈ Σ, there is a tree with root σ and no children; its yield is σ
- For each rule A → ε, there is a tree with root A and one child ε; its yield is ε
- If t1, t2, ..., tn are parse trees with roots r1, r2, ..., rn and respective yields y1, y2, ..., yn, and A → r1r2...rn is a production, then there is a parse tree with root A whose children are t1, t2, ..., tn. Its root is A and its yield is the concatenation of yields: y1y2...yn
E → E + T | E - T | T T → T * F | F F → a | ( E )where a represents an operand of some type, be it a number or variable. The trees are grouped by height.
Abstract Syntax TreeA computer language is basically a context-free language. The symbols are tokens without any particular semantic meaning, namely, all numbers are the same, perhaps even all literal things (strings, numbers, etc) are regarded equally. All variables are regarded equally, etc. So the point is that we have a finite symbol set. The first step of a compiler is to create a parse tree of the program, and the second phase is to assign meaning, or semantics to the entities in the tree. In reality, you create an abstract syntax tree of the the program. For example, considering the parse tree for 4 + 2 * 3 above, an abstract syntax tree for this expression would look like this:
E → E+E | aWe write:
E ⇒ E+E ⇒ E+E+E ⇒ a+E+E ⇒ a+a+E ⇒ a+a+abut this is incomplete, because it doesn't tell us where the replacement rules are applied. We actually need "marked" strings which indicate which non-terminal is replaced in all but the first and last step:
E ⇒ Ě+E ⇒ Ě+E+E ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+aIn this case, the marking is only necessary in the second step; however it is crucial, because we want to distinguish between this derivation and the following one:
E ⇒ E+Ě ⇒ Ě+E+E ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+aWe want to characterize two derivations as "coming from the same parse tree." Technically speaking, we want to characterize a parse tree as an equivalence class of derivations. The first step is to define the relation among derivations as being "more left-oriented at one step". Assume we have two equal length derivations of length n > 2:
D: x1 ⇒ x2 ⇒ ... ⇒ xn D′: x1′ ⇒ x2′ ⇒ ... ⇒ xn′ where x1 = x1′ is a non-terminal and xn = xn′ ∈ Σ*Namely they start with the same non-terminal and end at the same terminal string and have at least two intermediate steps. We want to say D < D′ if the two derivations differ in only one step in which there are 2 non-terminals, A and B, such that D replaces the left one before the right one and D′ does the opposite. Formally:
D < D′ if there exists k, 1 < k < n such that xi = xi′ for all i ≠ k (equal strings, same marked position) xk-1 = uǍvBw, for u, v, w ∈ V* xk-1′ = uAvB̌w, for u, v, w ∈ V* xk = uyvB̌w, for production A → y xk′ = uǍvzw, for production B → z xk+1 = xk+1′ = uyvzw (marking not shown)Two derivations are said to be similar if they belong to the reflexive, symmetric, transitive closure of <. Therefore, two derivations are similar if one can be transformed to the other by a sequence of switching of the order in which rules are applied. Similarity is an equivalence relation and we think of the parse tree of a string as the equivalence class of all similar derivations.
Textbook IssueThe textbook never explicitly says that derivations consist of marked strings, just strings. Unfortunately this would make it so that the initial portion of the derivation
E ⇒ E+E ⇒ E+E+Ecould be interpreted ambiguously as either of these two:
E ⇒ Ě+E ⇒ E+E+E E ⇒ E+Ě ⇒ E+E+E
Basic Expression GrammarConsider again, the grammar specifying only addition in expression:
E → E+E | aThis we have described as being ambiguous. Here are two parse trees for the string "a+a+a":
For the left parse tree, we have these derivations:
L1: E ⇒ Ě+E ⇒ Ě+E+E ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+a (left-most) L2: E ⇒ Ě+E ⇒ E+Ě+E ⇒ Ě+a+E ⇒ a+a+Ě ⇒ a+a+a L3: E ⇒ Ě+E ⇒ E+Ě+E ⇒ E+a+Ě ⇒ Ě+a+a ⇒ a+a+a L4: E ⇒ Ě+E ⇒ E+E+Ě ⇒ E+Ě+a ⇒ Ě+a+a ⇒ a+a+a L5: E ⇒ E+Ě ⇒ Ě+a ⇒ E+Ě+a ⇒ Ě+a+a ⇒ a+a+a (right-most)These are structured so that L1 < L2 < L3 < L4 < L5 and so are all equivalent. For the right parse tree, we have these derivations:
R1: E ⇒ Ě+E ⇒ a+Ě ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+a (left-most) R2: E ⇒ E+Ě ⇒ Ě+E+E ⇒ a+Ě+E ⇒ a+a+Ě ⇒ a+a+a R3: E ⇒ E+Ě ⇒ E+Ě+E ⇒ Ě+a+E ⇒ a+a+Ě ⇒ a+a+a R4: E ⇒ E+Ě ⇒ E+Ě+E ⇒ E+a+Ě ⇒ Ě+a+a ⇒ a+a+a R5: E ⇒ E+Ě ⇒ E+E+Ě ⇒ E+Ě+a ⇒ Ě+a+a ⇒ a+a+a (right-most)Again, R1 < R2 < R3 < R4 < R5 and so are all equivalent. The point is that none of the derivations for the left parse tree are similar to any of the derivations of the right parse tree. For example, these two would be related if it were not for the crucial second step in the derivation:
R3: E ⇒ E+Ě ⇒ E+Ě+E ⇒ Ě+a+E ⇒ a+a+Ě ⇒ a+a+a L3: E ⇒ Ě+E ⇒ E+Ě+E ⇒ E+a+Ě ⇒ Ě+a+a ⇒ a+a+a
Similarity and parse treesWe claim that if D < D′, then both derivations construct the same parse tree in reverse. This is pretty obvious since the tree constructions differ only in two steps, one where A is created before B and the other vice-versa. From this, we can deduce that any two similar derivations create the same parse tree. Going the other way, we want to say that if two derivations construct the same parse tree, they must be similar. This is an inductive argument. Consider the root of the parse tree and its children:
(A t1 t2 ... tn)Inductively, we assume the two derivations are similar for all subtrees: t1, t2, ..., tn. It follows immediately by the transitivity and symmetry of the similarity relation that the two derivations are similar for the whole tree.
Leftmost and rightmost derivationsThe similarity relation, being an equivalence relation, partitions the set of all derivations of a given string. These partitions are finite, and we can argue that, within a partition, there is a unique one DL such that
DL < D, for all D != DL, D is similar to DLDL is said to be a leftmost derivation. It is easy to convince yourself that any derivation of one or two steps must be leftmost. Similarly we can define and argue the existence of a rightmost derivation.
Balanced ParenthesesThe grammar is:
S ⇾ S S | (S) | εConsider the string "()()". It has at several distinct derivations, such as
S ⇒ ŠS ⇒ (Š)S ⇒ ()Š ⇒ ()(Š) ⇒ ()() S ⇒ SŠ ⇒ Š(S) ⇒ (S)(Š) ⇒ (Š)() ⇒ ()()They are similar because they come from the same parse tree:
D1: S ⇒ ŠS ⇒ (Š)S ⇒ ((Š))S ⇒ (())Š ⇒ (())(Š) ⇒ (())() D2: S ⇒ ŠS ⇒ (Š)S ⇒ ((S))Š ⇒ ((Š))(S) ⇒ (())(Š) ⇒ (())() D3: S ⇒ ŠS ⇒ (Š)S ⇒ ((S))Š ⇒ ((S))(Š) ⇒ ((Š))() ⇒ (())()In this case D1 < D2 < D3 This language is ambiguous, and so we also have the parse tree for the string "(())()"
S ⇒ ŠS ⇒ SŠS ⇒ S(Š)S ⇒ S((Š))S ⇒ S(())Š ⇒ S(())(Š) ⇒ Š(())() ⇒ (())()
- There is a derivation from A to w
- There is a parse tree with root A and yield w
- There is a leftmost derivation from A to w
- There is a rightmost derivation from A to w
- there are two distinct parse trees, or
- there two distinct leftmost derivations, or
- there are two distinct rightmost derivations
E → E + E | E * E | a | ( E )Of course, it's not the fault of the language, since we can disambiguate it by incorporating the associativity and operator precedence.
E → E + T | T T → T * F | F F → a | ( E )It's actually very hard to prove that this grammar is not ambiguous! There are actually context-free languages which are inherently ambiguous, in that all possible grammars which generate it are ambiguous.
S → if ( E ) S | if ( E ) S else S S → other E → exprThe string used to illustrate ambiguity is this:
if ( expr ) if ( expr ) other else otherHere are two parse trees for it:
S ⟹ if ( E ) S ⟹ if ( E ) if ( E ) S else S ⟹2 if ( E ) if ( E ) other else other ⟹2 if ( expr ) if ( expr ) other else other
S ⟹ if ( E ) S else S ⟹ if ( E ) S else other ⟹ if ( E ) if ( E ) S else other ⟹ if ( E ) if ( E ) other else other ⟹2 if ( expr ) if ( expr ) other else other