Languages and Regular Languages

Alphabets and Languages: section 1.7

An alphabet is any finite set, Σ; its elements are called symbols. For the most part we want to regard these as "single character" entities, but this is not necessary. Two fundamental things to define are "strings of symbols" and the "concatenation" operation. Although these are intuitively obvious, it is incumbent of us to regard them with some level of formalism.

A string w over Σ is a finite sequence of symbols. This means:
w : {1..n} ⇾ Σ, for some n ≥ 0
where, for n ≥ 0, {1..n} = { i ∈ N : i ≤ n }. Consequently, for n = 0, {1..n} = ∅

For example, given Σ = {a..z}
"hello": {1..5} ⇾ Σ is { (1,h), (2,e), (3,l), (4,l), (5,o) }
The empty string function (n = 0) is, in fact, the empty set, but we denote it by the special symbol ε. Other authors prefer λ.

The length of a string w, written |w|, is the number n = the number of tuples in the function. In particular, the length of the empty string is 0.

For a non-empty string w, w(i) = the symbol in the i-th position when written w = a1a2...an (the 1-based left-to-right presentation is only a convenience).

An individual symbol, σ, represents the string (1,σ).

The set of all strings is denoted by Σ*. A language over Σ is any subset of Σ*, i.e., an element of 2Σ*.

Concatenation

Given strings x, y, concatentation is the binary operation x·y:
   x·y: {1..(|x|+|y|)} ⇾ Σ

   x·y(i) = x(i),     if 0 < i ≤ |x|
          = y(i-|x|), if |x| < i ≤ |x|+|y|
Consequently, it is easily verified that ε is the concatenation identity element:
for all strings x, ε·x = x·ε = x
The other intuitively obvious, but important, property of string concatenation is that of associativity, i.e.,
(x·y)·z = x·(y·z)
An argument needs to be made that this is, in fact true. This is left as an exercise in the textbook which is incredibly tedious.

Languages & Operations

A language is a set, and so any set operation is a valid language operation. This includes union, intersection, complement, etc. Individual strings can be singleton sets, and we're particular interested in the these basis sets:
∅, 
ε = { ε }, 
σ = { σ }, for each σ ∈ Σ
We can also define language concatenation
L1·L2 = { w : w = x·y for some x ∈ L1 and y ∈ L2 }
What is the identity?   ε = { ε }

Likewise, we can define, for n > 0:
Ln = { x1x2...xn : where xi ∈ L, i = 1..n }
The definition of L0 is a formality of sorts. It's the concatenation identity, like the numerical x0 = 1.
L0 = { ε }

Inductive definition of Ln

To be more precise, we state an inductive definition of Ln
L0 = { ε }
Ln = L · Ln-1, n ≥ 1

Language operation algebra

With language concatenation and set union, we have a system that appears very algebraic:
∪ (union)          is like numerical +, with identity ∅
· (concatenation)  is like numerical *, with identity ε
We have the common algebraic identity properties:
L ∪ ∅ = L,  L · ∅ = ∅,  L · ε = L
We also have the associativity properties:
( L1 ∪ L2 ) ∪ L3 = L1 ∪ ( L2 ∪ L3 ) 
( L1 · L2 ) · L3 = L1 · ( L2 · L3 ) 
The associativity of · for languages follows from the associativity for individual strings. It's also easy to prove the distributive property:
L1 · ( L2 ∪ L3 ) = L1 · L2  ∪  L1 · L3

Kleene Star Operation

Up to this point we can only create finite languages. Its the new operation Kleene Star which creates infinite languages:
L* = { w : w ∈ Ln, for some n ≥ 0 }
Note the consistency w.r.t. our usage of Σ* as the set of all strings. Just think of symbols in Σ as strings. The Kleene Star operation is responsible for generating infinite languages.

A more precise definition of Kleene Star is done via closure: L* is the closure under these properties:
{ ε } ⊆ L*
L ⊆ L*
∀ L1, L2 ⊆ L*, L1 · L2  ⊆ L*

Regular Expressions and Languages: section 1.8

We're interested in constructing languages based on the operations union, concatenation, and Kleene*. There are two notions to explore: We want to equate these two concepts in practice, but it's important to separate them initially.

Regular Languages via closure

A purely mathematical approach is sufficient to define regular languages. Consider the set R of regular languages. R is a subset of the set of all languages, 2Σ*. Define R as the closure subject to these properties:
∅ ∈ R, 
σ ∈ R, ∀ σ ∈ Σ
L1, L2R ⇒ L1 ∪ L2R
L1, L2R ⇒ L1 · L2R
L ∈ R ⇒ L*R
Keep in mind that ∅* = ε and so there is no need to explicitly require:
ε ∈ R
It's complicated to say what the closure is, because your working with sets of languages, not languages, but here goes:
The set of regular languages,
R = {  All sets of languages, S :
∀ S1, S2 ∈ S :
∅ ∈ S1  and  ∅ ∈ S2
σ ∈ S1  and  σ ∈ S2,  ∀ σ ∈ Σ
S1 ∩ S2 ∈ S
S1 · S2 ∈ S
S1* ∈ S

Regular Expressions by the textbook

A regular expression is a string consisting of the symbols:
∅, σ ∈ Σ, ε, *, ∪, (, )
As indicated in the textbook, ε is not really necessary since it is equivalent to the expression ∅*. The textbook's definition of a regular expression is via a grammar. Written in more strict grammatical notation, we would write:
1) R ⇾ (RR)
2) R ⇾ (RR)
3) RR*
4) R ⇾ ∅, σ ∈ Σ, ε
In particular, the set of all regular expression strings is not a regular language. We'll study grammars more, but the idea is to replace occurrences of the "non-terminal" R by values on the right-hand side until we are left with only "terminal" symbols (anything but R in this case).

Using Σ = {a,b}, examples of regular expressions are:
a, a*, a**, (ab), ((ab)c), (a∪b), (a∪(bc))
This type of grammar only has us to create fully-parenthesized expressions which can be correctly understood without any assumptions about operator precedence.

Given this set of regular expressions, we can define the set of regular languages by a function L which associates each expression to a language. The L function provides the "semantics" of the grammar, i.e., the assigning of mathematical objects to grammar constructs. The rules are exactly as dictated on page 49:
1) R ⇾ ∅, σ ∈ Σ, ε     L(R) = the obvious thing
2) R1R2*             L(R1) = L(R2)*
3) R1 ⇾ (R2R3)          L(R1) = L(R2) · L(R3)
4) R1 ⇾ (R2R3)         L(R1) = L(R2) ∪ L(R3)
One must technically prove that:
  1. these rules really do associate a regular expression to a language, i.e., L is defined
  2. there is only one language associated with a regular expression, i.e., L is well-defined.
The proof of (a) is by induction on the length of a regular expression. When L goes recursively from the LHS to the RHS, the L applications on the RHS are on expressions with at least one symbol less.

The proof of (b) is one of avoiding ambiguity. There is one and only one left-to-right replacement sequence using these rules to obtain the resultant language.

An interesting note is that, although regular expressions defined in this way are unambiguous, the syntax still allows generation of expressions like a** which are not legal in "the real world."

Regular Expressions the wrong way

For example, suppose we tried to avoid fully parenthesized expressions by using these rules:
1) RRR
2) RRR
3) RR*
4) R ⇾ ∅, σ ∈ Σ, ε, (R)
Then we could obtain the following two different interpretations of the regular expression a∪bc:
correct
RR1R2  (1)   L(R1) ∪ L(R2)
RaR2   (4)   L(a) ∪ L(R2)
RaR3R4 (2)   L(a) ∪ L(R3) · L(R4)
RabR4  (4)   L(a) ∪ L(b) · L(R4)
Rabc   (4)   L(a) ∪ L(b) · L(c) = { a, bc }
incorrect
RR1R2     (2)   L(R1) · L(R2)
RR1R3R2  (1)   ( L(R1) ∪ L(R3) ) · L(R2)
RaR3R2   (4)   ( L(a) ∪ L(R3) ) · L(R2)
Ra∪bR2    (4)   ( L(a) ∪ L(b) ) · L(R2)
Ra∪bc     (4)   ( L(a) ∪ L(b) ) · L(c) = { ac, bc }

Regular Expressions the right way

In order to correctly eliminate unnecessary parentheses, the grammar must be structured to provide correct precedence of the operations:
(lowest), juxtaposition, * (highest)
Here is a correct approach written in BNF grammar format:
RB{∪B}                   R = Regular expression  
BQ{Q}                    B = Branch
QA | A*                  Q = Quantified-atom
A ⇾ ∅, σ ∈ Σ, ε, (R)        A = Atom
The BNF syntax {..} means 0 or more repetitions. We can be loose about repetitions of B's and Q's because union and concatenation are associative.

Any single Atom, Quantified-atom, or Branch is a Regular expression. Examples:
ε, a, a*, abc, a*b*, a∪b, a∪ε, a∪bc, (a∪b)c, (a∪b)*(a∪c), a*b∪(a∪b)
This is NOT a regular expression: a**

As above, the L function can be employed to assign expressions to languages. In this way we can unambiguously define a regular language as the language of a "real" regular expression.

Equivalence of the two approaches

Theorem: The set of languages defined by the L function is the same as the set of languages defined in the Regular Languages via closure subsection. Although this is intuitively obvious, it should be proved like this: Both proof parts rely on induction.

What about closure under other language operations?

It turns out the Regular Languages are closed under both complementation and intersection. We'll see this result later.

Regular Expression Alternatives and Extensions

Some authors prefer using "+" instead of "", which is OK, except that it is also common to use +, ? and {..} as additional quantifiers with these meanings:
L+     = L · L*
L?     = L ∪ ε
L{n}   = Ln
L{n,m} = Ln ∪ Ln+1 ∪ ... ∪ Lm
A common notational alternative for is | which is used universally in real regular expressions but rarely used in theory.

Real regular expressions

By real regular expressions, I mean, of course, the kind of regular expressions which are used in programming environments. Of these, there are two main branches: You can find a comparison at the Wikipedia site http://en.wikipedia.org/wiki/Regular_expression.

There are several issues which arise which real regular expressions must solve: The outcome is the inclusion of these atoms: In particular the quantifiers *? and +? and the anchors are used to control the matching behavior and do not correspond to regular languages.


© Robert M. Kline