## Languages and Regular Languages

### Alphabets and Languages: section 1.7

An alphabet is any finite set, Σ; its elements are called symbols. For the most part we want to regard these as "single character" entities, but this is not necessary. Two fundamental things to define are "strings of symbols" and the "concatenation" operation. Although these are intuitively obvious, it is incumbent on us to treat them with some level of formalism.

A string w over Σ is a finite sequence of symbols. This means:
```w : {1..n} ⇾ Σ, for some n ≥ 0
```
where, for n ≥ 0, {1..n} = { i ∈ N : 1 ≤ i ≤ n }. Consequently, for n = 0, {1..n} = ∅.

For example, given Σ = {a..z}
```"hello": {1..5} ⇾ Σ is { (1,h), (2,e), (3,l), (4,l), (5,o) }
```
The empty string function (n = 0) is, in fact, the empty set, but we denote it by the special symbol ε. Other authors prefer λ.

The length of a string w, written |w|, is the number n, i.e., the number of pairs in the function. In particular, the length of the empty string is 0.

For a non-empty string w, w(i) = the symbol in the i-th position when written w = a1a2...an (the 1-based left-to-right presentation is only a convenience).

An individual symbol, σ, also denotes the length-1 string { (1,σ) }.

The set of all strings over Σ is denoted by Σ*. A language over Σ is any subset of Σ*, i.e., an element of the power set 2^Σ*.
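To make the definition concrete, here is a small sketch (my own illustration, not from the text) that encodes an ordinary Python str as the function {1..n} ⇾ Σ, i.e., as a set of position/symbol pairs:

```python
def as_function(s):
    """Encode a Python str as the function {1..n} -> Sigma (a dict of pairs)."""
    return {i + 1: c for i, c in enumerate(s)}

hello = as_function("hello")
assert hello == {1: 'h', 2: 'e', 3: 'l', 4: 'l', 5: 'o'}
assert len(hello) == 5            # |w| = the number of pairs in the function
assert as_function("") == {}      # the empty string is the empty function
```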

#### Concatenation

Given strings x, y, concatenation is the binary operation x·y:
```   x·y: {1..(|x|+|y|)} ⇾ Σ

x·y(i) = x(i),      if 0 < i ≤ |x|
       = y(i-|x|),  if |x| < i ≤ |x|+|y|
```
Consequently, it is easily verified that ε is the concatenation identity element:
```for all strings x, ε·x = x·ε = x
```
The other intuitively obvious, but important, property of string concatenation is that of associativity, i.e.,
```(x·y)·z = x·(y·z)
```
An argument needs to be made that this is, in fact, true; the textbook leaves it as an exercise, which turns out to be incredibly tedious.
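The two-case definition translates directly into code, and the identity and associativity properties can at least be spot-checked. A sketch (the helper names as_function and concat are my own):

```python
def as_function(s):
    """Encode a Python str as the function {1..n} -> Sigma."""
    return {i + 1: c for i, c in enumerate(s)}

def concat(x, y):
    """x·y per the two-case definition: positions 1..|x| from x, the rest from y."""
    return {i: (x[i] if i <= len(x) else y[i - len(x)])
            for i in range(1, len(x) + len(y) + 1)}

x, y, z = as_function("ab"), as_function("cd"), as_function("e")
assert concat(x, y) == as_function("abcd")
assert concat({}, x) == x == concat(x, {})                 # ε is the identity
assert concat(concat(x, y), z) == concat(x, concat(y, z))  # associativity
```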

#### Languages & Operations

A language is a set, and so any set operation is a valid language operation. This includes union, intersection, complement, etc. Individual strings can be regarded as singleton sets, and we're particularly interested in these basis sets:
```∅,
ε = { ε },
σ = { σ }, for each σ ∈ Σ
```
We can also define language concatenation
```L1·L2 = { w : w = x·y for some x ∈ L1 and y ∈ L2 }
```
What is the identity?   ε = { ε }

Likewise, we can define, for n > 0:
```L^n = { x1x2...xn : xi ∈ L, i = 1..n }
```
The definition of L^0 is a formality of sorts. It's the concatenation identity, like the numerical x^0 = 1.
```L^0 = { ε }
```

#### Inductive definition of Ln

To be more precise, we state an inductive definition of L^n:
```L^0 = { ε }
L^n = L · L^(n-1), n ≥ 1
```
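The inductive definition can be transcribed directly, with languages as Python sets of ordinary strings (string + stands in for the formal x·y; the name power is my own):

```python
def power(L, n):
    """L^0 = {epsilon}; L^n = L · L^(n-1) for n >= 1."""
    if n == 0:
        return {""}
    return {x + y for x in L for y in power(L, n - 1)}

L = {"a", "bb"}
assert power(L, 0) == {""}
assert power(L, 1) == L
assert power(L, 2) == {"aa", "abb", "bba", "bbbb"}
```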

#### Language operation algebra

With language concatenation and set union, we have a system that appears very algebraic:
```∪ (union)          is like numerical +, with identity ∅
· (concatenation)  is like numerical *, with identity ε
```
We have the common algebraic identity properties:
```L ∪ ∅ = L,  L · ∅ = ∅,  L · ε = L
```
We also have the associativity properties:
```( L1 ∪ L2 ) ∪ L3 = L1 ∪ ( L2 ∪ L3 )
( L1 · L2 ) · L3 = L1 · ( L2 · L3 )
```
The associativity of · for languages follows from the associativity for individual strings. It's also easy to prove the distributive property:
```L1 · ( L2 ∪ L3 ) = L1 · L2  ∪  L1 · L3
```
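These algebraic laws can be spot-checked on small examples. A sketch, with cat as an assumed helper for language concatenation:

```python
def cat(L1, L2):
    """Language concatenation: every x·y with x in L1, y in L2."""
    return {x + y for x in L1 for y in L2}

L1, L2, L3 = {"a", "b"}, {"c"}, {"", "d"}
assert L1 | set() == L1                                 # L ∪ ∅ = L
assert cat(L1, set()) == set()                          # L · ∅ = ∅
assert cat(L1, {""}) == L1                              # L · ε = L
assert cat(L1, L2 | L3) == cat(L1, L2) | cat(L1, L3)    # distributivity
```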

#### Kleene Star Operation

Up to this point we can only create finite languages. It's the new operation, Kleene star, which creates infinite languages:
```L* = { w : w ∈ L^n, for some n ≥ 0 }
```
Note the consistency with our usage of Σ* as the set of all strings: just think of the symbols in Σ as length-1 strings.

A more precise definition of Kleene Star is done via closure: L* is the closure under these properties:
```{ ε } ⊆ L*
L ⊆ L*
∀ L1, L2 ⊆ L*, L1 · L2  ⊆ L*
```
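Since L* is infinite whenever L contains a nonempty string, a program can only enumerate a finite approximation. A sketch (the name star_up_to is my own) that collects every string of L* up to a length bound:

```python
def star_up_to(L, max_len):
    """All strings of L* with length <= max_len."""
    result = {""}          # L^0 = {ε}
    frontier = {""}
    while frontier:
        # extend each known string by one more factor from L, keeping only
        # strings that are short enough and not already collected
        frontier = {x + y for x in frontier for y in L
                    if len(x + y) <= max_len and x + y not in result}
        result |= frontier
    return result

assert star_up_to({"a", "bb"}, 3) == {"", "a", "bb", "aa", "abb", "bba", "aaa"}
```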

### Regular Expressions and Languages: section 1.8

We're interested in constructing languages based on the operations union, concatenation, and Kleene*. There are two notions to explore:
• regular languages over Σ: types of languages over Σ which satisfy certain properties; a regular language is an abstract mathematical concept; we can express how it is created in any way which conveys understanding of what it is.
• regular expressions over Σ: these are strings of symbols, drawn from Σ together with additional operator symbols, or meta-symbols
We want to equate these two concepts in practice, but it's important to separate them initially.

#### Regular Languages via closure

A purely mathematical approach is sufficient to define regular languages. Consider the set R of regular languages. R is a subset of the set of all languages, 2^Σ*. Define R as the closure subject to these properties:
```∅ ∈ R,
σ ∈ R, ∀ σ ∈ Σ
L1, L2 ∈ R ⇒ L1 ∪ L2 ∈ R
L1, L2 ∈ R ⇒ L1 · L2 ∈ R
L ∈ R ⇒ L* ∈ R
```
Keep in mind that ∅* = { ε } and so there is no need to explicitly require:
```ε ∈ R
```
It's complicated to say what the closure is, because we're working with sets of languages, not languages, but here goes. The set of regular languages R is the intersection of all sets of languages S satisfying:
∅ ∈ S
σ ∈ S, ∀ σ ∈ Σ
L1 ∪ L2 ∈ S, L1 · L2 ∈ S, and L1* ∈ S, ∀ L1, L2 ∈ S

#### Regular Expressions by the textbook

A regular expression is a string consisting of the symbols:
```∅, σ ∈ Σ, ε, *, ∪, (, )
```
As indicated in the textbook, ε is not really necessary since it is equivalent to the expression ∅*. The textbook's definition of a regular expression is via a grammar. Written in more strict grammatical notation, we would write:
```1) R ⇾ (R∪R)
2) R ⇾ (RR)
3) R ⇾ R*
4) R ⇾ ∅, σ ∈ Σ, ε
```
Interestingly, the set of all regular expression strings is itself not a regular language. We'll study grammars more, but the idea is to replace occurrences of the "non-terminal" R by right-hand side values until we are left with only "terminal" symbols (anything but R in this case).

Using Σ = {a,b,c}, examples of regular expressions are:
```a, a*, a**, (ab), ((ab)c), (a∪b), (a∪(bc))
```
This type of grammar only allows us to create fully-parenthesized expressions, which can be correctly understood without any assumptions about operator precedence.

Given this set of regular expressions, we can define the set of regular languages by a function L which associates each expression to a language. The L function provides the "semantics" of the grammar, i.e., the assigning of mathematical objects to grammar constructs. The rules are exactly as dictated on page 49:
```1) R ⇾ ∅, σ ∈ Σ, ε     L(R) = the obvious thing
2) R1 ⇾ R2*             L(R1) = L(R2)*
3) R1 ⇾ (R2R3)          L(R1) = L(R2) · L(R3)
4) R1 ⇾ (R2∪R3)         L(R1) = L(R2) ∪ L(R3)
```
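The semantic rules can be sketched as a function on hand-built parse trees, with Kleene star truncated to a length bound so the result stays finite (the tuple encoding and the name lang are my own, not the textbook's):

```python
def lang(tree, max_len=3):
    """The L function on parse trees; star is approximated up to max_len."""
    op = tree[0]
    if op == "empty":                       # L(∅) = ∅
        return set()
    if op == "sym":                         # L(σ) = {σ}; ("sym", "") plays ε
        return {tree[1]}
    if op == "union":                       # L((R1∪R2)) = L(R1) ∪ L(R2)
        return lang(tree[1], max_len) | lang(tree[2], max_len)
    if op == "cat":                         # L((R1R2)) = L(R1) · L(R2)
        return {x + y for x in lang(tree[1], max_len)
                      for y in lang(tree[2], max_len)}
    if op == "star":                        # L(R*) = L(R)*, truncated
        base, result, frontier = lang(tree[1], max_len), {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len and x + y not in result}
            result |= frontier
        return result
    raise ValueError(tree)

# the tree for the fully parenthesized expression ((a∪b)(ab))
t = ("cat", ("union", ("sym", "a"), ("sym", "b")),
            ("cat", ("sym", "a"), ("sym", "b")))
assert lang(t) == {"aab", "bab"}
assert lang(("star", ("sym", "a"))) == {"", "a", "aa", "aaa"}
```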
One must technically prove that:
1. these rules really do associate a regular expression to a language, i.e., L is defined
2. there is only one language associated with a regular expression, i.e., L is well-defined.
The proof of (1) is by induction on the length of a regular expression: when L goes recursively from the LHS to the RHS, the L applications on the RHS are applied to expressions with at least one symbol fewer.

The proof of (2) is one of avoiding ambiguity: there is one and only one left-to-right replacement sequence using these rules to obtain the resultant language.

An interesting note is that, although regular expressions defined in this way are unambiguous, the syntax still allows generation of expressions like a** which are not legal in "the real world."

#### Regular Expressions the wrong way

For example, suppose we tried to avoid fully parenthesized expressions by using these rules:
```1) R ⇾ R∪R
2) R ⇾ RR
3) R ⇾ R*
4) R ⇾ ∅, σ ∈ Σ, ε, (R)
```
Then we could obtain the following two different interpretations of the regular expression a∪bc:
correct
```R ⇒ R1∪R2  (1)   L(R1) ∪ L(R2)
R ⇒ a∪R2   (4)   L(a) ∪ L(R2)
R ⇒ a∪R3R4 (2)   L(a) ∪ L(R3) · L(R4)
R ⇒ a∪bR4  (4)   L(a) ∪ L(b) · L(R4)
R ⇒ a∪bc   (4)   L(a) ∪ L(b) · L(c) = { a, bc }
```
incorrect
```R ⇒ R1R2     (2)   L(R1) · L(R2)
R ⇒ R1∪R3R2  (1)   ( L(R1) ∪ L(R3) ) · L(R2)
R ⇒ a∪R3R2   (4)   ( L(a) ∪ L(R3) ) · L(R2)
R ⇒ a∪bR2    (4)   ( L(a) ∪ L(b) ) · L(R2)
R ⇒ a∪bc     (4)   ( L(a) ∪ L(b) ) · L(c) = { ac, bc }
```

#### Regular Expressions the right way

In order to correctly eliminate unnecessary parentheses, the grammar must be structured to provide correct precedence of the operations:
```∪ (lowest), juxtaposition, * (highest)
```
Here is a correct approach written in BNF grammar format:
```R ⇾ B{∪B}                   R = Regular expression
B ⇾ Q{Q}                    B = Branch
Q ⇾ A | A*                  Q = Quantified-atom
A ⇾ ∅, σ ∈ Σ, ε, (R)        A = Atom
```
The BNF syntax {..} means 0 or more repetitions. We can be loose about repetitions of B's and Q's because union and concatenation are associative.

Any single Atom, Quantified-atom, or Branch is a Regular expression. Examples:
```ε, a, a*, abc, a*b*, a∪b, a∪ε, a∪bc, (a∪b)c, (a∪b)*(a∪c), a*b∪(a∪b)
```
This is NOT a regular expression: a**
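A recursive-descent parser falls straight out of this grammar, one function per non-terminal. A sketch (syntax checking only; the name is_regex and the code are my own, not the textbook's):

```python
def is_regex(s, sigma="abc"):
    """True iff s is a well-formed regular expression per the BNF grammar."""
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def atom():                      # A -> ∅ | σ | ε | (R)
        nonlocal pos
        c = peek()
        if c is None:
            return False
        if c in sigma or c in "∅ε":
            pos += 1
            return True
        if c == "(":
            pos += 1
            if not regex() or peek() != ")":
                return False
            pos += 1
            return True
        return False

    def quant():                     # Q -> A | A*
        nonlocal pos
        if not atom():
            return False
        if peek() == "*":            # at most one star: a** is rejected
            pos += 1
        return True

    def branch():                    # B -> Q{Q}
        if not quant():
            return False
        while peek() not in (None, "∪", ")"):
            if not quant():
                return False
        return True

    def regex():                     # R -> B{∪B}
        nonlocal pos
        if not branch():
            return False
        while peek() == "∪":
            pos += 1
            if not branch():
                return False
        return True

    return regex() and pos == len(s)

assert is_regex("a∪bc") and is_regex("(a∪b)*(a∪c)") and is_regex("ε")
assert not is_regex("a**") and not is_regex("a∪")
```

Note that quant consumes at most one star after an atom, which is exactly why a** is rejected.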

As above, the L function can be employed to assign expressions to languages. In this way we can unambiguously define a regular language as the language of a "real" regular expression.

#### Equivalence of the two approaches

Theorem: The set of languages defined by the L function is the same as the set of languages defined in the Regular Languages via closure subsection. Although this is intuitively obvious, it must be proved in two parts:
• if r is a regular expression, then L(r) ∈ R
• if L ∈ R, then ∃ a regular expression r such that L = L(r)
Both proof parts rely on induction.

#### What about closure under other language operations?

It turns out the Regular Languages are closed under both complementation and intersection. We'll see this result later.

### Regular Expression Alternatives and Extensions

Some authors prefer using "+" instead of "∪", which is OK, except that it is also common to use +, ? and {..} as additional quantifiers with these meanings:
```L+     = L · L*
L?     = L ∪ ε
L{n}   = L^n
L{n,m} = L^n ∪ L^(n+1) ∪ ... ∪ L^m
```
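These extended quantifiers reduce to the basic operations. A sketch on finite example sets (L+ is omitted since it is infinite; the helper names are my own):

```python
def cat(L1, L2):
    """Language concatenation."""
    return {x + y for x in L1 for y in L2}

def power(L, n):
    """L^n via the inductive definition."""
    return {""} if n == 0 else cat(L, power(L, n - 1))

def optional(L):
    """L? = L ∪ {ε}"""
    return L | {""}

def bounded(L, n, m):
    """L{n,m} = L^n ∪ L^(n+1) ∪ ... ∪ L^m"""
    out = set()
    for k in range(n, m + 1):
        out |= power(L, k)
    return out

L = {"a"}
assert optional(L) == {"", "a"}
assert bounded(L, 1, 3) == {"a", "aa", "aaa"}
```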
A common notational alternative for ∪ is |, which is used universally in real regular expressions but rarely in theory.

#### Real regular expressions

By real regular expressions, I mean, of course, the kind of regular expressions which are used in programming environments. Of these, there are two main branches: POSIX-style and Perl-style.
You can find a comparison at the Wikipedia site http://en.wikipedia.org/wiki/Regular_expression.

Several issues arise which real regular expressions must solve:
• the alphabet includes all characters, including the operator characters (or meta-characters) *, (, ) which were assumed to be separate from the alphabet
• we need even more atomic expressions, e.g. expressing "any character" should be simple
• we need extensions which do not directly correspond to regular language interpretation
The outcome is the inclusion of these atoms:
• the special character . (dot) represents any character except \n
• special characters acting as literals, which must be escaped: \., \*, \(, \\ in most circumstances
• symbol sets (bracket expressions) [ab] = (a|b), [abc] = (a|b|c), [a-z], [^a]
• special sets: \s, \S, \w, \W, \d, \D (perl-style),
• additional quantifiers: +, ?, {n}, {n,m}, and the minimal match ones: *?, +?
• anchors: ^, $, \b
In particular, the quantifiers *? and +? and the anchors are used to control the matching behavior and do not correspond to regular languages.
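Each of these atoms and quantifiers can be demonstrated with Python's re module:

```python
import re

assert re.fullmatch(r"a.c", "abc")            # . matches any char except \n
assert re.fullmatch(r"3\.14", "3.14")         # escaped dot is a literal
assert re.fullmatch(r"[a-c]+", "cab")         # bracket expression with +
assert re.fullmatch(r"\d{2,4}", "2024")       # \d with a {n,m} quantifier
assert re.search(r"^\w+", "hello world").group() == "hello"   # anchor ^
assert re.search(r"<.+?>", "<a><b>").group() == "<a>"         # minimal-match +?
```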