Alphabets and Languages: section 1.7
An alphabet is any finite set, Σ; its elements are called symbols. For the most part we want to regard these as "single character" entities, but this is not necessary. Two fundamental things to define are "strings of symbols" and the "concatenation" operation. Although these are intuitively obvious, it is incumbent of us to regard them with some level of formalism. A string w over Σ is a finite sequence of symbols. This means:w : {1..n} ⇾ Σ, for some n ≥ 0where, for n ≥ 0, {1..n} = { i ∈ N : i ≤ n }. Consequently, for n = 0, {1..n} = ∅ For example, given Σ = {a..z}
"hello": {1..5} ⇾ Σ is { (1,h), (2,e), (3,l), (4,l), (5,o) }
The empty string function (n = 0) is, in fact, the empty set,
but we denote it by the special symbol ε.
Other authors prefer λ.
The length of a string w, written |w|,
is the number n = the number of tuples in the function.
In particular, the length of the empty string is 0.
For a non-empty string w, w(i) = the symbol in the i-th position
when written w = a_{1}a_{2}...a_{n}
(the 1-based left-to-right presentation is only a convenience).
An individual symbol, σ, represents the string (1,σ).
The set of all strings is denoted by Σ^{*}.
A language over Σ is any subset of Σ^{*},
i.e., an element of 2^{Σ*}.
Concatenation
Given strings x, y, concatentation is the binary operation x·y:x·y: {1..(|x|+|y|)} ⇾ Σ x·y(i) = x(i), if 0 < i ≤ |x| = y(i-|x|), if |x| < i ≤ |x|+|y|Consequently, it is easily verified that ε is the concatenation identity element:
for all strings x, ε·x = x·ε = xThe other intuitively obvious, but important, property of string concatenation is that of associativity, i.e.,
(x·y)·z = x·(y·z)An argument needs to be made that this is, in fact true. This is left as an exercise in the textbook which is incredibly tedious.
Languages & Operations
A language is a set, and so any set operation is a valid language operation. This includes union, intersection, complement, etc. Individual strings can be singleton sets, and we're particular interested in the these basis sets:∅,
ε = { ε },
σ = { σ }, for each σ ∈ Σ
We can also define language concatenation
L_{1}·L_{2} = { w : w = x·y for some x ∈ L_{1} and y ∈ L_{2} }What is the identity? ε = { ε } Likewise, we can define, for n > 0:
L^{n} = { x_{1}x_{2}...x_{n} : where x_{i} ∈ L, i = 1..n }
The definition of L^{0} is a formality of sorts.
It's the concatenation identity,
like the numerical x^{0} = 1.
L^{0} = { ε }
Inductive definition of L^{n}
To be more precise, we state an inductive definition of L^{n}L^{0} = { ε } L^{n} = L · L^{n-1}, n ≥ 1
Language operation algebra
With language concatenation and set union, we have a system that appears very algebraic:∪ (union) is like numerical +, with identity ∅ · (concatenation) is like numerical *, with identity εWe have the common algebraic identity properties:
L ∪ ∅ = L, L · ∅ = ∅, L · ε = LWe also have the associativity properties:
( L_{1} ∪ L_{2} ) ∪ L_{3} = L_{1} ∪ ( L_{2} ∪ L_{3} ) ( L_{1} · L_{2} ) · L_{3} = L_{1} · ( L_{2} · L_{3} )The associativity of · for languages follows from the associativity for individual strings. It's also easy to prove the distributive property:
L_{1} · ( L_{2} ∪ L_{3} ) = L_{1} · L_{2} ∪ L_{1} · L_{3}
Kleene Star Operation
Up to this point we can only create finite languages. Its the new operation Kleene Star which creates infinite languages:L^{*} = { w : w ∈ L^{n}, for some n ≥ 0 }Note the consistency w.r.t. our usage of Σ^{*} as the set of all strings. Just think of symbols in Σ as strings. The Kleene Star operation is responsible for generating infinite languages. A more precise definition of Kleene Star is done via closure: L^{*} is the closure under these properties:
{ ε } ⊆ L^{*} L ⊆ L^{*} ∀ L_{1}, L_{2} ⊆ L^{*}, L_{1} · L_{2} ⊆ L^{*}
Regular Expressions and Languages: section 1.8
We're interested in constructing languages based on the operations union, concatenation, and Kleene*. There are two notions to explore:- regular languages over Σ: types of languages over Σ which satisfy certain properties; a regular language is an abstract mathematical concept; we can express how it is created in any way which conveys understanding of what it is.
- regular expressions over Σ: these strings of symbols, where the symbols contain those of Σ as well as other operator symbols, or meta-symbols
Regular Languages via closure
A purely mathematical approach is sufficient to define regular languages. Consider the set R of regular languages. R is a subset of the set of all languages, 2^{Σ*}. Define R as the closure subject to these properties:∅ ∈ R, σ ∈ R, ∀ σ ∈ Σ L_{1}, L_{2} ∈ R ⇒ L_{1} ∪ L_{2} ∈ R L_{1}, L_{2} ∈ R ⇒ L_{1} · L_{2} ∈ R L ∈ R ⇒ L^{*} ∈ RKeep in mind that ∅^{*} = ε and so there is no need to explicitly require:
ε ∈ R
It's complicated to say what the closure is, because your working
with sets of languages, not languages, but here goes:
The set of regular languages,
R = ∩ { All sets of languages, S :
∀ S_{1}, S_{2} ∈ S :
∅ ∈ S_{1} and ∅ ∈ S_{2}
σ ∈ S_{1} and σ ∈ S_{2}, ∀ σ ∈ Σ
S_{1} ∩ S_{2} ∈ S
S_{1} · S_{2} ∈ S
S_{1}* ∈ S
Regular Expressions by the textbook
A regular expression is a string consisting of the symbols:∅, σ ∈ Σ, ε, *, ∪, (, )As indicated in the textbook, ε is not really necessary since it is equivalent to the expression ∅*. The textbook's definition of a regular expression is via a grammar. Written in more strict grammatical notation, we would write:
1) R ⇾ (R∪R) 2) R ⇾ (RR) 3) R ⇾ R* 4) R ⇾ ∅, σ ∈ Σ, εIn particular, the set of all regular expression strings is not a regular language. We'll study grammars more, but the idea is to replace occurrences of the "non-terminal" R by values on the right-hand side until we are left with only "terminal" symbols (anything but R in this case). Using Σ = {a,b}, examples of regular expressions are:
a, a*, a**, (ab), ((ab)c), (a∪b), (a∪(bc))This type of grammar only has us to create fully-parenthesized expressions which can be correctly understood without any assumptions about operator precedence. Given this set of regular expressions, we can define the set of regular languages by a function L which associates each expression to a language. The L function provides the "semantics" of the grammar, i.e., the assigning of mathematical objects to grammar constructs. The rules are exactly as dictated on page 49:
1) R ⇾ ∅, σ ∈ Σ, ε L(R) = the obvious thing
2) R_{1} ⇾ R_{2}* L(R_{1}) = L(R_{2})*
3) R_{1} ⇾ (R_{2}R_{3}) L(R_{1}) = L(R_{2}) · L(R_{3})
4) R_{1} ⇾ (R_{2}∪R_{3}) L(R_{1}) = L(R_{2}) ∪ L(R_{3})
One must technically prove that:
- these rules really do associate a regular expression to a language, i.e., L is defined
- there is only one language associated with a regular expression, i.e., L is well-defined.
Regular Expressions the wrong way
For example, suppose we tried to avoid fully parenthesized expressions by using these rules:1) R ⇾ R∪R 2) R ⇾ RR 3) R ⇾ R* 4) R ⇾ ∅, σ ∈ Σ, ε, (R)Then we could obtain the following two different interpretations of the regular expression a∪bc:
correct |
---|
R ⇒ R_{1}∪R_{2} (1) L(R_{1}) ∪ L(R_{2}) R ⇒ a∪R_{2} (4) L(a) ∪ L(R_{2}) R ⇒ a∪R_{3}R_{4} (2) L(a) ∪ L(R_{3}) · L(R_{4}) R ⇒ a∪bR_{4} (4) L(a) ∪ L(b) · L(R_{4}) R ⇒ a∪bc (4) L(a) ∪ L(b) · L(c) = { a, bc } |
incorrect |
R ⇒ R_{1}R_{2} (2) L(R_{1}) · L(R_{2}) R ⇒ R_{1}∪R_{3}R_{2} (1) ( L(R_{1}) ∪ L(R_{3}) ) · L(R_{2}) R ⇒ a∪R_{3}R_{2} (4) ( L(a) ∪ L(R_{3}) ) · L(R_{2}) R ⇒ a∪bR_{2} (4) ( L(a) ∪ L(b) ) · L(R_{2}) R ⇒ a∪bc (4) ( L(a) ∪ L(b) ) · L(c) = { ac, bc } |
Regular Expressions the right way
In order to correctly eliminate unnecessary parentheses, the grammar must be structured to provide correct precedence of the operations:∪ (lowest), juxtaposition, * (highest)Here is a correct approach written in BNF grammar format:
R ⇾ B{∪B} R = Regular expression B ⇾ Q{Q} B = Branch Q ⇾ A | A* Q = Quantified-atom A ⇾ ∅, σ ∈ Σ, ε, (R) A = AtomThe BNF syntax {..} means 0 or more repetitions. We can be loose about repetitions of B's and Q's because union and concatenation are associative. Any single Atom, Quantified-atom, or Branch is a Regular expression. Examples:
ε, a, a*, abc, a*b*, a∪b, a∪ε, a∪bc, (a∪b)c, (a∪b)*(a∪c), a*b∪(a∪b)This is NOT a regular expression: a** As above, the L function can be employed to assign expressions to languages. In this way we can unambiguously define a regular language as the language of a "real" regular expression.
Equivalence of the two approaches
Theorem: The set of languages defined by the L function is the same as the set of languages defined in the Regular Languages via closure subsection. Although this is intuitively obvious, it should be proved like this:- if r is a regular expression, then L(r) ∈ R
- if L ∈ R then ∃ a regular expression r such that L = L(r)
What about closure under other language operations?
It turns out the Regular Languages are closed under both complementation and intersection. We'll see this result later.Regular Expression Alternatives and Extensions
Some authors prefer using "+" instead of "∪", which is OK, except that it is also common to use +, ? and {..} as additional quantifiers with these meanings:L+ = L · L* L? = L ∪ ε L{n} = L^{n} L{n,m} = L^{n} ∪ L^{n+1} ∪ ... ∪ L^{m}A common notational alternative for ∪ is | which is used universally in real regular expressions but rarely used in theory.
Real regular expressions
By real regular expressions, I mean, of course, the kind of regular expressions which are used in programming environments. Of these, there are two main branches:- POSIX regular expressions: used primarily by UNIX commands like sed, grep, awk, etc.
- Perl compatible regular expressions: used by most programming languages.
- the alphabet includes all characters, including the operator characters (or meta-characters) *, (, ) which were assumed to be separate from the alphabet
- we need even more atomic expressions, e.g. expressing "any character" should be simple
- we need extensions which do not directly correspond to regular language interpretation
- the special character . (dot) represents any character except \n
- special characters acting as literals, which must be escaped: \., \*, \(, \\ in most circumstances
- symbol sets (bracket expressions) [ab] = (a|b), [abc] = (a|b|c), [a-z], [^a]
- special sets: \s, \S, \w, \W, \d, \D (perl-style),
- additional quantifiers: +, ?, {n}, {n,m}, and the minimal match ones: *?, +?
- anchors: ^, $, \b