Regular Expressions

The so-called Perl-compatible regular expressions offer enhancements over the POSIX extended variety used in other software.

A regular expression is a string representing a pattern used for matching some portion(s) of a target string. Regular expressions are very general and, as a consequence, quite complex: many different types of operations are represented by special characters, or meta-characters.

Regular expression atoms

Regular expression syntax is usually described by a formal grammar, but we'll describe it more loosely from the bottom up, the way you would create one. At the bottom are the basic components, called atoms. Here is a non-exhaustive list:

Quantifiers

Regular expressions use quantifiers to specify how many times an atom may repeat, including unbounded repetition. An atom can optionally be followed by one of these quantifiers:

A quantifier can optionally be followed by "?", indicating that the match should be minimal (non-greedy, reluctant) as opposed to the default maximal (greedy) match.

Full regular expressions

An atom optionally followed by a quantifier is called a piece.

The juxtaposition (concatenation) of pieces is called a branch. A target string matches a branch if it can be divided into consecutive substrings, each of which matches the corresponding piece.

A regular expression is a choice of branches represented by a string where the branches are separated by the | character. A choice is matched if one or more of the branches match.

For example, these are atoms:

a     b     [abc]     (a|b+c)   [^bc]

These, in addition to all above, are pieces:

a*    c+    (ca*b)*   (a|b+c){3}

These, in addition to all above, are branches:

aba   a*b*  [ab]*c*   (a|b+c){3,10}(b|a+c){4}

These, in addition to all above, are regular expressions:

a*b|ab*     a|b+|c*   ((ab|ba){3,10}[cd]|aa*bc)+
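
As a quick check of this terminology, the following sketch tries two of the example patterns using Python's standard re module. Python's syntax is close to, but not identical with, Perl-compatible syntax, and the sample strings are made up for illustration.

```python
import re

# Piece: the atom (a|b+c) followed by the quantifier {3}.
piece = re.compile(r"(a|b+c){3}")

# Regular expression: a choice between the branches a*b and ab*.
choice = re.compile(r"a*b|ab*")

# fullmatch requires the entire string to match the pattern.
piece_ok = piece.fullmatch("abcbbc") is not None      # a, bc, bbc
piece_bad = piece.fullmatch("abc") is not None        # only two repetitions
choice_first = choice.fullmatch("aab") is not None    # matches branch a*b
choice_second = choice.fullmatch("abb") is not None   # matches branch ab*
choice_none = choice.fullmatch("aabb") is not None    # matches neither branch

print(piece_ok, piece_bad, choice_first, choice_second, choice_none)
# prints: True False True True False
```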

Regular Expression usage

Regular expressions are used commonly in programming contexts for these four types of operations:
  1. validate: determine if a string matches a pattern, giving a boolean (yes/no) result
  2. extract: extract portion(s) of a string which match a pattern
  3. substitute: substitute portion(s) of a string which match a pattern by a replacement string
  4. split: remove portions which match a pattern, thereby obtaining a list of strings
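
The four operations above can be sketched with Python's standard re module. The patterns and sample strings here are illustrative assumptions, not taken from any particular application.

```python
import re

# 1. validate: is the whole string a run of digits?
is_number = re.fullmatch(r"\d+", "2024") is not None

# 2. extract: collect every run of digits in the string
numbers = re.findall(r"\d+", "3 apples, 12 pears")

# 3. substitute: replace each run of whitespace with a single space
squeezed = re.sub(r"\s+", " ", "a   b \t c")

# 4. split: remove comma separators (and any spaces around them)
fields = re.split(r"\s*,\s*", "red , green,blue")

print(is_number)   # True
print(numbers)     # ['3', '12']
print(squeezed)    # a b c
print(fields)      # ['red', 'green', 'blue']
```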

Examples

When regular expression patterns are used in practice, they match a string if some portion of the string matches the pattern. In order to require that the entire string matches the pattern, it is common to use the ^ and $ anchors.

pattern                      description
.|\n                         matches any character
[a-zA-Z]                     matches any letter
[a-z]{4}                     matches a lower-case four-letter word
[^^-]                        matches any character except ^ or -
^[a-zA-Z]+$                  is a word of letters
^[a-zA-Z_]\w*$               is a simple Java identifier (no $ or non-ASCII characters)
^[\d.]+$                     is a string of digits or dots
^\s*$                        is a whitespace sequence or empty
^\s+|\s+$                    is a leading or trailing whitespace sequence
^(I|you|them)$               is one of these three words
^\S+$                        is a non-whitespace sequence
^[+-]?([1-9]\d*)?\d$         is an optionally signed integer (no leading zeros)
^[+-]?\d*\.\d{1,6}$          is an optionally signed 1-6 place decimal number
^([1-9]\d*)?\d(\.\d{2})?$    is an unsigned decimal number with or without 2-place decimal
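
A few of these patterns can be tried out with a short sketch, here using Python's re.search, which succeeds if some portion of the string matches. The helper name found and the test strings are assumptions made for illustration.

```python
import re

def found(pattern, text):
    """Return True if some portion of text matches pattern."""
    return re.search(pattern, text) is not None

# Without anchors, a match anywhere in the string is enough.
unanchored = found(r"[a-z]{4}", "say word now")      # "word" matches
# With ^ and $, the entire string must be a word of letters.
anchored = found(r"^[a-zA-Z]+$", "say word now")     # spaces prevent a match

# Integer with an optional sign and no leading zeros.
int_pattern = r"^[+-]?([1-9]\d*)?\d$"
int_results = [found(int_pattern, s) for s in ["0", "-42", "+7", "007", "4.2"]]

# Unsigned decimal number with or without a 2-place decimal part.
dec_pattern = r"^([1-9]\d*)?\d(\.\d{2})?$"
dec_results = [found(dec_pattern, s) for s in ["0", "12.50", "12.5", "012"]]

print(unanchored, anchored)   # True False
print(int_results)            # [True, True, True, False, False]
print(dec_results)            # [True, True, False, False]
```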

Extracting Matched Portions

Parentheses are used in regular expression patterns to identify exactly which portions of the string match which portions of the pattern. For example, given:
string:  abbabccd
pattern: ((a(b+))+)(c*)
The matched portions of the string are numbered by the order of their left parentheses, and so we would get:
abbab, ab, b, cc
The repeated group (a(b+)) matches first abb and then ab, which can be confusing. Only the last match of a repeated group is kept, so (a(b+)) yields ab and the inner (b+) yields b.
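
The same extraction can be reproduced with a minimal sketch in Python, whose standard re module numbers groups by left parenthesis in the same way. The variable names are chosen here for illustration.

```python
import re

match = re.search(r"((a(b+))+)(c*)", "abbabccd")

# groups() lists the captured portions in order of left parenthesis.
groups = match.groups()

# group(0) is the portion of the string matched by the whole pattern.
whole = match.group(0)

print(groups)   # ('abbab', 'ab', 'b', 'cc')
print(whole)    # abbabcc
```

Note that the trailing d is not part of the match, since the pattern is not anchored with $.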

Minimal match

When a pattern is anchored to the entire string, greedy and minimal (non-greedy) quantifiers accept the same strings; they differ only in which portions are extracted. For example, consider the string AABBBB matched against these two patterns:
^(\w+)(B+)$          (greedy)
^(\w+?)(B+)$         (minimal)
In the former case, \w+ would match AABBB, leaving a single B to match B+. In the latter case, \w+? would minimally match only AA, leaving BBBB to match B+.
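
This difference can be observed directly with a short sketch using Python's re module, which supports the same ? suffix for minimal matching. The variable names are illustrative.

```python
import re

text = "AABBBB"

# Greedy: \w+ takes as much as it can while still letting B+ match.
greedy = re.search(r"^(\w+)(B+)$", text).groups()

# Minimal: \w+? takes as little as it can while still letting B+ match.
minimal = re.search(r"^(\w+?)(B+)$", text).groups()

print(greedy)    # ('AABBB', 'B')
print(minimal)   # ('AA', 'BBBB')
```

Both patterns match the whole string; only the division between the two groups changes.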


© Robert M. Kline