Regular Expressions
(last updated: Feb 2, 2009)

Perl-compatible regular expressions

The so-called Perl-compatible regular expressions offer enhancements to the POSIX extended variety used in other software programs.

A regular expression is a string representing a pattern used for matching some portion(s) of a target string. Regular expressions are very general and, as a consequence, fairly complex: many different types of operations are represented by special characters, called meta-characters.

Regular expression atoms

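Regular expression syntax is usually described in a grammatical form, but we'll describe it more loosely from the bottom up, the way you would create one. At the bottom are the basic components, called atoms. Here is a non-exhaustive list, drawn from the atoms used in the examples on this page:
  a single literal character, such as a
  a character class, such as [abc], or a negated character class, such as [^bc]
  a predefined class, such as \d (digit), \w (word character), \s (whitespace), or \S (non-whitespace)
  the . character, which matches any character except (by default) a line terminator
  a parenthesized regular expression, such as (a|b+c)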

Quantifiers

Regular expressions use quantifiers to specify how many times an atom may be repeated, including unbounded repetition. An atom can optionally be followed by one of these quantifiers:
  *       zero or more times
  +       one or more times
  ?       zero or one time
  {n}     exactly n times
  {n,m}   at least n and at most m times
The quantifier can in turn be followed by "?", indicating that the match should be minimal (non-greedy, reluctant) as opposed to the default maximal (greedy) match.

Full regular expressions

An atom optionally followed by a quantifier is called a piece.

The juxtaposition (concatenation) of pieces is called a branch. A target string matches a branch if it can be divided into consecutive substrings, each of which matches the corresponding piece.

A regular expression is a choice of branches, written as a string in which the branches are separated by the | character. A choice is matched if at least one of the branches matches.

For example, these are atoms:
a     b     [abc]     (a|b+c)   [^bc]
These, in addition to all of the above, are pieces:
a*    c+    (ca*b)*   (a|b+c){3}
These, in addition to all of the above, are branches:
aba   a*b*  [ab]*c*   (a|b+c){3,10}(b|a+c){4}
These, in addition to all of the above, are regular expressions:
a*b|ab*     a|b+|c*   ((ab|ba){3,10}[cd]|aa*bc)+

Regular Expression usage

Regular expressions are used commonly in programming contexts for these four types of operations:
  1. validate: determine if a string matches a pattern, giving a boolean (yes/no) result
  2. extract: extract portion(s) of a string which match a pattern
  3. substitute: substitute portion(s) of a string which match a pattern by a replacement string
  4. split: remove portions which match a pattern, thereby obtaining a list of the strings that lie between them
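The four operations above can be sketched in Java, the language used later on this page. This is only an illustration: the class name RegexOperations, its method names, and the digit and comma patterns are assumptions chosen for the example, not part of any library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexOperations {
    // 1. validate: does the whole string consist of one or more digits?
    static boolean validate(String s) {
        return s.matches("\\d+");
    }

    // 2. extract: return the first run of digits, or null if there is none
    static String extract(String s) {
        Matcher m = Pattern.compile("\\d+").matcher(s);
        return m.find() ? m.group() : null;
    }

    // 3. substitute: replace every run of digits by "#"
    static String substitute(String s) {
        return s.replaceAll("\\d+", "#");
    }

    // 4. split: break the string apart at commas, ignoring whitespace around them
    static List<String> split(String s) {
        return Arrays.asList(s.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(validate("2009"));           // true
        System.out.println(extract("Feb 2, 2009"));     // 2
        System.out.println(substitute("Feb 2, 2009"));  // Feb #, #
        System.out.println(split("a , b,c"));           // [a, b, c]
    }
}
```

Note that split discards the matched separators themselves; only the pieces between them are returned.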

Examples

When regular expression patterns are used in practice, they match a string if some portion of the string matches the pattern. In order to require that the entire string matches the pattern, it is common to use the ^ and $ anchors.

pattern                     description
.|\n                        matches any character
[a-zA-Z]                    matches any (ASCII) letter
[a-z]{4}                    matches four consecutive lower-case letters
[^^-]                       matches any character except ^ or -
^[a-zA-Z]+$                 is a word of letters
^[a-zA-Z_]\w*$              is a Java identifier (restricted to ASCII letters, digits, and _)
^[\d.]+$                    is a string of digits or dots
^\s*$                       is a whitespace sequence or empty
^\s+|\s+$                   is a leading or trailing whitespace sequence
^(I|you|them)$              is one of these three words
^\S+$                       is a non-whitespace sequence
^[+-]?([1-9]\d*)?\d$        is an optionally signed integer (no leading zeros)
^[+-]?\d*\.\d{1,6}$         is an optionally signed 1-6 place decimal number
^([1-9]\d*)?\d(\.\d{2})?$   is an unsigned number (no leading zeros), with or without a 2-place decimal
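A few of these patterns can be tried out with a short Java sketch. The helper found is a hypothetical name introduced here; it uses substring semantics (Matcher.find), so the ^ and $ anchors in the patterns do the work of requiring a whole-string match.

```java
import java.util.regex.Pattern;

class PatternTableDemo {
    // True if some portion of s matches the pattern (substring semantics).
    static boolean found(String regex, String s) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        // Unanchored: four lower-case letters somewhere in the string
        System.out.println(found("[a-z]{4}", "xx word yy"));               // true

        // Anchored: the whole string must be an identifier
        System.out.println(found("^[a-zA-Z_]\\w*$", "total_2"));           // true
        System.out.println(found("^[a-zA-Z_]\\w*$", "2total"));            // false

        // Optionally signed integer without leading zeros
        System.out.println(found("^[+-]?([1-9]\\d*)?\\d$", "-407"));       // true
        System.out.println(found("^[+-]?([1-9]\\d*)?\\d$", "007"));        // false

        // Unsigned number with or without a 2-place decimal
        System.out.println(found("^([1-9]\\d*)?\\d(\\.\\d{2})?$", "12.50")); // true
        System.out.println(found("^([1-9]\\d*)?\\d(\\.\\d{2})?$", "12.5"));  // false
    }
}
```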

Extracting Matched Portions

The parentheses are used in regular expression patterns to identify exactly which portions of the string match which portions of the pattern. For example, given:
string: abbabccd
reg.expr. pattern: ((a(b+))+)(c*)
The matched portions of the string are numbered by the order of their left parentheses, and so we would get:
abbab, ab, b, cc
Because the subpattern (a(b+)) is repeated, it matches first abb and then ab, which can be a cause for confusion; only the last match is kept as the matched string for that subpattern.
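The numbering by left parentheses can be checked with a small Java sketch. The class and method names are hypothetical; Matcher.find, groupCount, and group(i) are the standard java.util.regex calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOrderDemo {
    // Returns groups 1..groupCount() of the first match in s, or null if nothing matches.
    static String[] groups(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);   // group i is defined by the ith left parenthesis
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("((a(b+))+)(c*)", "abbabccd");
        System.out.println(String.join(", ", g));   // abbab, ab, b, cc
    }
}
```

The repeated subpatterns (groups 2 and 3) report only their final iteration, ab and b, even though the first iteration matched abb and bb.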

Minimal match

The greedy and minimal (non-greedy) quantifiers differ only in terms of extraction of the matched portion. For example, consider the string AABBBB matched against these two patterns:
^(\w+)(B+)$          (greedy)
^(\w+?)(B+)$         (minimal)
In the former case, \w+ would match AABBB leaving a single B to match B+. In the latter case \w+ would minimally match only AA, leaving BBBB to match B+.
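The difference can be observed directly in Java. In this sketch the class GreedyVsMinimalDemo and its helper twoGroups are hypothetical names; the patterns and the target string are the ones given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsMinimalDemo {
    // Returns { group 1, group 2 } for a whole-string match, or null if there is no match.
    static String[] twoGroups(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String target = "AABBBB";

        String[] greedy = twoGroups("^(\\w+)(B+)$", target);
        System.out.println(greedy[0] + " | " + greedy[1]);     // AABBB | B

        String[] minimal = twoGroups("^(\\w+?)(B+)$", target);
        System.out.println(minimal[0] + " | " + minimal[1]);   // AA | BBBB
    }
}
```

Both patterns accept the same string; only the way the string is divided between the two groups changes.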

Java Regular Expression Handling

Validation via a regular expression

A Java String can use the member function matches to determine whether it matches a regular expression or not. Here is a simple example in which the pattern string, patternStr, represents an optionally signed number: an integer with no leading zeros, optionally followed by a decimal point and two decimal digits.
String patternStr = "[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?"; 

String[] tests = { "12", "+12", "-33.44", "0", "+0.11", "02", "1.3" };
for (String testStr: tests)
  System.out.println( testStr.matches( patternStr ) );
Observe that the pattern needs a literal "\" character, which must itself be escaped inside a Java string literal, giving the occurrences of "\\". The match performed in this manner is always a match of the complete string (not a substring), as if the anchor characters ^ and $ surrounded patternStr. In this example all strings match except the last two.

The java.util.regex classes Pattern and Matcher

Java uses two classes Pattern and Matcher in the package java.util.regex for additional regular expression operations. An alternate way of expressing validation is via the static matches function
Pattern.matches(patternStr,testStr);
which behaves exactly like "testStr.matches(patternStr)" used above. More sophisticated regular expression operations use the following statements:
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(testStr);
The Pattern.compile operation can be used with a second parameter to specify other features of the intended matching operation. The most common example is to ensure that matches are case-insensitive by defining:
Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
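As a small illustration of the second parameter, the sketch below compares a pattern compiled with no flags (0) against the same pattern compiled with Pattern.CASE_INSENSITIVE. The class name and the helper wholeMatch are assumptions made for this example; the word pattern is taken from the Examples table.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Whole-string match of s against regex, compiled with the given flags.
    static boolean wholeMatch(String regex, String s, int flags) {
        return Pattern.compile(regex, flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String patternStr = "(I|you|them)";

        // Default matching is case-sensitive
        System.out.println(wholeMatch(patternStr, "YOU", 0));                         // false

        // With the flag, letter case is ignored
        System.out.println(wholeMatch(patternStr, "YOU", Pattern.CASE_INSENSITIVE));  // true
    }
}
```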

Substring matches

Calls to matcher.find() initiate the matching operations. If this function returns true, it means a match was found. Repeated calls will find all matches in the string. There are three particularly important member functions which are relevant when a match has been found:
matcher.group()           // the matched string portion
matcher.start()           // the start position of matched portion
matcher.end()             // the end position of matched portion
For example, consider the following program:
String patternStr = "[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?"; 
String testStr = "AB +22 C -4.51 D 8.0";

Pattern pattern = Pattern.compile(patternStr);  
Matcher matcher = pattern.matcher(testStr);

System.out.println( "test matches pattern: " + matcher.find() );
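This prints true, signifying that testStr contains a match of the pattern. We can obtain and show all matches by repeatedly applying matcher.find() in a loop. Because the find() call above has already consumed the first match, the matcher is reset first:
matcher.reset();          // start searching again from the beginning of testStr
while (matcher.find()) {
  System.out.println( 
       matcher.group() 
       + "\tstart-end: " + matcher.start() + "-" + matcher.end() );
}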
This program segment illustrates that matcher.group() yields the matched substring starting at position matcher.start() and ending before matcher.end(). In this case there are four matched substrings:
+22, -4.51, 8, 0

Subpattern matches

In many circumstances we're interested in subpatterns of a matched pattern. For example, consider the pattern and test string:
patternStr = "([a-z]+)(\\d+)"; 
testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

pattern = Pattern.compile(patternStr);   
matcher = pattern.matcher(testStr);
The pattern represents a lower-case letter sequence followed by a digit sequence. In this case, the parenthesized subpatterns separate the letter sequence from the digit sequence. We can identify the substrings which match the parenthesized subpatterns. Consider this program segment:
while (matcher.find()) {
  System.out.println( matcher.group() 
     + "\tfirst: " + matcher.group(1) + ", second:" + matcher.group(2) );
}
The expression matcher.group(i) yields the substring matched by the ith parenthesized subpattern, counting left parentheses from 1; matcher.group(0) is the same as matcher.group(), the whole match.
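For reference, here is a self-contained sketch of the same loop that collects its results in a list. The class SubpatternDemo and the method describe are hypothetical names; the pattern and test string are the ones defined above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubpatternDemo {
    // Collects "<match> first=<group 1> second=<group 2>" for every match found in s.
    static List<String> describe(String regex, String s) {
        Matcher matcher = Pattern.compile(regex).matcher(s);
        List<String> out = new ArrayList<>();
        while (matcher.find()) {
            out.add(matcher.group() + " first=" + matcher.group(1)
                    + " second=" + matcher.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe("([a-z]+)(\\d+)", "Ab c55 24 Hello3 a.2 8a bbb00")) {
            System.out.println(line);
        }
        // c55 first=c second=55
        // ello3 first=ello second=3
        // bbb00 first=bbb second=00
    }
}
```

Note that the second match is ello3 rather than Hello3, because the upper-case H is not in [a-z].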

Replacement

Replacement of the matched portions uses the replaceFirst and replaceAll member functions of the matcher, each of which resets the matcher before it starts. Given the definition of the above matcher object, these calls:
System.out.println(testStr);
System.out.println(matcher.replaceFirst("---"));
System.out.println(matcher.replaceAll("==="));
System.out.println(matcher.replaceAll("$1:$2"));
would have the following output:
Ab c55 24 Hello3 a.2 8a bbb00
Ab --- 24 Hello3 a.2 8a bbb00
Ab === 24 H=== a.2 8a ===
Ab c:55 24 Hello:3 a.2 8a bbb:00
In particular, the "$1", "$2" have special significance in the replacement string: they represent the matched substrings identified by the parenthesized subpatterns.


© Robert M. Kline