Java Regular Expressions

This document provides small demonstration programs in the form of Java main classes meant to be created and run individually within a parent NetBeans project JavaRegex. Java regular expressions are used, as in other programming languages, to solve these problems:
  1. validate: determine if a string matches a pattern, giving a boolean (yes/no) result
  2. extract: extract portion(s) of a string which match a pattern
  3. substitute: substitute portion(s) of a string which match a pattern by a replacement string
  4. split: break a string at the portions which match a pattern, thereby obtaining a list of the remaining pieces
In Java all regular expressions are written as Strings, which means that every backslash in a pattern must itself be escaped (doubled) in the source code. For example, the common regular expression \w+ would be written as the String
"\\w+"

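As a small illustration of the escaping rule, the sketch below prints a pattern as the regular expression engine sees it and uses an escaped dot to split a string. The class name EscapeDemo and the sample strings are made up for this example.

```java
import java.util.Arrays;

class EscapeDemo {
    // The Java literal "\\w+" holds the three characters \ w +
    static final String WORD = "\\w+";
    // A regex for a literal dot: the regex is \. and the literal doubles the backslash.
    static final String DOT = "\\.";

    public static void main(String[] args) {
        System.out.println(WORD);           // prints \w+
        System.out.println(WORD.length());  // prints 3
        // Splitting on a literal dot gives the three pieces between the dots.
        System.out.println(Arrays.toString("a.b.c".split(DOT)));  // prints [a, b, c]
        // An unescaped "." matches any character, so every character is a
        // separator and no pieces remain.
        System.out.println("a.b.c".split(".").length);            // prints 0
    }
}
```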
String-based operations

Many of the simple validation and replacement operations by regular expressions can be achieved by String-based member functions. The matches member function is used as follows:
if ( sample.matches( pattern_string ) ) {
  // sample does match the pattern
}
The pattern_string is regarded as complete in the sense that the entire string must match the pattern. In particular, there is no need for the initial and terminal anchors ^ and $ to delimit the pattern as one would in other situations; they are implied.
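The whole-string behavior can be seen in a minimal sketch like the following, where the class name MatchesDemo, the helper isLettersThenDigits and the sample strings are chosen only for illustration:

```java
class MatchesDemo {
    // Lowercase letters followed by digits, and nothing else.
    static boolean isLettersThenDigits(String sample) {
        return sample.matches("[a-z]+\\d+");
    }

    public static void main(String[] args) {
        System.out.println(isLettersThenDigits("abc123"));    // true
        System.out.println(isLettersThenDigits("abc123 "));   // false: trailing space
        System.out.println(isLettersThenDigits("x abc123"));  // false: leading text
        // Explicit anchors are accepted but change nothing.
        System.out.println("abc123".matches("^[a-z]+\\d+$")); // true
    }
}
```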

Java has a number of String member functions which start with the prefix "replace"; however, only replaceFirst and replaceAll rely on regular expressions to determine the substrings to be replaced. Unlike matches, these operations are allowed to match substrings of the sample string.
String new_string = sample.replaceFirst( pattern_string, replacement );
String new_string = sample.replaceAll( pattern_string, replacement );
The following sample program illustrates these usages:
javaregex.StringRegexOps  
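The sample program itself is not reproduced here. A minimal sketch of the two calls, with a class name, sample string and replacement text assumed for illustration, might look like this:

```java
class StringReplaceSketch {
    // Lowercase letters followed by digits (case-sensitive).
    static final String PATTERN = "[a-z]+\\d+";

    static String first(String sample) {
        return sample.replaceFirst(PATTERN, "#");
    }

    static String all(String sample) {
        return sample.replaceAll(PATTERN, "#");
    }

    public static void main(String[] args) {
        String sample = "Ab c55 24 Hello3 a.2 8a bbb00";
        System.out.println(first(sample));  // Ab # 24 Hello3 a.2 8a bbb00
        System.out.println(all(sample));    // Ab # 24 H# a.2 8a #
    }
}
```

Note that only "ello3" is replaced within "Hello3", because the pattern does not accept the uppercase H.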
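The limitation of this type of replacement scheme is that the replacement is a fixed template: it can refer to the matched text through the variables $0, $1, ... (described under Replacement below), but it cannot be computed from the matched content. For example:
  1. surrounding every occurrence which matches "[a-z]+\\d+" by delimiters like "<" and ">" is possible, using the replacement "<$0>"
  2. capitalizing every occurrence of a substring which matches "[a-z]+\\d+" is not possible with replaceAll and a replacement string alone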
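Capitalizing each match requires computing the replacement from the matched text. One way to sketch this uses the Matcher class described in the next section together with its appendReplacement and appendTail member functions; the class name CapitalizeSketch and the helper capitalizeMatches are assumptions made for this example, not part of the sample programs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapitalizeSketch {
    static String capitalizeMatches(String patternString, String sample) {
        Matcher matcher = Pattern.compile(patternString).matcher(sample);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Compute the replacement from the matched text; quoteReplacement
            // keeps any $ or \ in it from being treated specially.
            String upper = matcher.group().toUpperCase();
            matcher.appendReplacement(result, Matcher.quoteReplacement(upper));
        }
        matcher.appendTail(result);  // copy the text after the last match
        return result.toString();
    }

    public static void main(String[] args) {
        String sample = "Ab c55 24 Hello3 a.2 8a bbb00";
        // prints: Ab C55 24 HELLO3 a.2 8a BBB00
        System.out.println(capitalizeMatches("[a-z]+\\d+", sample));
    }
}
```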

The java.util.regex classes

For more sophisticated matching/replacement operations, Java uses the classes Pattern and Matcher in the java.util.regex package. An alternate way of expressing validation is via the function:
Pattern.matches(pattern_string, sample);
which behaves exactly like "sample.matches(pattern_string)" used above. More sophisticated regular expression operations use the following statements:
Pattern pattern = Pattern.compile(pattern_string);
Matcher matcher = pattern.matcher(sample);
The Pattern.compile operation can use a second parameter to specify other features of the intended matching operation. The most common example is to ensure that matches are case-insensitive by defining:
Pattern pattern = Pattern.compile(pattern_string, Pattern.CASE_INSENSITIVE);
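The effect of the flag can be checked with a short sketch. The class name CaseInsensitiveDemo and the helper matchesWith are hypothetical; the helper uses Matcher's own matches() member function, which tests the entire string in the same way as sample.matches(pattern_string).

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Compile with the given flags (0 means no flags) and test the whole string.
    static boolean matchesWith(String patternString, String sample, int flags) {
        Pattern pattern = Pattern.compile(patternString, flags);
        return pattern.matcher(sample).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWith("hello\\d", "Hello3", 0));                         // false
        System.out.println(matchesWith("hello\\d", "Hello3", Pattern.CASE_INSENSITIVE));  // true
    }
}
```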
The matching operation is initiated by the call:
matcher.find()
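Each call returns true if another match was found, so it is typically used as the condition of a while loop. A useful feature of matcher.find(), which is crucial to our later example, is that after a successful call the matcher can report the string positions which delimit the matched substring, using these member functions: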
int start = matcher.start();
int end = matcher.end();
The following sample program illustrates these usages:
javaregex.SubstringMatch  
The expression matcher.group() yields the entire matched substring, while matcher.start() (inclusive) and matcher.end() (exclusive) reveal the delimiting indices within the full string.
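A loop over all matches, reporting each matched substring with its delimiting indices, can be sketched as follows. The class name FindLoopSketch, the helper describeMatches and its "text@start-end" output format are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopSketch {
    // Returns one entry per match, formatted as text@start-end.
    static List<String> describeMatches(String patternString, String sample) {
        Matcher matcher = Pattern.compile(patternString).matcher(sample);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            int start = matcher.start();  // index of the first matched character
            int end = matcher.end();      // index just past the last matched character
            // sample.substring(start, end) is the same text as matcher.group()
            found.add(sample.substring(start, end) + "@" + start + "-" + end);
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "Ab c55 24 Hello3 a.2 8a bbb00";
        // prints: [c55@3-6, ello3@11-16, bbb00@24-29]
        System.out.println(describeMatches("[a-z]+\\d+", sample));
    }
}
```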

Subpattern matches

In many circumstances we're interested in subpatterns of a matched pattern. For example, consider the pattern and test string:
String pattern_string = "([a-z]+)(\\d+)";
String sample = "Ab c55 24 Hello3 a.2 8a bbb00";

Pattern pattern = Pattern.compile(pattern_string);
Matcher matcher = pattern.matcher(sample);
In this case, our usual pattern uses parenthesized subpatterns which separate the letter sequence from the digit sequence. We can identify the substrings which match the parenthesized subpatterns. The following sample program illustrates these usages:
javaregex.SubpatternMatch  
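matcher.group(i) is the substring matching the subpattern defined by the ith parenthesis pair, counting opening parentheses from the left and starting at 1; matcher.group(0) is the same as matcher.group().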
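A minimal sketch of subpattern extraction with the pattern and test string above follows. The class name SubpatternSketch and the "letters|digits" output format are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubpatternSketch {
    // Returns one entry per match, formatted as letters|digits.
    static List<String> lettersAndDigits(String sample) {
        Pattern pattern = Pattern.compile("([a-z]+)(\\d+)");
        Matcher matcher = pattern.matcher(sample);
        List<String> parts = new ArrayList<>();
        while (matcher.find()) {
            String letters = matcher.group(1);  // first parenthesis pair
            String digits = matcher.group(2);   // second parenthesis pair
            parts.add(letters + "|" + digits);
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints: [c|55, ello|3, bbb|00]
        System.out.println(lettersAndDigits("Ab c55 24 Hello3 a.2 8a bbb00"));
    }
}
```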

Replacement

The Matcher class has its own replaceFirst and replaceAll member functions, which take only the replacement argument since the pattern and sample string are already known to the matcher. The effect is the same as that of the String versions, and in both cases the replacement argument can refer to the matched portions. The following sample program illustrates these usages:
javaregex.Replacement  
Within the replacement string, we can use the special variables $0, $1, $2 with these meanings:
$0 = the entire matched substring (corresponding to matcher.group()),
$1 = the first parenthesized match subpattern (corresponding to matcher.group(1)), ...
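These variables can be tried out with a short sketch such as the one below. The class name ReplacementSketch and its helper names are hypothetical, and the pattern is the two-group pattern from the Subpattern matches section.

```java
import java.util.regex.Pattern;

class ReplacementSketch {
    static final Pattern PATTERN = Pattern.compile("([a-z]+)(\\d+)");

    // Surround the first match by < and >.
    static String bracketFirst(String sample) {
        return PATTERN.matcher(sample).replaceFirst("<$0>");
    }

    // Surround every match by < and >.
    static String bracketAll(String sample) {
        return PATTERN.matcher(sample).replaceAll("<$0>");
    }

    // Put the digits of every match in front of its letters.
    static String swapAll(String sample) {
        return PATTERN.matcher(sample).replaceAll("$2$1");
    }

    public static void main(String[] args) {
        String sample = "Ab c55 24 Hello3 a.2 8a bbb00";
        System.out.println(bracketFirst(sample));  // Ab <c55> 24 Hello3 a.2 8a bbb00
        System.out.println(bracketAll(sample));    // Ab <c55> 24 H<ello3> a.2 8a <bbb00>
        System.out.println(swapAll(sample));       // Ab 55c 24 H3ello a.2 8a 00bbb
    }
}
```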


© Robert M. Kline