A regular expression is a string that represents a pattern for matching one or more portions of a target string. Regular expressions are very general and, as a consequence, fairly complex: many different types of operations are represented by special characters, or meta-characters:
^ . [ $ ( ) | * + ? { \
Any non-alphanumeric character (whether special or not) can be made to act as a literal character by escaping it, namely, adding \ in front of it.
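As a small illustration of escaping, the sketch below compares escaped and unescaped meta-characters using Java's java.util.regex. The class name EscapeDemo and the helper fullMatch are only illustrative; Pattern.quote is the standard-library call that escapes an entire string at once.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped, '+' means "one or more" and '.' means "any character".
        System.out.println(fullMatch("1+1", "1+1"));    // false
        System.out.println(fullMatch("1\\+1", "1+1"));  // true: \+ is a literal plus
        System.out.println(fullMatch("a.c", "abc"));    // true: . matches b
        System.out.println(fullMatch("a\\.c", "abc"));  // false: \. only matches a dot
        // Pattern.quote escapes every character of its argument.
        System.out.println(fullMatch(Pattern.quote("a.c"), "a.c")); // true
    }
}
```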
If the character list inside brackets begins with '^', it matches any single character not from the rest of the list.
If two characters in the list are separated by '-', this is shorthand for the (inclusive) range of characters between those two. It is illegal for two ranges to share an endpoint, e.g. a-c-e. Ranges are collating-sequence dependent and should be avoided for portability. Most special characters lose their special status and become literals within brackets. Additionally,
( regular_expression )
For example, these are atoms:
a b [abc] (a|b+c) [^bc]
these, in addition to all above, are pieces:
a* c+ (ca*b)* (a|b+c){3}
these, in addition to all above, are branches:
aba a*b* [ab]*c* (a|b+c){3,10}(b|a+c){4}
these, in addition to all above, are regular expressions:
a*b|ab* a|b+|c* ((ab|ba){3,10}[cd]|aa*bc)+
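To make the bound and branch examples concrete, here is a minimal sketch that checks a few strings against two of the expressions above. It uses Java's java.util.regex, which accepts these particular expressions unchanged; the class name BoundDemo and the helper fullMatch are illustrative.

```java
import java.util.regex.Pattern;

class BoundDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String piece = "(a|b+c){3}"; // exactly three repetitions of the atom (a|b+c)
        System.out.println(fullMatch(piece, "abca"));   // true: a, bc, a
        System.out.println(fullMatch(piece, "bbcbca")); // true: bbc, bc, a
        System.out.println(fullMatch(piece, "aa"));     // false: only two repetitions

        String twoBranches = "a*b|ab*"; // two branches joined by |
        System.out.println(fullMatch(twoBranches, "aab"));  // true: first branch
        System.out.println(fullMatch(twoBranches, "abb"));  // true: second branch
        System.out.println(fullMatch(twoBranches, "aabb")); // false: neither branch
    }
}
```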
| pattern | description |
|---|---|
| .\|\n | matches any character |
| [a-zA-Z] | matches any letter |
| [a-z]{4} | matches a lower-case four-letter word |
| [^^-] | matches any character except ^ or - |
| ^[a-zA-Z]+$ | is a word of letters |
| ^[a-zA-Z_]\w*$ | is a simple Java identifier (ASCII letters, digits and underscore only) |
| ^[\d.]+$ | is a string of digits or dots |
| ^\s*$ | is whitespace sequence or empty |
| ^\s+\|\s+$ | is a leading or terminating whitespace sequence |
| ^(I\|you\|them)$ | is one of these three words |
| ^\S+$ | is a non-whitespace sequence |
| ^[+-]?([1-9]\d*)?\d$ | is a signed integer (no leading zeros) |
| ^[+-]?\d*\.\d{1,6}$ | is a signed 1-6 place decimal number |
| ^([1-9]\d*)?\d(\.\d{2})?$ | is an unsigned decimal number with or without 2-place decimal |
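A few of the anchored patterns from the table can be exercised with a short sketch like the one below. The class name TablePatternsDemo, the constant names and the sample inputs are illustrative choices, not part of the original table.

```java
import java.util.regex.Pattern;

class TablePatternsDemo {
    static final String IDENTIFIER = "^[a-zA-Z_]\\w*$";
    static final String SIGNED_INT = "^[+-]?([1-9]\\d*)?\\d$";
    static final String DECIMAL_1_TO_6 = "^[+-]?\\d*\\.\\d{1,6}$";
    static final String TWO_PLACE = "^([1-9]\\d*)?\\d(\\.\\d{2})?$";

    // True when the pattern is found in the input; the ^ and $ anchors in
    // the patterns above tie the match to the start and end of the input.
    static boolean test(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(test(IDENTIFIER, "_count1"));         // true
        System.out.println(test(IDENTIFIER, "1count"));          // false: starts with a digit
        System.out.println(test(SIGNED_INT, "-120"));            // true
        System.out.println(test(SIGNED_INT, "007"));             // false: leading zeros
        System.out.println(test(DECIMAL_1_TO_6, "+.5"));         // true
        System.out.println(test(DECIMAL_1_TO_6, "3.1415926"));   // false: seven decimal places
        System.out.println(test(TWO_PLACE, "12.50"));            // true
        System.out.println(test(TWO_PLACE, "12.5"));             // false: one decimal place
    }
}
```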
string: abbabccd
reg.expr. pattern: ((a(b+))+)(c*)
The matched portions of the string are numbered by the order of their left parentheses, and so we would get:
abbab, ab, b, cc
The repeated subpattern (a(b+)) matches both the abb and ab substrings, which can cause confusion; for a repeated group, only the last match is kept as that group's matched string.
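The group numbering can be checked with a minimal Java sketch such as the following. The class name GroupOrderDemo and the helper groups are illustrative; group 0 is the whole match, which here stops before the final d.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOrderDemo {
    // Returns group 0 (the whole match) followed by each numbered group,
    // or an empty array when the pattern is not found in the input.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("((a(b+))+)(c*)", "abbabccd");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + ": " + g[i]);
        }
        // group 0: abbabcc
        // group 1: abbab
        // group 2: ab   (last repetition of the inner group)
        // group 3: b
        // group 4: cc
    }
}
```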
^(\w+)(B+)$ (greedy)
^(\w+?)(B+)$ (minimal)
For the target string AABBBB, in the former case \w+ would match AABBB, leaving a single B to match B+. In the latter case \w+? would minimally match only AA, leaving BBBB to match B+.
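The difference between the two quantifiers can be observed directly. The sketch below assumes the target string AABBBB implied by the description; the class name GreedyDemo and the helper split are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the text captured by groups 1 and 2,
    // or null when the whole input does not match.
    static String[] split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] greedy = split("^(\\w+)(B+)$", "AABBBB");
        String[] minimal = split("^(\\w+?)(B+)$", "AABBBB");
        System.out.println(greedy[0] + " | " + greedy[1]);   // AABBB | B
        System.out.println(minimal[0] + " | " + minimal[1]); // AA | BBBB
    }
}
```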
String patternStr = "[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?";
String[] tests = { "12", "+12", "-33.44", "0", "+0.11", "02", "1.3" };
for (String testStr : tests) {
    System.out.println( testStr.matches( patternStr ) );
}
Observe that the pattern needs the literal "\" character, and since "\" is itself the escape character in a Java string literal, each one must be written as "\\".
A match performed in this manner is always a match of the complete string (not a substring), as if the anchor characters ^ and $ surrounded patternStr.
In this example all strings match except the last two.
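The whole-string behavior of String.matches can be contrasted with a substring search in a short sketch. Matcher.find is the standard java.util.regex call for locating a match anywhere in the input; the class name WholeMatchDemo and its two helpers are illustrative.

```java
import java.util.regex.Pattern;

class WholeMatchDemo {
    // Whole-string match, as performed by String.matches.
    static boolean whole(String regex, String input) {
        return input.matches(regex);
    }

    // Substring match: true when the pattern occurs anywhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("\\d+", "12"));            // true
        System.out.println(whole("\\d+", "12abc"));         // false: abc is left over
        System.out.println(anywhere("\\d+", "12abc"));      // true: 12 is a substring
        System.out.println(anywhere("^\\d+$", "12abc"));    // false: anchors restore the whole-string test
    }
}
```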
Pattern.matches(patternStr, testStr);
This behaves exactly like "testStr.matches(patternStr)" used above. More sophisticated regular expression operations use the following statements:
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(testStr);
The Pattern.compile operation can be used with a second parameter to specify other features of the intended matching operation. The most common example is to ensure that matches are case-insensitive by defining:
Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
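The effect of the flag can be seen in a minimal sketch like this one. The class name CaseFlagDemo, the helper matches and the sample strings are illustrative; passing 0 as the flags argument means no flags are set.

```java
import java.util.regex.Pattern;

class CaseFlagDemo {
    // Whole-string match using the given compile flags (0 means none).
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("hello", "HeLLo", 0));                        // false
        System.out.println(matches("hello", "HeLLo", Pattern.CASE_INSENSITIVE)); // true
    }
}
```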
matcher.group() // the matched string portion
matcher.start() // the start position of the matched portion
matcher.end() // the position just past the end of the matched portion
For example, consider the following program:
String patternStr = "[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?";
String testStr = "AB +22 C -4.51 D 8.0";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(testStr);
System.out.println( "test matches pattern: " + matcher.find() );
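This prints true, signifying that testStr contains a match of the pattern. Each call to matcher.find() resumes where the previous match ended, so after the call above, matcher.reset() is needed to start again from the beginning of the string. We can then obtain and show all matches by repeatedly applying matcher.find() in a loop like this: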
while (matcher.find()) {
    System.out.println(
        matcher.group()
        + "\tstart-end: " + matcher.start() + "-" + matcher.end() );
}
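This program segment illustrates that matcher.group() yields the matched substring starting at position matcher.start() and ending before matcher.end(). In this case there are four matched substrings:
+22, -4.51, 8, 0
The final 8.0 is split into 8 and 0 because the optional decimal part of the pattern requires exactly two digits after the dot.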
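The find loop can be packaged as a small helper that records each match with its positions. This is a sketch only: the class name FindAllDemo, the helpers findAll and countAfterFirstFind, and the "text@start-end" output format are illustrative. The second helper shows that a find() call made before the loop consumes the first match unless the matcher is reset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Collects every match as "text@start-end", where end is exclusive.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // m.group() is the same text as input.substring(m.start(), m.end()).
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    // Counts the matches seen by a loop that runs after one earlier find() call.
    static int countAfterFirstFind(String regex, String input, boolean reset) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.find();          // consumes the first match
        if (reset) {
            m.reset();     // rewinds to the start of the input
        }
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String regex = "[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?";
        String text = "AB +22 C -4.51 D 8.0";
        System.out.println(findAll(regex, text));
        // [+22@3-6, -4.51@9-14, 8@17-18, 0@19-20]
        System.out.println(countAfterFirstFind(regex, text, false)); // 3
        System.out.println(countAfterFirstFind(regex, text, true));  // 4
    }
}
```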
patternStr = "([a-z]+)(\\d+)";
testStr = "Ab c55 24 Hello3 a.2 8a bbb00";
pattern = Pattern.compile(patternStr);
matcher = pattern.matcher(testStr);
The pattern represents a lowercase letter sequence followed by a digit sequence. In this case, the parenthesized subpatterns separate the letter sequence from the digit sequence. We can identify the substrings which match the parenthesized subpatterns. Consider this program segment:
while (matcher.find()) {
    System.out.println( matcher.group()
        + "\tfirst: " + matcher.group(1) + ", second:" + matcher.group(2) );
}
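The expression matcher.group(i) yields the substring matched by the i-th parenthesized subpattern, counting left parentheses from left to right; matcher.group(0) is the same as matcher.group(). For the test string above, the three matches are c55 (first: c, second: 55), ello3 (first: ello, second: 3) and bbb00 (first: bbb, second: 00).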
System.out.println(testStr);
System.out.println(matcher.replaceFirst("---"));
System.out.println(matcher.replaceAll("==="));
System.out.println(matcher.replaceAll("$1:$2"));
These statements would have the following output:
Ab c55 24 Hello3 a.2 8a bbb00
Ab --- 24 Hello3 a.2 8a bbb00
Ab === 24 H=== a.2 8a ===
Ab c:55 24 Hello:3 a.2 8a bbb:00
In particular, the "$1", "$2" have special significance in the replacement string: they represent the matched substrings identified by the parenthesized subpatterns.
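Because "$" is special in a replacement string, two situations are worth sketching: reordering the captured groups, and inserting a literal dollar sign. Matcher.quoteReplacement is the standard-library call for the latter; the class name ReplaceDemo, the helper names and the "$n" replacement text are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    // Swaps the letter sequence and the digit sequence of every match.
    static String swap(String input) {
        return Pattern.compile("([a-z]+)(\\d+)").matcher(input).replaceAll("$2$1");
    }

    // Replaces every digit run with the literal text "$n".
    // quoteReplacement stops "$n" from being read as a group reference.
    static String literalDollar(String input) {
        String replacement = Matcher.quoteReplacement("$n");
        return Pattern.compile("\\d+").matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(swap("Ab c55 24 Hello3 a.2 8a bbb00"));
        // Ab 55c 24 H3ello a.2 8a 00bbb
        System.out.println(literalDollar("c55 24"));
        // c$n $n
    }
}
```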