PHP Regular Expressions

This document provides code segments that you can copy and paste into a PHP file to run some simple tests. For example, you can create a new PHP project, say PhpRegexProgs, drop the sample segments into the auto-created file index.php, and then execute it either from the shell or through a web interface. You can insert this test block into the body:
<pre>
<?php
 
/* sample code */
 
?>
</pre>
The PHP regular expression functions of main interest are the so-called Perl-compatible versions, which use the "preg_" prefix. The general form of these operations is this:
preg_OPERATION( "/REGULAR_EXPR/qualifiers", ... )
In this document the regular expression used for the operation is always bounded by "/" delimiters, although PHP accepts other delimiter characters as well. The main qualifier is "i", which indicates case-insensitive matching.

Validation

Validation usually means that a string must completely match a regular expression. Unlike Java's String.matches operation, preg_match succeeds when any substring matches, so the only way to force "completeness" is to pin down the match with the beginning and ending anchors, ^ and $, respectively.
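To see why the anchors matter, the following minimal sketch compares an unanchored and an anchored pattern on the same inputs. The pattern \d+ and the two test strings are illustrative assumptions chosen for this sketch.

```php
<?php
// Illustrative inputs (assumptions for this sketch).
$tests = array( "123", "abc123def" );

foreach ($tests as $test) {
  // Without anchors, preg_match succeeds if any substring matches.
  $partial = preg_match( '/\d+/', $test );
  // With ^ and $, the whole string must match.
  $full = preg_match( '/^\d+$/', $test );
  echo "$test: unanchored=$partial anchored=$full\n";
}
```

The second string passes the unanchored test but fails the anchored one. Note that $ also matches just before a final newline; when that matters, the "D" qualifier or \z in place of $ can be used.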

Here are some simple examples where the pattern string, $patternStr, represents an optionally signed number: an integer with no leading zeros, optionally followed by a decimal point and two decimal digits.
$patternStr = '^[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?$'; 
 
echo "pattern: $patternStr\n";
 
$tests = array( "12", "+12", "-33.44", "0", "+0.11", "02", "1.3" );
foreach ($tests as $test) {
  echo "pattern matches $test ? "; 
  if ( preg_match( "/$patternStr/", $test ) )
    echo "yes\n"; 
  else 
    echo "no\n";
}  
As in Java's regular expression handling, a literal "\" in a pattern string defined as above is best escaped, creating occurrences of "\\". (PHP leaves a single backslash in place when it precedes a character such as "d" or "." that does not form a string escape sequence, but doubling avoids surprises.) The pattern string can be bounded by double quotes to allow interpolation. In this case a literal "$" should also be escaped unless it cannot be read as the start of a variable name, as at the end of the string. Thus, these are both OK:
$patternStr = "^[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?$"; 
$patternStr = "^[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?\$"; 
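Alternatively, the pattern string can be placed directly into the preg_match call. Because "\d" and "\." are not escape sequences in PHP string literals, the backslashes can be left single (this holds for a pattern stored in a variable as well). We would write it this way: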
preg_match( "/^[+-]?([1-9]\d*)?\d(\.\d{2})?$/", $test )
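A quick way to check how PHP treats the backslashes is to compare the resulting strings directly. This is a minimal sketch; the variable names $single, $doubled, and $dquote are illustrative and not part of the examples above.

```php
<?php
// Three ways of writing the same pattern text as a PHP string literal.
$single  = '^[+-]?([1-9]\d*)?\d(\.\d{2})?$';       // single quotes, single backslashes
$doubled = '^[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?$';   // single quotes, doubled backslashes
$dquote  = "^[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?\$";  // double quotes, escaped $

var_dump( $single === $doubled );  // bool(true)
var_dump( $single === $dquote );   // bool(true)
```

All three literals produce the same pattern text, so the choice is mainly one of readability and of protecting "$" from interpolation.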
To achieve case-insensitivity in a match (or other operations), add the "i" qualifier after the terminal "/" as in these examples:
preg_match( "/^a+$/", "aaa")    returns    1 (match)
preg_match( "/^a+$/", "aAa")    returns    0 (no match)
preg_match( "/^a+$/i", "aAa")   returns    1 (match)

Matched substrings

When we're interested in extracting substrings from matched portions of a larger string we use preg_match and preg_match_all with a third parameter used to capture the matched information. In these situations, we typically do not want to use the beginning and ending anchors. Consider the following program:
$patternStr = '[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?'; 
 
$testStr = "AB +22 C -4.51  D  8.0";
 
preg_match( "/$patternStr/", $testStr, $matches );
print_r($matches[0]); echo "\n";
 
preg_match_all( "/$patternStr/", $testStr, $matches );
print_r($matches[0]); echo "\n";
A return value of 1 from preg_match indicates that some substring matches the pattern. In this first case, $matches[0] captures the first match, i.e.,
$matches[0] = "+22"
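In the second case, using preg_match_all, the matching operation continues to the end of the test string. The trailing "8.0" yields two separate matches, "8" and "0", because the optional fraction requires exactly two digits. The result is: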
$matches[0] = array( "+22", "-4.51", "8", "0" )
Having $matches[0] hold everything may seem an odd use of $matches. What about $matches[1] and beyond? The answer has to do with subpattern matches.

Subpattern matches

In many circumstances we're interested in subpatterns of a matched pattern. For example, consider the pattern and test string as inputs to the matching operations:
$patternStr = "([a-z]+)(\\d+)"; 
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";
 
preg_match( "/$patternStr/", $testStr, $matches );
print_r($matches);
 
preg_match_all( "/$patternStr/", $testStr, $matches );
print_r($matches);
The pattern represents a lowercase letter sequence followed by a digit sequence. The two parenthesized subpatterns separate the letter sequence from the digit sequence. As before, preg_match pertains only to the first match, storing into $matches:
Array( [0] => c55, [1] => c, [2] => 55 )
The "0" entry is the entire match, and entry n is the match of the n-th parenthesized subpattern.

In the second example using preg_match_all, $matches[0][k] becomes the full k-th match, and $matches[n][k] becomes the n-th parenthesized portion of the k-th match.

Thus, after
preg_match_all( "/$patternStr/", $testStr, $matches );
$matches is
[0] => Array ( [0] => c55  [1] => ello3  [2] => bbb00 )
[1] => Array ( [0] => c    [1] => ello   [2] => bbb   )
[2] => Array ( [0] => 55   [1] => 3      [2] => 00    )
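
The indexing described above can be exercised with a short loop. This is a sketch using the same pattern and test string; the variable $count and the output format are illustrative choices.

```php
<?php
$patternStr = "([a-z]+)(\\d+)";
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

// preg_match_all returns the number of full matches found.
$count = preg_match_all( "/$patternStr/", $testStr, $matches );

for ($k = 0; $k < $count; $k++) {
  // $matches[0][$k] is the k-th full match; [1] and [2] are its groups.
  echo $matches[0][$k], " = ", $matches[1][$k], " + ", $matches[2][$k], "\n";
}
```

If a per-match grouping is more convenient, passing PREG_SET_ORDER as a fourth argument to preg_match_all arranges the results so that $matches[k][n] is the n-th group of the k-th match.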

Replacement

Simple replacement uses preg_replace. The 3-argument call is:
preg_replace( PATTERN, REPLACEMENT_STRING, INPUT_STRING )
This replaces all matching substrings with the replacement. A fourth integer parameter can be used to limit the number of replacements. Thus this code:
$patternStr = "([a-z]+)(\\d+)"; 
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";
$replace = "----";
echo preg_replace( "/$patternStr/", $replace, $testStr ), "\n";
echo preg_replace( "/$patternStr/", $replace, $testStr, 1 ), "\n";
yields the following output:
Ab ---- 24 H---- a.2 8a ----
Ab ---- 24 Hello3 a.2 8a bbb00
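
The replacement string may also refer to the parenthesized groups of the match, written as $1, $2, and so on. The following sketch, using the same pattern and test string, swaps the letter and digit portions; the variable $swapped is an illustrative name.

```php
<?php
$patternStr = "([a-z]+)(\\d+)";
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

// '$2' and '$1' refer to the second and first parenthesized groups.
// Single quotes keep PHP from treating $1 and $2 as variables.
$swapped = preg_replace( "/$patternStr/", '$2$1', $testStr );
echo $swapped, "\n";
```

Here "c55" becomes "55c", and the other matches are rearranged in the same way.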

Replacement with callback

The preg_replace_callback function offers the most general possibilities. The call looks like this:
preg_replace_callback( "/PATTERN/", "my_callback_function", TARGET_STRING )
where my_callback_function is a function defined as follows:
function my_callback_function($m) {
  // using    $m[0] = the entire matched portion
  // and/or,  $m[n] = the matched portion for the n-th parenthesis group
  return /* the replacement code */;
}
Therefore this code segment:
$patternStr = "([a-z]+)(\\d+)"; 
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";
 
function replaceOp($m) {
   return $m[1] . ":" . ($m[2]+1);
}
 
echo preg_replace_callback( "/$patternStr/", "replaceOp", $testStr ), "\n";
transforms Ab c55 24 Hello3 a.2 8a bbb00
into Ab c:56 24 Hello:4 a.2 8a bbb:1
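
The callback does not have to be a named function. As a sketch of an alternative, an anonymous function can be passed directly; the explicit (int) cast and the variable $result are choices made for this illustration.

```php
<?php
$patternStr = "([a-z]+)(\\d+)";
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

// Anonymous function used as the callback; (int) makes the conversion explicit.
$result = preg_replace_callback(
  "/$patternStr/",
  function ($m) {
    return $m[1] . ":" . ((int) $m[2] + 1);
  },
  $testStr
);
echo $result, "\n";
```

This produces the same transformed string as the named-function version while keeping the replacement logic next to the call.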


© Robert M. Kline