Php Regular Expressions
(last updated: Mar 10, 2009)

This document gives code which can be used to run some simple tests by copy/pasting the code segments into a Php file:
<?php
# testfile.php

//sample code
and running it from the command line. It also provides a complete Php project, PhpWordSearch, which can be installed and executed from the default website. To do so, download and install the PhpWordSearch.zip archive. You will need the Dojo toolkit installed as described in the Php AJAX + Dojo document.

Command-line Php execution

Ideally you want to be able to create and run a .php file anywhere within the system via:
php mySampleFile.php
On a Linux system, you simply have to ensure that the command-line version of the Php package (on Ubuntu, php5-cli) is installed. On Windows, using our installation setup, the command-line executable is C:\php\php.exe. It is likely that C:\php\ is not a component of the PATH. Your options are to add C:\php\ to the PATH or to invoke the executable by its full path each time. The first alternative is most desirable because you may find Php to be a useful programming alternative.
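To confirm that the command-line interpreter is the one actually being run, a small script along the following lines can be saved and executed with php. This is only a sketch; the file name checkcli.php is hypothetical.

```php
<?php
# checkcli.php (hypothetical file name)

// PHP_SAPI is "cli" when the script is run by the command-line executable.
$sapi = PHP_SAPI;
$version = PHP_VERSION;

echo "PHP version: $version\n";
echo "SAPI: $sapi\n";
echo ($sapi === "cli") ? "running from the command line\n"
                       : "not running from the command line\n";
```

Run it as: php checkcli.php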

Regular expression functions

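The Php regular expression functions of main interest are the so-called Perl-compatible versions which use the "preg_" prefix, such as preg_match, preg_match_all, preg_replace and preg_replace_callback, all of which are used below. The general form of these operations is this: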
preg_OPERATION( "/REGULAR_EXPR/qualifiers", ... )
In the examples here, the regular expression used for the operation is always bounded by "/" delimiters (PCRE also accepts other delimiter characters, such as "#"). The main qualifier is "i", which indicates case-insensitive matching.
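As an illustration of the general form, the sketch below runs the same match with and without the "i" qualifier. One call uses "#" as the delimiter, which PCRE permits and which avoids escaping "/" inside the pattern. The sample strings are made up for illustration.

```php
<?php
$subject = "Hello World";

// Same pattern, without and with the case-insensitive qualifier.
$sensitive   = preg_match( "/hello/",  $subject );   // 0: "hello" is not "Hello"
$insensitive = preg_match( "/hello/i", $subject );   // 1

// An alternative delimiter: "#" instead of "/", handy when the
// pattern itself contains slashes.
$path = "/usr/local/bin";
$withHash = preg_match( "#^/usr/#", $path );         // 1

echo "$sensitive $insensitive $withHash\n";
```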

Validation via a regular expression

Validation usually means that a string must completely match a regular expression. Unlike Java's String.matches operation, the only way to force the "completeness" is to pin down the match with the beginning and ending anchors, ^ and $ respectively.

Here are some simple examples where the pattern string, $patternStr, represents an optionally signed number: an integer with no leading zeros, optionally followed by a decimal point and exactly two decimal digits.
$patternStr = '^[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?$'; 

echo "pattern: $patternStr\n";

$tests = array( "12", "+12", "-33.44", "0", "+0.11", "02", "1.3" );
foreach ($tests as $test) {
  echo "pattern matches $test ? "; 
  if ( preg_match( "/$patternStr/", $test ) )
    echo "yes\n"; 
  else 
    echo "no\n";
}  
As in Java's regular expression handling, when the pattern string is defined externally as above, each literal "\" in the pattern is written escaped, as "\\". (Php would in fact leave "\d" and "\." intact, since neither is a Php string escape, but doubling the backslash is the safer habit.) The pattern string can also be bounded by double quotes to allow interpolation. In this case a literal "$" should also be escaped unless it cannot begin a variable name, as at the end of the string. Thus, these are both OK:
$patternStr = "^[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?$"; 
$patternStr = "^[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?\$"; 
Alternatively, the pattern string can be placed directly into the preg_match call. Since "\d" and "\." are not Php string escapes, the single backslashes survive unchanged and we can write it this way:
preg_match( "/^[+-]?([1-9]\d*)?\d(\.\d{2})?$/", $test )
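The quoting variants can be compared directly. The sketch below, which uses a shortened version of the pattern for readability, builds the same pattern text three ways and checks that the resulting strings are identical; Php leaves a backslash in place when the following character is not one of its own string escapes.

```php
<?php
// Each literal below is written differently but yields the same
// pattern text, because \d and \. are not PHP string escapes.
$a = '^\d(\.\d{2})?$';       // single quotes, single backslashes
$b = '^\\d(\\.\\d{2})?$';    // single quotes, doubled backslashes
$c = "^\\d(\\.\\d{2})?\$";   // double quotes, doubled backslashes, escaped $

var_dump( $a === $b, $b === $c );
echo $a, "\n";
```

Printing the string, as in the last line, is a quick way to see exactly what the regular expression engine will receive.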
To achieve case-insensitivity in a match (or other operations), add the "i" qualifier after the terminal "/" as in these examples:
preg_match( "/^a+$/", "aaa")    is    true
preg_match( "/^a+$/", "aAa")    is    false 
preg_match( "/^a+$/i", "aAa")   is    true
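In practice it can be convenient to wrap an anchored pattern in a small function. The following is a minimal sketch; the name isValidAmount is hypothetical, and the comparison with 1 relies on preg_match returning 1 for a match, 0 for no match, and false on error.

```php
<?php
// Hypothetical helper: true only when the whole string is a valid amount.
function isValidAmount($s) {
  // preg_match returns 1 on a match, 0 on no match, false on error.
  return preg_match( '/^[+-]?([1-9]\d*)?\d(\.\d{2})?$/', $s ) === 1;
}

foreach ( array( "-33.44", "02", "1.3" ) as $test ) {
  echo $test, " => ", ( isValidAmount($test) ? "valid" : "invalid" ), "\n";
}
```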

Matched substring extraction

When we're interested in extracting substrings from matched portions of a larger string we use preg_match and preg_match_all with a third parameter used to capture the matched information. In these situations, we typically do not want to use the beginning and ending anchors. Consider the following program:
$patternStr = '[+-]?([1-9]\\d*)?\\d(\\.\\d{2})?'; 

$testStr = "AB +22 C -4.51  D  8.0";

preg_match( "/$patternStr/", $testStr, $matches );
print_r($matches[0]); echo "\n";

preg_match_all( "/$patternStr/", $testStr, $matches );
print_r($matches[0]); echo "\n";
A return value of 1 (true in a condition) indicates that some substring matches the pattern. In the first case, $matches[0] captures the first match, i.e.,
$matches[0] = "+22"
In the second case using preg_match_all, the matching operation goes to the end of the test string, obtaining:
$matches[0] = array( "+22", "-4.51", "8", "0" )
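The return value of preg_match_all is also useful: it is the number of full matches found. A short sketch using the same pattern and test string:

```php
<?php
$patternStr = '[+-]?([1-9]\d*)?\d(\.\d{2})?';
$testStr = "AB +22 C -4.51  D  8.0";

// preg_match_all returns the number of full matches found.
$count = preg_match_all( "/$patternStr/", $testStr, $matches );
echo "found $count matches\n";

// $matches[0] lists the full matches in order of appearance.
foreach ( $matches[0] as $k => $match ) {
  echo "match $k: $match\n";
}
```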
It may seem odd that $matches[0] holds everything. What about $matches[1], $matches[2], and so on? The answer has to do with subpattern matches.

Subpattern matches

In many circumstances we're interested in subpatterns of a matched pattern. For example, consider the pattern and test string as inputs to the matching operations:
$patternStr = "([a-z]+)(\\d+)"; 
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

preg_match( "/$patternStr/", $testStr, $matches );
print_r($matches);

preg_match_all( "/$patternStr/", $testStr, $matches );
print_r($matches);
The pattern represents a lower case letter sequence followed by a digit sequence. The two parenthesized subpatterns separate the letter sequence from the digit sequence. As before, preg_match pertains only to the first match, storing into $matches:
Array( [0] => c55, [1] => c, [2] => 55 )
The "0" entry is the entire match, and entry n is the match of the n-th parenthesized subpattern. In the second example, preg_match_all makes $matches[0][k] the full k-th match, and $matches[n][k] the n-th parenthesized portion of the k-th match. Thus $matches is
[0] => Array ( [0] => c55  [1] => ello3  [2] => bbb00 )
[1] => Array ( [0] => c    [1] => ello   [2] => bbb   )
[2] => Array ( [0] => 55   [1] => 3      [2] => 00    )
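If it is more natural to process one match at a time, preg_match_all accepts the PREG_SET_ORDER flag as a fourth argument, which groups the results per match rather than per subpattern. A sketch with the same pattern and test string (the variable name $sets is arbitrary):

```php
<?php
$patternStr = "([a-z]+)(\\d+)";
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

// PREG_SET_ORDER groups the results per match instead of per subpattern.
preg_match_all( "/$patternStr/", $testStr, $sets, PREG_SET_ORDER );

foreach ( $sets as $set ) {
  // $set[0] = full match, $set[1] = letters, $set[2] = digits
  echo "$set[0]: letters=$set[1] digits=$set[2]\n";
}
```

Here $sets[k][n] corresponds to $matches[n][k] in the default ordering shown above.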

Simple replacement

Simple replacement uses preg_replace. The 3-argument call is:
preg_replace( PATTERN, REPLACEMENT_STRING, INPUT_STRING )
This call replaces all matching substrings with the replacement string. A fourth integer parameter can be used to limit the number of replacements. Thus this code:
$patternStr = "([a-z]+)(\\d+)"; 
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";
$replace = "----";
echo preg_replace( "/$patternStr/", $replace, $testStr ), "\n";
echo preg_replace( "/$patternStr/", $replace, $testStr, 1 ), "\n";
yields the following output:
Ab ---- 24 H---- a.2 8a ----
Ab ---- 24 Hello3 a.2 8a bbb00
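The replacement string need not be fixed text: preg_replace also lets it refer to the parenthesized groups as $1, $2, and so on. The sketch below swaps the letter and digit sequences; the replacement is written in single quotes so that Php does not try to interpolate $1 and $2 as variables.

```php
<?php
$patternStr = "([a-z]+)(\\d+)";
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

// $1 and $2 in the replacement refer to the parenthesized groups;
// single quotes keep PHP from treating them as variables.
$swapped = preg_replace( "/$patternStr/", '$2$1', $testStr );
echo $swapped, "\n";
```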

Replacement with callback

The preg_replace_callback affords the most general possibilities. The call looks like this:
preg_replace_callback( "/PATTERN/", "my_callback_function", TARGET_STRING )
where my_callback_function is a function defined as follows:
function my_callback_function($m) {
  // using    $m[0] = the entire matched portion
  // and/or,  $m[n] = the matched portion for the n-th parenthesis group
  return /* the replacement code */;
}
Therefore this code segment:
$patternStr = "([a-z]+)(\\d+)"; 
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

function replaceOp($m) {
   return $m[1] . ":" . ($m[2]+1);
}

echo preg_replace_callback( "/$patternStr/", "replaceOp", $testStr ), "\n";
transforms Ab c55 24 Hello3 a.2 8a bbb00
into Ab c:56 24 Hello:4 a.2 8a bbb:1
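The callback does not have to be a named function. Assuming Php 5.3 or later, an anonymous function can be passed directly, as in this sketch, which upper-cases the letter sequence and doubles the number:

```php
<?php
$patternStr = "([a-z]+)(\\d+)";
$testStr = "Ab c55 24 Hello3 a.2 8a bbb00";

// Anonymous function as the callback: upper-case the letters, double the number.
$result = preg_replace_callback(
  "/$patternStr/",
  function ($m) {
    return strtoupper($m[1]) . ($m[2] * 2);
  },
  $testStr
);
echo $result, "\n";
```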

Application to keyword search and highlight

The following web application imitates some of the features we presented in the Java-based WordSearch application. In this case we can read and display from a restricted set of files (the examples subdirectory) on the server side. The text in these files is presented in the web application along with a keyword search mechanism for highlighting desired keywords.

The idea behind (complete) keyword search is to create a regular expression which places the word boundary anchors (\b) around the embedded keyword in a case-insensitive search. We want to replace matched instances with some sort of HTML-based highlighting, perhaps with color and bolding.

Given a suitable $keyword we might use a replacement like this:
$color = "#cc0000"; // desired replacement color (example value)
function highlight($m) {
  global $color;
  $startTag = "<span style=\"color:$color;font-weight:bold\">";
  $endTag = "</span>";
  return $startTag . $m[0] . $endTag;
}
$rep1 = preg_replace_callback( "/\b$keyword\b/i", "highlight", $text );
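When the keyword has not been checked beforehand, it may contain characters that have a special meaning in a regular expression. The following sketch is one way to guard against that, assuming Php 5.3 or later: it passes the keyword through preg_quote and hands the color to the callback with a closure instead of a global. The function name highlightKeyword and the sample text are hypothetical.

```php
<?php
// Hypothetical helper: wrap whole-word, case-insensitive matches of
// $keyword in a colored, bold span.
function highlightKeyword($text, $keyword, $color) {
  // preg_quote escapes regex metacharacters (and the "/" delimiter).
  $pattern = "/\\b" . preg_quote($keyword, "/") . "\\b/i";
  return preg_replace_callback(
    $pattern,
    function ($m) use ($color) {
      return "<span style=\"color:$color;font-weight:bold\">" . $m[0] . "</span>";
    },
    $text
  );
}

echo highlightKeyword( "The cat sat; Cat and concat.", "cat", "#cc0000" ), "\n";
```

Because of the word boundary anchors, "cat" and "Cat" are highlighted but the "cat" inside "concat" is not.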
In the sample program below, there are a number of other features of interest:
  1. Using an iframe to hold the display content. We access the iframe document via the JavaScript:
    window.frames[0].document
    
  2. Using the Dojo "color-picker" (different from the "color-chooser") widget. This example draws from the rich web enhancements available from the so-called dojox (dojo extension) repertoire.

    With proper initial loadings, the color-picker is instantiated by the single HTML element:
        <div id="pickerToo" dojoType="dojox.widget.ColorPicker"
          animatePoint="false"
          showHsv="false"
          showRgb="false"  
          webSafe="false"
          onchange="setColor(this.value)"
        ></div>
    
The point about the color-picker is that it is a complex, prebuilt widget which only needs to be plugged in to be of use. The cost is the additional client-side code and the implied bandwidth requirements.

index.php
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Highlighter</title>
<link rel="stylesheet" type="text/css" href="/dojolib/dojox/widget/ColorPicker/ColorPicker.css" />
<script type="text/javascript" src="/dojolib/dojo/dojo.js" djConfig="parseOnLoad:true"></script>
<script type="text/javascript" src="js/index.js"></script>
<script type="text/javascript">
dojo.require("dojo.parser");
dojo.require("dojox.widget.ColorPicker");

var colorchosen

function setColor(selected_color){
  console.log(selected_color)
  colorchosen = true
  dojo.byId("highlight").color.value = selected_color
}

dojo.addOnLoad( function() {
  getList()
  colorchosen = false
  console.log(colorchosen)
} )
</script>
</head>
<body>

<form id="list">
  File: <select name="file" onchange="getFile();return false;"></select>
  <button onclick="getFile();return false">Refresh</button>
</form>

<table>
<tr valign="top">
<td>
  <div id="pickerToo" dojoType="dojox.widget.ColorPicker"
    animatePoint="false"
    showHsv="false"
    showRgb="false"
    webSafe="false"
    onchange="setColor(this.value)"
  ></div>
</td>
<td>
  <form id="highlight">
    <input type="hidden" name="color" />
    Keyword: <br />
    <input type="text" name="keyword" />
    <p>
    <button onclick="highlight();return false">Highlight</button>
    </p>
  </form>
</td>
</tr>
</table>

<iframe src="content.php" style="width:100%;height:400px"></iframe>

</body>
</html>

content.php
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body id="content"></body>
</html>

js/index.js
function getList() {
  var select = dojo.byId('list').file;
  dojo.xhrGet( {
    url: 'handlers/get-list.php',
    sync: true,
    handleAs: "json",
    load: function (response, ioArgs) {
      var options = []
      for (i = 0; i < response.length; ++i) {
        options[i] = new Option( response[i], response[i] )
      }
      options = [ new Option( "-----", "" ) ].concat( options )
      for (i = 0; i < options.length; ++i) {
        select[i] = options[i]
      }
    },
    error: function (response, ioArgs) {
      alert("error: " + response)
    }
  } );
}

function getFile() {
  var select = dojo.byId('list').file;
  var content = window.frames[0].document.getElementById("content")
  dojo.xhrGet( {
    url: 'handlers/get-file.php',
    content: { file: select.value },
    load: function (response, ioArgs) {
      content.innerHTML = response
    },
    error: function (response, ioArgs) {
      alert("error: " + response)
    }
  } );
}

function highlight() {
  if (!colorchosen) {
    alert("must choose a color")
    return
  }
  var content = window.frames[0].document.getElementById("content")
  dojo.xhrPost( {
    url: "handlers/highlight.php",
    form: "highlight",
    content: { text: content.innerHTML },
    load: function (response, ioArgs) {
      content.innerHTML = response
    },
    error: function (response, ioArgs) {
      if (ioArgs.xhr.status == 420) {  // expected error
        alert( "error: " + ioArgs.xhr.responseText )
      }
      else {  // unexpected error
        alert("error: " + response)
      }
    }
  } );
}

function setColor(selectedColor) {
  dojo.byId('highlight').show_color.style.backgroundColor = selectedColor
  dojo.byId('highlight').color.value = selectedColor
}

handlers/get-list.php
<?php
$examples_dir = "../examples";
$dh = opendir( $examples_dir ) or die ("can't open $examples_dir");
while (($_ = readdir($dh)) !== false) {
  if ($_ == "." || $_ == ".." ) continue;
  $entry = htmlspecialchars($_);
  $entries[] = '"' . $entry . '"';
}
closedir($dh);
sort($entries);
echo "[" . join(",", $entries) . "]";

handlers/get-file.php
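<?php
// basename() strips any directory part, so a request cannot reach
// outside the examples subdirectory (e.g. file=../handlers/highlight.php)
$file = basename($_GET['file']);
$content = file_get_contents("../examples/$file");
$content = htmlspecialchars($content);
// turn blank-line paragraph breaks into HTML line breaks
$content = preg_replace("/\n\s*\n/", "<br /><br />", $content);
echo $content;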

handlers/highlight.php
<?php
$keyword = trim($_POST['keyword']);
$text = $_POST['text'];
$color = $_POST['color'];

if (!preg_match( "/^\w+$/", $keyword )) {
  header("HTTP/1.0 420 My Own Error Code" );
  die( "illegal keyword field" );
}

function highlight($m) {
  global $color;
  $startTag = "<span style=\"color:$color;font-weight:bold\">";
  $endTag = "</span>";
  return $startTag . $m[0] . $endTag;
}

echo preg_replace_callback( "/\b$keyword\b/i", "highlight", $text );
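The handler validates the keyword but inserts the posted color into the style attribute as received. The same anchored-pattern technique could be applied to it. The sketch below assumes the color-picker delivers values of the form #rrggbb, which should be checked against the widget's actual output; the function name isHexColor is hypothetical.

```php
<?php
// Hypothetical check: accept only "#rrggbb" hexadecimal colors
// (assumes that is the format the color-picker submits).
function isHexColor($s) {
  return preg_match( '/^#[0-9a-f]{6}$/i', $s ) === 1;
}

foreach ( array( "#ff0000", "#FFAA00", "red", "#12345" ) as $c ) {
  echo $c, " => ", ( isHexColor($c) ? "ok" : "rejected" ), "\n";
}
```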


© Robert M. Kline