Text Processing

This document is based on the topics and examples in the textbook, Chapter 9. Additions to the textbook material include a more thorough discussion of regular expressions and the usage of GUIs.

Example Applications

The larger programming examples all are part of the Java Application TextProcessing which you should install from existing sources. Download the source archive
TextProcessing.zip
Install TextProcessing as a Java Application with Existing Sources. See the Using NetBeans document for details.

The TextProcessing application has multiple Main Classes intended to illustrate various independent features. The simplest way to run the various classes is to select Run File from the right-click menu invoked from the Projects window with the file selected, or from the Editor window itself. You could attempt to run the project, but you would need to reset the run configuration when you change Main Classes.

The examples package

The classes within represent additional test-programs.

The NetBeans default package for Textbook classes

The sample programs in the textbook are intended to be used without packages. In NetBeans' terms, this is regarded as the default package (which NetBeans suggests not using).

Character Matching

These 3 programs from the textbook make use of static functions from the Character wrapper class to analyze characters of a string.
CharacterTest  
CustomerNumber  
StringAnalyzer  

Textprocessing example

Run the class
textprocessing.CharMatch
by right-clicking and selecting Run File. You can re-run by activating the re-run button from the Output window. This is an adaptation of a program from the textbook. Type a string into the textfield, then press the Check button. A popup tells you whether the string is "valid" or not. A valid string follows this pattern:
3 letters followed by 4 digits.
Our solution differs by obtaining the input from a textfield within a frame, instead of from a popup. The Main Class is as follows. The actual code is fully documented.

textprocessing.CharMatch
package textprocessing;
 
/* imports */
 
public class CharMatch {
 
  private final CharMatchFrame frame = new CharMatchFrame();
 
  public CharMatch() {
 
    frame.setTitle(getClass().getSimpleName());
 
    frame.getCheckButton().addActionListener(new ActionListener() {
      @Override
      public void actionPerformed(ActionEvent e) {
        JTextField inputField = frame.getInputField();
        String input = inputField.getText();
 
        boolean isValid = isValid1(input);
        //boolean isValid = isValid2(input);
 
        if (!isValid) {
          JOptionPane.showMessageDialog(frame, "invalid");
        } else {
          JOptionPane.showMessageDialog(frame, "valid");          
        }
      }
    });
  }
 
  private boolean isValid1(String custNumber) {
    if (custNumber.length() != 7) {
      return false;
    }
    for(int i = 0; i < 3; ++i) {
      char c = custNumber.charAt(i);
      if (!Character.isLetter(c)) {
        return false;
      }
    }
    for(int i = 3; i < 7; ++i) {
      char c = custNumber.charAt(i);
      if (!Character.isDigit(c)) {
        return false;
      }
    }
    return true;
  }
 
  private boolean isValid2(String custNumber) {
    return custNumber.matches("[a-zA-Z]{3}\\d{4}");
  }
 
  public static void main(String[] args) {
    CharMatch app = new CharMatch();
    app.frame.setVisible(true);
  }
}
The GUI frame has these interface functions:

views.CharMatchFrame
package views;
 
import javax.swing.JButton;
import javax.swing.JTextField;
 
public class CharMatchFrame extends javax.swing.JFrame {
 
  public JTextField getInputField() { return input; }
 
  public JButton getCheckButton() { return check; }
  ...
}
The frame uses the default Free Design layout with these two components added:
JTextField  input
JButton     check
The UML diagram would be:

CharMatchFrame

- javax.swing.JTextField input
- javax.swing.JButton check

+ CharMatchFrame()

+ getInputField()
+ getCheckButton()

The controller class assigns a button listener to the check button. When the button is pressed, the text keyed into the input textfield is obtained through the getInputField function and then matched against the pattern for validity using either isValid1 (default), or isValid2. The success or failure to match is reported by a JOptionPane message dialog.

Validity check functions

The isValid1 function follows the code in the textbook by checking each character in the string using static member functions from the Character class. Given the custNumber string, the character at a specific position i is retrieved by:
char c = custNumber.charAt(i);
These two static functions do the testing:
Character.isLetter(char c);
Character.isDigit(char c);
The isValid2 function uses the more sophisticated String.matches function, which matches according to regular expression patterns. In particular, the matching pattern
"[a-zA-Z]{3}\\d{4}"
precisely represents 3 letters followed by 4 digits.
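To see that the two functions agree on typical input, here is a minimal standalone sketch. The class name ValidityCompare and the sample strings are our own choices for illustration, not part of the TextProcessing project:

```java
public class ValidityCompare {

  // character-by-character check, following isValid1
  public static boolean isValid1(String custNumber) {
    if (custNumber.length() != 7) {
      return false;
    }
    for (int i = 0; i < 3; ++i) {
      if (!Character.isLetter(custNumber.charAt(i))) {
        return false;
      }
    }
    for (int i = 3; i < 7; ++i) {
      if (!Character.isDigit(custNumber.charAt(i))) {
        return false;
      }
    }
    return true;
  }

  // regular expression check, following isValid2
  public static boolean isValid2(String custNumber) {
    return custNumber.matches("[a-zA-Z]{3}\\d{4}");
  }

  public static void main(String[] args) {
    // hypothetical sample inputs: one valid, three invalid
    for (String s : new String[]{"abc1234", "ab12345", "abcd123", "abc123"}) {
      System.out.println(s + ": " + isValid1(s) + " " + isValid2(s));
    }
  }
}
```

On plain ASCII input the two checks give the same answers. They can differ on other input: Character.isLetter accepts letters outside a-z and A-Z (an accented "é", for instance), which the character class [a-zA-Z] rejects.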

String classes

A string is just a sequence of characters and Java has several string-related classes and interfaces which can be used:
interface CharSequence
  specifies the "char charAt(int)" prototype and others.
class String implements CharSequence
class StringBuffer implements CharSequence
class StringBuilder implements CharSequence
class CharBuffer implements CharSequence
class Segment implements CharSequence
There are many operations on strings or individual characters defined by the String and Character classes. In practice, it is a good idea to let NetBeans help you see what all is available. For example, go to any empty line within any function. To get a list of possible static member functions of the String class, type:
String.    
To get a list of the non-static member functions, type:
new String().    
From these lists you can start keying in possible member functions one keystroke at a time. Type "m" to get a sublist containing matches:
public boolean matches(String regex)
Erase up to and including the "." and retype the "." to look for others. Try typing "s" to find:
public String[] split(String regex)

Regular Expressions

The textbook introduces regular expressions in section 9.4 and uses these in examples with the split member function. We want to take it much further.

Regular Expressions are strings meant to serve as patterns for matching multiple strings. Every string of standard identifier characters a-z, A-Z, 0-9, _ or blanks (plus others) is a pattern matching itself only. For example:
2be_or not2BE
Regular Expressions employ many non-identifier characters as meta-characters to create patterns. These meta-characters include
. [ ] * + \ - { } ( ) ^ | $
Here are some of the usage concepts: Also important, but typically not necessary with Java regular expression matching are the so-called anchor expressions:
^     beginning anchor, forcing the match to start at the beginning of the string
$     end anchor, forcing the match to finish at the end of the string
Here is a list of common sample patterns that we will use:
[a-z]       any lowercase letter
[A-Z]       any uppercase letter
[a-zA-Z]    any letter
[^a-zA-Z]   any character not a letter
[0-9]       any digit
[^0-9]      any character not a digit
\d          any digit, same as [0-9]
\s          any whitespace character
\w          an identifier character: letter, digit or underscore, same as [a-zA-Z0-9_]
\D          any character not a digit, same as [^0-9]
\S          any character not a whitespace
\W          any character not an identifier character
.           any character except newline
(.|\n)      any character — "|" means "or"
.*          any string not containing a newline
\s*         sequence of whitespace characters, possibly empty
\s+         non-empty sequence of whitespace characters
\d+         non-empty sequence of digits
\S+         non-empty sequence of characters, none a whitespace
\W+         non-empty sequence of characters, none an identifier character
\d{2}       exactly 2 digits
[a-z]{3,5}  between 3 and 5 lowercase letters
\.          a literal period character
\*          a literal asterisk character
\+          a literal plus character

Realizing a regular expression as a Java String

Every meta-character you see above can be used as is in a Java String as itself, except one, the escape character "\". This character, as you know, has a special meaning in order to correctly interpret "\n" as newline instead of a literal backslash followed by "n". The way to make the escape character be literal is to escape it. Consequently, every "\" used in a regular expression must be doubled as "\\" when put into a String.
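A small sketch may make the doubling rule concrete. The class name EscapeDemo and the decimal-number pattern are our own illustration; the point is only that the Java literal contains twice as many backslashes as the regular expression it represents:

```java
public class EscapeDemo {

  // the regular expression \d+\.\d+ written as a Java String literal
  public static final String REGEX = "\\d+\\.\\d+";

  public static void main(String[] args) {
    // printing shows the regex as the matcher sees it, with single backslashes
    System.out.println(REGEX);

    System.out.println("3.14".matches(REGEX));  // digits, literal period, digits
    System.out.println("3x14".matches(REGEX));  // "x" is not a literal period
  }
}
```

The first line printed is the 8-character expression \d+\.\d+, followed by true and false.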

The Java matches function

The call to matches is this:
some_string.matches(some_regex)
The return value is true if the entire string matches the regular expression exactly, and false otherwise. The key concept is in the word "exactly." Other methods of matching used by other operators in Java, as well as those of other programming languages, offer alternatives such as matching only a portion of the string. As long as you know what is expected, regular-expression syntax is available to express precisely what you want.

Examples

We wrote the definition of this function as:
private boolean isValid2(String custNumber) {
  return custNumber.matches("[a-zA-Z]{3}\\d{4}");
}
There are alternatives for the matching statement. One avoids the backslash with the expression:
custNumber.matches("[a-zA-Z]{3}[0-9]{4}")
Another avoids the two character ranges by forcing lower case:
custNumber.toLowerCase().matches("[a-z]{3}\\d{4}")
In all cases the string length is implicitly 7, required by the match being complete, e.g.,
"aBc1234".matches("[a-zA-Z]{3}\\d{4}") == true
"BEFORE aBc1234 AFTER".matches("[a-zA-Z]{3}\\d{4}") == false 
You can create partial match within a single line (no newlines) by adding this expression before and/or after the desired match expression:
.*
For example:
"BEFORE aBc1234 AFTER".matches(".*[a-zA-Z]{3}\\d{4}.*") == true 
Here's another test class. Again, it's useful to key some of this in by hand, particularly the regular expressions which are new to you:

examples.MatchTest
package examples;
 
public class MatchTest {
 
  public static void main(String[] args) {
 
    String test_str;
    String pattern_str;
 
    test_str = "aBc1234";
    pattern_str = "[a-zA-Z]{3}\\d{4}";
 
    System.out.println( "1: " + test_str.matches(pattern_str) );
 
    test_str = "BEFORE aBc1234 AFTER";
    System.out.println( "2: " + test_str.matches(pattern_str) );
 
    pattern_str = ".*[a-zA-Z]{3}\\d{4}.*";
    System.out.println( "3: " + test_str.matches(pattern_str) );
  }
}

The RegExr Learning Tool

The web application found at this site (on our course site):
RegExr Learning Tool
is a great help for learning how to use regular expressions. Enter a regular expression inside
/THE_EXPRESSION/g
It is JavaScript-driven, so results appear automatically. The matching mechanism is partial, meaning that it will match substrings. The "g" at the end means global and instructs the matcher to generate all possible matches. If you remove the "g", you'll only get the first match. Here are some testing examples which you should key in:

[abc]
[a-z]
[az-]
[A-Z]
[A-Z]+
[a-z0-9]
[a-z0-9]+
\d{3}
\d{2,4}
\d{3,}
\d+
\w+
\D+
\W+
\W+\d+
\s+\d+
[^a-z]
[^a-z]+
full
.*full
full.*
full.*\n+.*
\s+
\S+
\S\s+
.
\.
\..

In addition to trying various regular expressions, you can also edit the text provided to show the outcome.

StringBuilder / StringBuffer

These classes provide efficient ways of constructing Strings. A String per se, is considered to be an immutable object. Once we create, say
String str = "hello";
no more alterations can be performed on the str object. Therefore a sequence of operations like this:
String construction = "";
construction += "AAAAAAAAAAA ";
construction += "BBBBBBBB ";
construction += "CCCC";
...
String answer = construction;
is inefficient because a new string must be constructed at each step, starting from a duplication of the previous part. In contrast, StringBuilder objects are effectively sequences of characters which can be altered without duplicating the previous part. Therefore, this equivalent sequence is much more efficient:
StringBuilder construction = new StringBuilder();
construction.append("AAAAAAAAAAA ");
construction.append("BBBBBBBB ");
construction.append("CCCC");
...
String answer = construction.toString();
The StringBuffer class has exactly the same functionality as StringBuilder except that it is thread-safe, meaning that if two Java threads are operating concurrently on the same StringBuffer object, the result is guaranteed to not be corrupted by the concurrent access. These issues of concurrency are the purview of the Operating Systems course.

As a simple test of usage, create this main class within the examples package. You'll learn better if you key in new operations.

StringBuilderTest.java
package examples;
 
public class StringBuilderTest {
 
  public static void main(String[] args) {
    StringBuilder construction = new StringBuilder();
    construction.append("AAAAAAAAAAA ");
    construction.append("BBBBBBBB ");
    construction.append("CCCC");
 
    System.out.println(construction);
 
    construction.replace(0, 8, "********");
 
    System.out.println(construction);
 
    String answer = construction.toString();
    //System.out.println(answer);
  }
}
As a side note, many of the StringBuilder methods like append and replace return StringBuilder objects instead of void, which means that they support method chaining. The 3 append operations could be written equivalently like this:
construction.append("AAAAAAAAAAA ").append("BBBBBBBB ").append("CCCC");

Textbook StringBuilder Program

This class/main class example illustrates using the insert and delete members of StringBuilder to manipulate a String.
Telephone  
TelephoneTester  
The main program has two parts:
  1. The first part converts an unformatted telephone number, like 1112223333, into its formatted form, (111)222-3333. It uses the static Telephone.format function, inserting the characters "(", ")", and "-" at the right positions with the StringBuilder insert operation.
  2. The second part converts a formatted telephone number, like (111)222-3333, into the unformatted form, 1112223333. It uses the static Telephone.unformat function, removing the characters "(", ")", and "-" at the right positions with the StringBuilder deleteCharAt operation.
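The textbook's Telephone class is not reproduced here. As a rough idea of how the two operations could work, here is a minimal sketch under our own assumptions: the class name TelephoneSketch is hypothetical, the input is assumed to be well-formed (exactly 10 digits, or exactly the (ddd)ddd-dddd form), and no validation is done.

```java
public class TelephoneSketch {

  // 1112223333 -> (111)222-3333, assuming exactly 10 digits
  public static String format(String unformatted) {
    StringBuilder sb = new StringBuilder(unformatted);
    sb.insert(0, '(');   // (1112223333
    sb.insert(4, ')');   // (111)2223333
    sb.insert(8, '-');   // (111)222-3333
    return sb.toString();
  }

  // (111)222-3333 -> 1112223333, assuming the exact formatted layout
  public static String unformat(String formatted) {
    StringBuilder sb = new StringBuilder(formatted);
    // delete from right to left so earlier positions are not shifted
    sb.deleteCharAt(8);  // remove '-'
    sb.deleteCharAt(4);  // remove ')'
    sb.deleteCharAt(0);  // remove '('
    return sb.toString();
  }

  public static void main(String[] args) {
    String formatted = format("1112223333");
    System.out.println(formatted);
    System.out.println(unformat(formatted));
  }
}
```

Each insert shifts the characters after it one position to the right, which is why the insert positions are 0, 4 and 8 rather than 0, 3 and 6.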

Other String-related classes

The CharBuffer class, from the java.nio package, is based on a char array. It has been part of Java since version 1.4, and its usage is presumably more efficient than StringBuilder in some circumstances. The Segment class, from the javax.swing.text package, is also based on a char array. It acts like a String representing a portion of a larger text and is efficient because it avoids the copy operations necessary to make an actual String.
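Because all of these classes implement CharSequence, a single function written against that interface accepts any of them. The following sketch illustrates this; the class name CharSequenceDemo, the countDigits helper, and the sample text are our own, chosen only for demonstration:

```java
import java.nio.CharBuffer;
import javax.swing.text.Segment;

public class CharSequenceDemo {

  // works for any CharSequence: uses only length() and charAt()
  public static int countDigits(CharSequence seq) {
    int count = 0;
    for (int i = 0; i < seq.length(); ++i) {
      if (Character.isDigit(seq.charAt(i))) {
        ++count;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    char[] text = "abc1234 xyz56".toCharArray();

    System.out.println(countDigits("abc1234"));                     // String
    System.out.println(countDigits(new StringBuilder("abc1234")));  // StringBuilder
    System.out.println(countDigits(CharBuffer.wrap(text)));         // whole array
    // a Segment views part of the array (offset 8, 5 chars: "xyz56") without copying
    System.out.println(countDigits(new Segment(text, 8, 5)));
  }
}
```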

The split member function

The split function is the second most important regular expression function (following matches) in the String class. This function obviates the necessity of the StringTokenizer class discussed in the textbook. StringTokenizer is referred to as a legacy class.

The goal is to split some text into its "parts," which in simplest terms means non-whitespace substrings. The usage is like this:
String test_str = /* ... */;
String regex = /* ... */;

String[] pieces = test_str.split(regex);
Again, here's a sample main class for testing:

examples.SplitTest
package examples;
 
import java.util.Arrays;
 
public class SplitTest {
 
  public static void main(String[] args) {
 
    String test_str = "Here is a    very common sentence - hope you like it!";
    String regex = "\\s+";
 
    String[] pieces = test_str.split(regex);
 
    System.out.println( Arrays.toString(pieces) );
  }
}
Also try "[^a-z]" and others for regex
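When experimenting with other expressions, it helps to compare a single-character pattern against its "+" version. The sketch below does so on a short string; the class name SplitVariants, the show helper, and the test strings are our own illustration:

```java
import java.util.Arrays;

public class SplitVariants {

  // split text by regex and render the resulting array as a String
  public static String show(String text, String regex) {
    return Arrays.toString(text.split(regex));
  }

  public static void main(String[] args) {
    // each non-letter is its own separator, so "22" leaves an empty piece
    System.out.println(show("a1b22c", "[^a-z]"));     // [a, b, , c]

    // a run of non-letters counts as one separator
    System.out.println(show("a1b22c", "[^a-z]+"));    // [a, b, c]

    // a separator at the very start yields a leading empty piece
    System.out.println(show("  two words", "\\s+"));  // [, two, words]
  }
}
```

Empty pieces like these are easy to overlook when the array is printed one element per line, so Arrays.toString is a convenient way to spot them.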

Textbook split examples

These are rather simplistic. The first two split on regular expressions consisting of literal strings. The second two split on single characters defined by character sets.
SplitDemo1  
SplitDemo2  
SplitDemo3  
SplitDemo4  

SplitString Example

This non-GUI application illustrates features of both the String split method and the StringBuilder class. To make it more interesting, the String with which we work is obtained from a text file held in the application directory's root. The file is the text of The Tell-Tale Heart by Edgar Allan Poe. We want to split this file into "pieces" and show the user the pieces.

Run the file:
textprocessing.SplitString
The program is this:

textprocessing.SplitString
package textprocessing;
 
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
 
/**
 * @author Robert Kline
 */
public class SplitString {
 
  /**
   * read the file "tell-tale-heart.txt" and "tokenize" it,
   * generating an array of Strings split by the given 
   * regular expression
   * 
   * @param args the command line arguments
   * @throws java.io.IOException: if the file does not exist
   */
  public static void main(String[] args) throws IOException {
    String target = "tell-tale-heart.txt";
    String content = new String(Files.readAllBytes(Paths.get(target)));
 
    String regex;
    regex = "\\s+";  // whitespace sequences
    //regex = "\\W+";  // non-identifier sequences
    //regex = "\\n+";  // newline sequences
 
    // split the file content string by regex sequences
    String[] pieces = content.split(regex);
 
    // construct the format string to be used to print lines
    Integer num_pieces = pieces.length;
    int field_len = num_pieces.toString().length();
    String format_str = "%" + field_len + "d: %s";
 
    System.out.format("format string: \"%s\"\n", format_str);
 
    StringBuilder output = new StringBuilder();
    int line = 0;
    for (String piece : pieces) {
      String str = String.format(format_str + "\n", ++line, piece);
      output.append( str );
    }
    System.out.println(output.toString());
  }
}
There are basically 3 steps:
  1. Read the target file into a content string using a Java 1.7 helper function:
    String target = /* target file name */;
    String content = new String(Files.readAllBytes(Paths.get(target)));
  2. Remove occurrences of regex from the content, leaving an array of strings:
    String[] pieces = content.split(regex);
  3. Loop through the pieces array, printing to standard output.
When you use the regular expression string "\\s+" the outcome is what would happen by using StringTokenizer: it removes all whitespace sequences, leaving the "words" of the text.

Try the alternatives, uncommenting this one:
regex = "\\W+";
You'll see the difference in that you get the "real" words, minus the unwanted punctuation. The last variation illustrates how to get the lines, test by uncommenting this:
regex = "\\n+";
Here are other points about the program:
  1. The file we are using, tell-tale-heart.txt, resides at the top level of the application; you can see it in the Files window. We read it into a string using code which employs Java 1.7 Files and Paths classes:
    String target = "tell-tale-heart.txt";
    String content = new String(Files.readAllBytes(Paths.get(target)));
    We will discuss File I/O later in the course.
  2. A new thing is the need to deal with an error if the file does not exist. Java treats such errors as exceptions, another topic to be discussed later in the course. Reading the file can generate (throw) an IOException. For now we mostly need to make the code "recognize" that this exception can happen by modifying the main function declaration:
    public static void main(String[] args) throws IOException
  3. Each piece is presented in one line of the form:
    String str = String.format(format_str + "\n", ++line, piece);
    These str pieces are appended to an output string using the StringBuilder append operation.
  4. The display string displays the line number and piece with a format like this:
    "%Nd: %s"
    
    The N controls the size of the digit field in which the line number is entered; blank padding is added to fill out to N characters. The effect is to make the string pieces all line up in the output. In this case, since there are 2156 words, the desired format string turns out to be:
    "%4d: %s"
    
    Computation of the desired size is done by computing the string length of the number of pieces:
    Integer num_pieces = pieces.length;
    int field_len = num_pieces.toString().length();
    String format_str = "%" + field_len + "d: %s";
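The width computation can be tried in isolation. In this sketch the class name FieldWidthDemo, the formatFor helper, and the three sample pieces are our own; only the technique of deriving the field width from the digit count comes from the program above:

```java
public class FieldWidthDemo {

  // build "%Nd: %s" where N is the number of digits in count
  public static String formatFor(int count) {
    int fieldLen = Integer.toString(count).length();
    return "%" + fieldLen + "d: %s";
  }

  public static void main(String[] args) {
    String formatStr = formatFor(2156);   // "%4d: %s"
    System.out.println("format string: " + formatStr);

    // hypothetical pieces, numbered to show the padding
    String[] pieces = {"True!", "nervous", "very"};
    int[] lineNumbers = {1, 25, 2156};
    for (int i = 0; i < pieces.length; ++i) {
      System.out.println(String.format(formatStr, lineNumbers[i], pieces[i]));
    }
  }
}
```

The line numbers 1, 25 and 2156 are each right-aligned in a 4-character field, so the colons and pieces line up.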

String Search Methods

Finding a substring within a string is a common task. There are several ways to do so. These String methods are common for searching:
boolean contains(CharSequence s)
boolean startsWith(String prefix)
boolean startsWith(String prefix, int index)
boolean endsWith(String suffix)
int indexOf(String str)
int indexOf(String str, int fromIndex)
int lastIndexOf(String str)
int lastIndexOf(String str, int fromIndex)
Perhaps the most functional of all are the indexOf and lastIndexOf operations in these forms
int indexOf(String str, int fromIndex)
int lastIndexOf(String str, int fromIndex)
In both cases you enter the substring being searched for, str, as well as where to start the search, fromIndex. The substring is recognized as being found if the value returned is a non-negative number, in which case, the value is the starting position of the substring. The value -1 is returned if the substring is not found.

The two-argument indexOf function can be used to find all substrings of a given string by manipulating the fromIndex, moving it from 0 past every successful match.
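The boolean search methods listed above can be compared side by side with the index-based ones. Here is a minimal sketch; the class name SearchMethodsDemo and the summarize helper are our own, for illustration only:

```java
public class SearchMethodsDemo {

  // report the results of several search methods for one substring
  public static String summarize(String text, String sub) {
    return String.format(
        "contains=%b startsWith=%b endsWith=%b first=%d last=%d",
        text.contains(sub), text.startsWith(sub), text.endsWith(sub),
        text.indexOf(sub), text.lastIndexOf(sub));
  }

  public static void main(String[] args) {
    String text = "0123abc78abc";

    System.out.println(summarize(text, "abc"));
    System.out.println(summarize(text, "0123"));
    System.out.println(summarize(text, "xyz"));

    // the two-argument startsWith tests for the prefix at a given position
    System.out.println(text.startsWith("abc", 4));
  }
}
```

For "abc" the summary reports contains=true, startsWith=false, endsWith=true, first=4 and last=9; for "xyz" both index values are -1.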

Textbook Example using the startsWith function

PersonSearch  

IndexOf and SubString Methods

Here is an illustrative "tester program":

examples.StringSearchTest
package examples;
 
public class StringSearchTest {
  public static void main(String[] args) {
    String srch = "0123abc78abc";
    String str = "abc";
 
    System.out.format("srch: %s\nstr:  %s\n", srch, str);
    System.out.println();
 
    int find_pos;
 
    // indexOf, lastIndexOf from 0
 
    find_pos = srch.indexOf(str);
    System.out.format("%s.indexOf(%s) = %d\n", srch, str, find_pos);
 
    find_pos = srch.lastIndexOf(str);
    System.out.format("%s.lastIndexOf(%s) = %d\n", srch, str, find_pos);
 
    System.out.println();
 
    // indexOf, lastIndexOf from other positions
 
    for (int start : new int[]{4, 5, 10}) {
      find_pos = srch.indexOf(str, start);
      System.out.format("%s.indexOf(%s,%s) = %s\n", srch, str, start, find_pos);
    }
 
    System.out.println();
 
    for (int start : new int[]{9, 8, 4, 3}) {
      find_pos = srch.lastIndexOf(str, start);
      System.out.format("%s.lastIndexOf(%s,%s) = %s\n",
          srch, str, start, find_pos);
    }
 
    System.out.println();
 
    // repeated indexOf calls to find all substrings:
    // you may want to use different strings here for testing:
    // srch = ...
    // str = ...
 
    System.out.printf("%s found in %s at positions:\n", str, srch);
    int search_start = 0;
    do {
      int index = srch.indexOf(str, search_start);
      if (index == -1) {
        break;
      }
      System.out.println(index);
      search_start = index + 1;
    }
    while (true);
  }
}

Repeated Search

Let's focus on the last section of the code:
    // srch = ...
    // str = ...
 
    System.out.printf("%s found in %s at positions:\n", str, srch);
    int search_start = 0;
    do {
      int index = srch.indexOf(str, search_start);
      if (index == -1) {
        break;
      }
      System.out.println(index);
      search_start = index + 1;
      //search_start = index + str.length();
    }
    while (true);
Running it with the given inputs
srch = "0123abc78abc";
str = "abc";
produces the expected result:
abc found in 0123abc78abc at positions:
4
9
Suppose we try these inputs:
srch = "aaaaaaa";
str = "aaa";
This produces the result:
aaa found in aaaaaaa at positions:
0
1
2
3
4
The outcome reflects advancing the search_start index only one position up from the previous match. In some situations, we want to exclude overlapped searches, in which case we would want to advance the search_start index past the matched substring. To do so, change the update statement to:
      search_start = index + str.length();
With this in place we would get a new outcome:
aaa found in aaaaaaa at positions:
0
3
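Both update rules can be wrapped into one function that collects the positions instead of printing them. This is a sketch under our own naming (the class FindAll and the allowOverlap parameter are hypothetical); it also guards against an empty search string, which would otherwise never advance:

```java
import java.util.ArrayList;
import java.util.List;

public class FindAll {

  // return the start positions of every occurrence of str in srch
  public static List<Integer> findAll(String srch, String str, boolean allowOverlap) {
    List<Integer> positions = new ArrayList<>();
    if (str.isEmpty()) {
      return positions;  // an empty search string would loop forever
    }
    // advance by 1 to allow overlaps, or past the whole match to exclude them
    int step = allowOverlap ? 1 : str.length();
    int searchStart = 0;
    while (true) {
      int index = srch.indexOf(str, searchStart);
      if (index == -1) {
        break;
      }
      positions.add(index);
      searchStart = index + step;
    }
    return positions;
  }

  public static void main(String[] args) {
    System.out.println(findAll("aaaaaaa", "aaa", true));   // [0, 1, 2, 3, 4]
    System.out.println(findAll("aaaaaaa", "aaa", false));  // [0, 3]
    System.out.println(findAll("0123abc78abc", "abc", true));
  }
}
```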

SubstringSearch Example

This section describes the GUI program SubstringSearch based on using the two-argument indexOf function on strings to sequentially highlight all substrings of a given search string found in the text of The Tell-Tale Heart by Edgar Allan Poe. Run:
textprocessing.SubstringSearch
To use this application:

GUI Frame Construction

The layout used in this frame is a BorderLayout which gives the application the ability to automatically resize the textarea when the application is resized. The top panel containing the main controls expands horizontally and the content area in the center expands both horizontally and vertically.

Here are the details of how to construct the SearchFrame used for this application:
  1. Create a new JFrame Form in the package views with class name SearchFrame.
  2. Right-click on the frame and select Set Layout ⇾ Border Layout.
  3. From Swing Controls in the Palette, drag a TextArea into the middle of the frame. It should expand and take up the entire frame.
  4. From Swing Containers in the Palette, drag a Panel into the frame at the top of the frame. The layout should "open up" and allow you to drop the Panel into its own area at the top.
  5. Double-click on the newly added Panel to isolate it.
  6. Drag and drop onto the Panel: a TextField, 2 Buttons. Reduce the height of the Panel.
  7. From the Navigator window, double-click to select the JFrame.
  8. Right-click on the TextArea, changing its variable name to target.
  9. Right-click on the TextField, changing its variable name to search, and deleting the initial text.
  10. Change the variable names of the buttons to be find and reset. Then change the respective texts to be Find and Reset.
  11. Go into source mode. Add these imports and interface functions as indicated:

    views.SearchFrame
    package views;
     
    import java.awt.Color;
    import javax.swing.JButton;
    import javax.swing.JTextArea;
    import javax.swing.JTextField;
    import javax.swing.text.BadLocationException;
    import javax.swing.text.DefaultHighlighter;
    import javax.swing.text.Highlighter;
     
    public class SearchFrame extends javax.swing.JFrame {
     
      public JTextArea getTargetTextArea() { return target; }
     
      public JTextField getSearchTextField() { return search; }
     
      public JButton getFindButton() { return find; }
     
      public JButton getResetButton() { return reset; }
     
      private final Color highlightColor = Color.decode("#66ffcc");
     
      private final Highlighter.HighlightPainter myPainter 
        = new DefaultHighlighter.DefaultHighlightPainter(highlightColor);
     
      public void setHighlights(int start, int end) {
        try {
          Highlighter highlighter = target.getHighlighter();
          highlighter.addHighlight(start, end, myPainter);
          target.setCaretPosition(start);
        }
        catch (BadLocationException e) {
          e.printStackTrace(System.err);
          System.exit(1);
        }
      }
     
      public void clearHighlights() {
        target.getHighlighter().removeAllHighlights();
      }
     
      ...
    }
The GUI employs a very specialized Swing object named myPainter to colorize the area between two positions in a textarea. The work is done in the setHighlights member. You should ignore the details for now.

Controller Class

The controller class follows the ideas of the last sample code in the previous section to find a sequence of all substrings. The difference is that the sequencing is event-driven by clicking the Find button.

textprocessing.SubstringSearch
package textprocessing;
 
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.swing.JOptionPane;
import views.SearchFrame;
 
public class SubstringSearch {
 
  private final SearchFrame frame = new SearchFrame();
 
  private int search_start = 0;
 
  private String search_string;
 
  public SubstringSearch() throws IOException {
    frame.setTitle(getClass().getSimpleName());
    frame.setSize(750, 500);
    frame.setLocationRelativeTo(null);
 
    String load_file = "tell-tale-heart.txt";
    String content = new String(Files.readAllBytes(Paths.get(load_file)));
 
    // set initial content from file content
    frame.getTargetTextArea().setText(content);
 
    frame.getFindButton().addActionListener(new ActionListener() {
      @Override
      public void actionPerformed(ActionEvent e) {
        frame.clearHighlights();
 
        // read search_string from textarea, avoid case issues
        search_string = frame.getTargetTextArea().getText().toLowerCase();
 
        // read search_for from textfield, avoid case issues
        String search_for = frame.getSearchTextField().getText().toLowerCase();
 
        int find_pos = search_string.indexOf(search_for, search_start);
        if (find_pos == -1) {
          JOptionPane.showMessageDialog(frame, "string not found");
        }
        else {
          frame.setHighlights(find_pos, find_pos + search_for.length());
          search_start = find_pos + 1;
        }
      }
    });
 
    frame.getResetButton().addActionListener(new ActionListener() {
      @Override
      public void actionPerformed(ActionEvent e) {
        frame.clearHighlights();
        // scroll textarea back to the top
        frame.getTargetTextArea().setCaretPosition(0);
        // reset search_string to pick up changes, and reset start position
        search_string = frame.getTargetTextArea().getText().toLowerCase();
        search_start = 0;
      }
    });
  }
 
  public static void main(String[] args) throws IOException {
    SubstringSearch app = new SubstringSearch();
    app.frame.setVisible(true);
  }
}

Sophisticated Java Pattern Matching

The Pattern and Matcher classes

For more sophisticated pattern search operations, Java uses the classes Pattern and Matcher in the package
java.util.regex
These two classes are combined as follows:
String pattern_string = /* some regular expression */;
String search_string  = /* the string to search for pattern matches */;
 
Pattern pattern = Pattern.compile(pattern_string);
Matcher matcher = pattern.matcher(search_string);
The Pattern.compile operation can use a second parameter to specify other features of the intended matching operation. A common usage of the second parameter is to ensure that matches are case-insensitive by defining:
Pattern pattern 
  = Pattern.compile(pattern_string, Pattern.CASE_INSENSITIVE);
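The effect of the flag is easy to observe with a short test. In this sketch the class name CaseInsensitiveDemo, the found helper, and the sample text are our own:

```java
import java.util.regex.Pattern;

public class CaseInsensitiveDemo {

  // true if regex matches somewhere in text, optionally ignoring case
  public static boolean found(String regex, String text, boolean ignoreCase) {
    Pattern pattern = ignoreCase
        ? Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
        : Pattern.compile(regex);
    return pattern.matcher(text).find();
  }

  public static void main(String[] args) {
    String text = "The Tell-Tale HEART";
    System.out.println(found("heart", text, false));  // false: case differs
    System.out.println(found("heart", text, true));   // true: case ignored
  }
}
```

This is an alternative to the approach of converting both strings with toLowerCase before searching.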

Finding partial matches

The matching operation is initiated by the following call, which searches for a partial match of the pattern anywhere within the string:
matcher.find()
One useful feature of the matcher.find() is that it tells the start and end position of the substring found to match:
int start = matcher.start();
int end = matcher.end();
Another important feature is that it keeps track of the last position found so that a subsequent call will search from that last position. Thus we can use the following loop to find all matches within the search string:
while (matcher.find()) {
  int begin = matcher.start();
  int end = matcher.end();
}
A final important feature (which we don't use in our example) is the ability to obtain the matched portion of the string through the group method:
String matched_portion = matcher.group();
Here is a simple test program which you can drop into a temporary new Java Main Class. After you do so, run Fix Imports to add the necessary Java imports.

PatternMatchTest
package examples;
 
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class PatternMatchTest {
 
  public static void main(String[] args) {
    String search_string = "1234abc89zw....";
    String pattern_string = "[a-z]+";
 
//    search_string = "...";
//    pattern_string = "...";
 
    System.out.println("search string : " + search_string);
    System.out.println("pattern string: " + pattern_string);
 
    System.out.println("");
 
    Pattern pattern = Pattern.compile(pattern_string);
    Matcher matcher = pattern.matcher(search_string);
    while (matcher.find()) {
      int start = matcher.start();
      int end = matcher.end();
      String matched = matcher.group();
      System.out.format("matched=%s,start=%d,end=%d\n", matched, start, end);
    }
  }
}
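To underline what "partial match" means, here is a minimal sketch contrasting find() with matches(), which succeeds only when the pattern matches the entire string. The class name PartialVsFullMatchSketch and its two helper methods are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

public class PartialVsFullMatchSketch {

  // true if the pattern matches some substring of the text
  static boolean foundAnywhere(String regex, String text) {
    return Pattern.compile(regex).matcher(text).find();
  }

  // true only if the pattern matches the entire text
  static boolean matchesWhole(String regex, String text) {
    return Pattern.compile(regex).matcher(text).matches();
  }

  public static void main(String[] args) {
    String regex = "[a-z]+";
    System.out.println(foundAnywhere(regex, "1234abc89")); // prints true
    System.out.println(matchesWhole(regex, "1234abc89"));  // prints false
    System.out.println(matchesWhole(regex, "abc"));        // prints true
  }
}
```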

Repeated Matches

Here the key structure for repeated searching is this:
    Pattern pattern = Pattern.compile(pattern_string);
    Matcher matcher = pattern.matcher(search_string);
    while (matcher.find()) {
      String matched = matcher.group();  // matched substring
    }
Try the same experiment we did for substring search, by resetting:
    search_string = "aaaaaaa";
    pattern_string = "aaa";
The result is:
search string : aaaaaaa
pattern string: aaa

matched=aaa,start=0,end=3
matched=aaa,start=3,end=6
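
The output shows that successive find() calls report non-overlapping matches, because each search resumes where the previous match ended. If overlapping matches are wanted, one possible approach (a sketch, not something the TextProcessing project does) is the one-argument form matcher.find(int), which resets the matcher and searches from the given index. The class name OverlappingMatchSketch and the method overlappingStarts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlappingMatchSketch {

  // Hypothetical helper: returns the start index of every match,
  // including matches that overlap an earlier one.
  static List<Integer> overlappingStarts(String regex, String text) {
    Matcher matcher = Pattern.compile(regex).matcher(text);
    List<Integer> starts = new ArrayList<>();
    int from = 0;
    // find(from) throws IndexOutOfBoundsException if from > text.length(),
    // so the index is checked before each call
    while (from <= text.length() && matcher.find(from)) {
      starts.add(matcher.start());
      from = matcher.start() + 1; // restart one character past the last start
    }
    return starts;
  }

  public static void main(String[] args) {
    System.out.println(overlappingStarts("aaa", "aaaaaaa")); // prints [0, 1, 2, 3, 4]
  }
}
```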

RegexSearch Example

This section describes the GUI program RegexSearch. It uses the exact same frame as SubstringSearch; the difference is that the text read from the search field is interpreted as a regular expression, not as a literal substring. Thus the classes involved are:
textprocessing.RegexSearch
views.SearchFrame
This example thereby illustrates the value of making the GUI frame "reusable," by keeping the event handler code out of the view class code. The controller code is different, but the GUI frame is identical.

Run the application

The application is meant to be similar to (although not as slick as) the web application RegExr Learning Tool. Here is the controller class:

textprocessing.RegexSearch
package textprocessing;
 
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.JOptionPane;
 
import views.SearchFrame;
 
public class RegexSearch {
 
  private final SearchFrame frame = new SearchFrame();
 
  private String search_string;
 
  public RegexSearch() throws IOException {
    frame.setTitle(getClass().getSimpleName());
    frame.setSize(600, 500);
    frame.setLocationRelativeTo(null);
 
    String load_file = "testing.txt";
    String content = new String(Files.readAllBytes(Paths.get(load_file)));
 
    // set initial content from file content
    frame.getTargetTextArea().setText(content);
 
    frame.getFindButton().addActionListener(new ActionListener() {
      @Override
      public void actionPerformed(ActionEvent ae) {
        frame.clearHighlights();
 
        // read search_string from textarea
        search_string = frame.getTargetTextArea().getText();
 
        // search_for is a regular expression string
        String search_for = frame.getSearchTextField().getText(); 
 
        if (search_for.isEmpty()) {
          JOptionPane.showMessageDialog(frame, "search string cannot be empty");
          return;
        }
 
        Pattern pattern = Pattern.compile(search_for);
        Matcher matcher = pattern.matcher(search_string);
 
        // highlight all matched substrings
        while (matcher.find()) {
          int begin = matcher.start();
          int end = matcher.end();
 
          frame.setHighlights(begin, end);
        }
      }
    });
 
    frame.getResetButton().addActionListener(new ActionListener() {
      @Override
      public void actionPerformed(ActionEvent e) {
        frame.clearHighlights();
        // scroll textarea back to the top
        frame.getTargetTextArea().setCaretPosition(0);
        // reset search_string to pick up changes
        search_string = frame.getTargetTextArea().getText();
      }
    });
 
  }
 
  public static void main(String[] args) throws IOException {
    RegexSearch app = new RegexSearch();
    app.frame.setVisible(true);
  }
}
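
Note that Pattern.compile throws an unchecked PatternSyntaxException when the text in the search field is not a valid regular expression, for example an unclosed character class such as [a-z. The controller above does not catch this exception. A minimal sketch of one way to guard the compile step follows; the class name SafeCompileSketch and the method compileOrNull are hypothetical, and in the GUI the message could be shown with JOptionPane rather than printed.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SafeCompileSketch {

  // Hypothetical helper: returns the compiled pattern,
  // or null if regex is not a valid regular expression.
  static Pattern compileOrNull(String regex) {
    try {
      return Pattern.compile(regex);
    } catch (PatternSyntaxException ex) {
      // getDescription() gives a short explanation of the syntax error
      System.out.println("invalid regular expression: " + ex.getDescription());
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(compileOrNull("[a-z]+") != null); // prints true
    System.out.println(compileOrNull("[a-z") != null);   // prints an error message, then false
  }
}
```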

Sub-pattern extraction

Parentheses can be used in regular expression matching to identify sub-matches of a matched expression. For example, consider this test program:

examples.PatternExtractTest
package examples;
 
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class PatternExtractTest {
 
  public static void main(String[] args) {
    String search_string = "aabb cc3333 ddd 2223e44  fff5gg  hh";
    String pattern_string = "([a-z]+)(\\d+)";
 
    Pattern pattern = Pattern.compile(pattern_string);
    Matcher matcher = pattern.matcher(search_string);
 
    while (matcher.find()) {
      System.out.format("matched substring: %s\n", matcher.group());
      System.out.format(
              "matched sub-patterns: %s,%s\n", matcher.group(1), matcher.group(2));
      System.out.println("-----------------------");
    }
  }
}
The output is
matched substring: cc3333
matched sub-patterns: cc,3333
-----------------------
matched substring: e44
matched sub-patterns: e,44
-----------------------
matched substring: fff5
matched sub-patterns: fff,5
-----------------------
The new thing we're seeing here is
matcher.group(N)
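for some integer N. Here N refers to the N-th parenthesized group within the full pattern, numbered from 1 by the position of its opening parenthesis; matcher.group(0) is the same as matcher.group(). In this case the match pattern is: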
([a-z]+)(\d+)
Ignoring the parentheses, the pattern matches a sequence of one or more lower-case letters followed by a sequence of one or more digits. The parentheses mark each part of the match as a group, which we can then extract from the full matched substring.
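
When the number of groups is not known in advance, matcher.groupCount() reports how many parenthesized groups the pattern contains (group 0, the whole match, is not included in that count). The following sketch collects every group of the first match; the class name GroupCountSketch and the method groupsOfFirstMatch are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountSketch {

  // Hypothetical helper: returns group 0 (the whole match) followed by
  // groups 1..groupCount() of the first match, or an empty list if no match.
  static List<String> groupsOfFirstMatch(String regex, String text) {
    Matcher matcher = Pattern.compile(regex).matcher(text);
    List<String> groups = new ArrayList<>();
    if (matcher.find()) {
      for (int n = 0; n <= matcher.groupCount(); n++) {
        groups.add(matcher.group(n));
      }
    }
    return groups;
  }

  public static void main(String[] args) {
    System.out.println(groupsOfFirstMatch("([a-z]+)(\\d+)", "aabb cc3333 ddd"));
    // prints [cc3333, cc, 3333]
  }
}
```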


© Robert M. Kline