Monday 3 September 2012

Java Regular Expressions

Regular Expressions (Regex) are a really powerful tool for doing string manipulation and searching but they can be complicated!  Here are some basics.

Pattern

Creating a Pattern object in Java is a prerequisite to getting Regex going.  This can be created using code such as

    Pattern pattern = Pattern.compile(<Regex>);
    Matcher matcher = pattern.matcher(<String to check>);
    if (matcher.matches())
    {
        // Do something
    }

or
    while (matcher.find())
    {
        String found = matcher.group()
        // Do something
    }

Basic Regex

Regular expressions normally include some of the following


  • [0-9] = any of 0, 1, ..., 9 - any number of times
  • [0-9a-zA-Z] = any of 0, 1, ..., 9, a, b, ..., z, A, B, ..., Z - any number of times
  • [0-9]{3} = any of 0, 1, ..., 9 exactly 3 times
  • [0-9]{3,6} = any of 0, 1, ..., 9 between 3 and 6 times
  • [^0-9]  = any character that isn't 0, 1, ..., 9 eg negation
  • . = any character one time
  • .* = any character any number of times (0 or more)
  • .+ = any character at least once (1 or more)
  • \w = any word character
  • \d = any digit character
  • \s = whitespace character

Some characters need to be escaped if you want to use them in the regex itself.  These are
  • {}[]().*\

Groups

Groups can be a really useful and powerful concept.  If you want to replace elements of the regex but keep some bits then groups can help you. Placing parts of the regex in normal brackets will mean that they are available as a group after the match.  For example

    regex = "(hello)(\s*world)";

will match on 'hello world' but the elements are available in group $1 and $2.  They can be obtained from the matcher (see above) using

    String firstGroup = matcher.group(1);

And similarly with a replace request such as,

    matcher.replaceAll("Goodbye$2");


would replace the matched string with Goodbye and then whatever is in group 2, which is ' world'.  This way matcher can replace elements of what has been matched.


Examples


Replace all non-standard characters

    Pattern pattern = Pattern.compile("[^a-zA-Z0-9]");
    Matcher matcher = pattern.matcher(myWeirdString);
    matcher.replaceAll("");

Match all non-standard characters using the Pattern and Matcher.  Then just call replace all to get rid of everything matched!


Thread Safety

Pattern objects are thread safe but matcher objects aren't!  From the doc...

Instances of this (Pattern) class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.

Online Help!

These tools are really useful for checking regular expressions!
https://regex101.com/
http://rubular.com/


No comments:

Post a Comment