Consult Table 4-2 for a list of the regular expression characters that the Apache Regular Expression API matches.
Table 4-2. Regular expression syntax
Subexpression | Will match: | Notes |
---|---|---|
General | | |
`a` | The letter a | |
`^` | Start of line/string | |
`$` | End of line/string | |
`.` | Any one character | |
`[...]` | “Character class”; any one character from those listed | |
`[^...]` | Any one character not from those listed | |
Normal (greedy) multipliers (“greedy closures”) | | |
`{m,n}` | Multiplier (closure) for m to n repetitions | |
`{m,}` | Multiplier for m or more repetitions | |
`{,n}` | Multiplier for 0 up to n repetitions | |
`*` | Multiplier for 0 or more repetitions | Short for `{0,}` |
`+` | Multiplier for 1 or more repetitions | Short for `{1,}` |
`?` | Multiplier for 0 or 1 repetitions | Short for `{0,1}` |
Reluctant (non-greedy) multipliers (“reluctant closures”) | | |
`*?` | Reluctant multiplier: 0 or more | |
`+?` | Reluctant multiplier: 1 or more | |
`??` | Reluctant multiplier: 0 or 1 times | |
Alternation and grouping | | |
`( )` | Grouping | |
`\|` | Alternation | |
Escapes and shorthands | | |
`\` | Escape character: turns metacharacters off, and turns the alphabetics that follow it (`t`, `w`, `d`, `s`, and so on) into metacharacters | |
`\t` | Tab character | |
`\w` | Character in a word | Use `\w+` for a word |
`\d` | Numeric digit | Use `\d+` for an integer |
`\s` | Whitespace | Space, tab, etc., as determined by `java.lang.Character.isWhitespace()` |
`\W`, `\D`, `\S` | Inverse of above (`\W` is a non-word character, etc.) | |
POSIX-style character classes | | |
`[:alnum:]` | Alphanumeric characters | |
`[:alpha:]` | Alphabetic characters | |
`[:blank:]` | Space and tab characters | |
`[:space:]` | Space characters | |
`[:cntrl:]` | Control characters | |
`[:digit:]` | Numeric digit characters | |
`[:graph:]` | Printable and visible characters (not spaces) | |
`[:print:]` | Printable characters | |
`[:punct:]` | Punctuation characters | |
`[:lower:]` | Lowercase characters | |
`[:upper:]` | Uppercase characters | |
`[:xdigit:]` | Hexadecimal digit characters | |
`[:javastart:]` | Start of a Java language identifier | Not in POSIX |
`[:javapart:]` | Part of a Java identifier | Not in POSIX |
These pattern characters can be used in any combination that makes sense. For example, `a+` means any number of occurrences of the letter a, from one up to a million or a gazillion. The pattern `Mrs?\.` matches Mr. or Mrs. And `.*` means “any character, any number of times,” which is similar to the meaning of `*` in most command-line interpreters.
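The three patterns just described can be tried out with a few lines of code. The following is only a sketch: it assumes the standard `java.util.regex` package as a stand-in for the Apache package (which needs an external JAR), since both treat `a+`, `Mrs?\.`, and `.*` the same way. The class name `PatternExamples` and the helper `found` are made up for this illustration.

```java
import java.util.regex.Pattern;

// Assumption: java.util.regex is used here in place of the Apache package.
class PatternExamples {

    /** Returns true if the regex matches somewhere in the input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // a+ : one or more occurrences of the letter a
        System.out.println(found("a+", "banana"));          // true
        System.out.println(found("a+", "xyz"));             // false

        // Mrs?\. : the s is optional; the escaped dot is a literal period
        System.out.println(found("Mrs?\\.", "Mr. Smith"));  // true
        System.out.println(found("Mrs?\\.", "Mrs. Smith")); // true
        System.out.println(found("Mrs?\\.", "Mrs Smith"));  // false

        // .* : any character, any number of times (including zero)
        System.out.println(Pattern.matches(".*", "anything at all")); // true
        System.out.println(Pattern.matches(".*", ""));                // true
    }
}
```

Note that the backslash must be doubled inside a Java string literal, so the pattern `Mrs?\.` is written as `"Mrs?\\."` in source code.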
It’s important to remember that REs will match anyplace possible in the input, and that patterns ending in a greedy closure will consume as much as possible without compromising any other subexpressions.
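A short experiment makes both points concrete. This sketch again assumes `java.util.regex` as a stand-in for the Apache package; the class `GreedyDemo` and the helper `firstMatch` are hypothetical names. It shows a match starting in the middle of the input, a trailing greedy closure taking everything it can, and a greedy `.*` giving back only enough for the rest of the pattern to succeed. The last line uses a reluctant closure from Table 4-2 for contrast.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumption: java.util.regex is used here in place of the Apache package.
class GreedyDemo {

    /** Returns the first substring matched by regex in input, or null if none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The match may begin anywhere in the input, not only at the start.
        System.out.println(firstMatch("b+", "aaabbbccc"));      // bbb

        // A pattern ending in a greedy closure takes as much as it can.
        System.out.println(firstMatch("ab*", "xabbbby"));       // abbbb

        // Greedy .* backs off just far enough for the final c to match.
        System.out.println(firstMatch("a.*c", "xxabcabcxx"));   // abcabc

        // The reluctant form stops at the first c instead.
        System.out.println(firstMatch("a.*?c", "xxabcabcxx"));  // abc
    }
}
```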
Also, unlike some RE packages, the Apache package was designed to handle Unicode characters from the beginning. This support came essentially for free, as its basic units are the Java char and String types, which are Unicode-based. In fact, the standard Java escape sequence \unnnn is used to specify a Unicode character in the pattern. And we use methods of java.lang.Character to determine Unicode character properties, such as whether or not a given character is a space.
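As a rough illustration of the Unicode points, the sketch below matches a non-ASCII character written as a Unicode escape and calls a few `java.lang.Character` methods directly. It assumes `java.util.regex` as a stand-in for the Apache package (it also accepts a four-digit Unicode escape inside a pattern), and the class `UnicodeDemo` and helper `matchesAll` are invented for this example.

```java
import java.util.regex.Pattern;

// Assumption: java.util.regex is used here in place of the Apache package.
class UnicodeDemo {

    /** Returns true if the whole input matches the regex. */
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // In the pattern the backslash is doubled, so the regex engine itself
        // resolves the escape; in the input string the compiler resolves it.
        System.out.println(matchesAll("caf\\u00e9", "caf\u00e9"));   // true

        // Character methods classify any Unicode character, not just ASCII.
        System.out.println(Character.isWhitespace('\t'));            // true
        System.out.println(Character.isWhitespace('\u2003'));        // true (em space)
        System.out.println(Character.isWhitespace('x'));             // false

        // The same class answers the questions behind the Java identifier rows.
        System.out.println(Character.isJavaIdentifierStart('\u00e9')); // true
        System.out.println(Character.isJavaIdentifierStart('1'));      // false
        System.out.println(Character.isJavaIdentifierPart('1'));       // true
    }
}
```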