Suppose you have been on the Internet for a few years and have been very faithful about saving all your correspondence, just in case you (or your lawyers, or the prosecution) need a copy. The result is that you have a 50-megabyte disk partition dedicated to saved mail. And let’s further suppose that you remember that there is one letter, somewhere in there, from someone named Angie or Anjie. Or was it Angy? But you don’t remember what you called it or where you stored it. Obviously, you will have to go look for it.
But while some of you go and try to open up all 15,000,000 documents in a word processor, I’ll just find it with one simple command. Any system that provides regular expression support will allow me to search for the pattern:
An[^ dn]
in all the files. The “A” and the “n” match themselves, in effect finding words that begin with “An”, while the cryptic [^ dn] requires the “An” to be followed by a character other than a space (to eliminate the very common English word “an” at the start of a sentence) or “d” (to eliminate the common word “and”) or “n” (to eliminate Anne, Announcing, etc.). Has your word processor gotten past its splash screen yet? Well, it doesn’t matter, because I’ve already found the missing file. To find the answer, I just typed the command:[14]
grep 'An[^ dn]' *
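To see what this pattern accepts and rejects, here is a minimal sketch in Java. It assumes the java.util.regex package, which is part of later Java releases rather than the Java 2 platform discussed in this chapter; the class name AnPatternDemo, the method matchingLines, and the sample lines are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class AnPatternDemo {
    // The pattern from the text: "An" followed by one character
    // that is not a space, 'd', or 'n'. (Assumes java.util.regex.)
    static final Pattern NAME = Pattern.compile("An[^ dn]");

    /** Returns the lines containing a match, much as grep would print them. */
    static List<String> matchingLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            // find() looks for the pattern anywhere in the line
            if (NAME.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real mail
        List<String> sample = Arrays.asList(
            "Dear Angie, thanks for the note.",
            "An apple a day",
            "And then there were none",
            "Anne is announcing the winner",
            "Signed, Anjie");
        for (String hit : matchingLines(sample)) {
            System.out.println(hit);
        }
    }
}
```

Only the first and last sample lines are printed. The pattern is a rough filter rather than a name matcher: it also accepts words such as “Another” or “Any”, and it rejects an “An” that ends a line, because the bracket expression must consume one character.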
Regular expressions, or REs for short, provide a concise and precise specification of patterns to be matched in text. Java 2 did not include any facilities for describing regular expressions in text. This is mildly surprising given how powerful regular expressions are, how ubiquitous they are on the Unix operating system where Java was first brewed, and how central they are to modern scripting languages and tools like sed, awk, Python, and Perl.
At any rate, there were no RE packages for Java when I first learned the language, and because of this, I wrote my own RE package. More recently, I had planned to submit a JSR[15] to Sun Microsystems, proposing to add to Java a regular expressions API similar to the one used in this chapter. However, the Apache Jakarta Regular Expressions project[16] has achieved sufficient momentum to become nearly a standard, but without the politics and meetings required of a JSR. Accordingly, my JSR has not been submitted yet. Conveniently, the Jakarta folk used a similar syntax to mine, so I was mostly able to migrate to theirs just by changing the imports. However, the Apache code is vastly more efficient than mine and should be used whenever possible. Mine was written for pedagogical display, and compiles the RE into an array of SubExpression objects. The Jakarta package, borrowing a trick from Java,[17] compiles to an array of integer commands, making it run much faster: around a factor of 3 or 4, even for simple cases like searching for the string “java” in a few dozen files. There are in fact a half dozen or so regular expression packages for Java; see Table 4-1.
Table 4-1. Java RE packages
Package | Notes | URL |
---|---|---|
Richard Emberson’s | Unknown license; not being maintained. | None; posted to advanced-java@berkeley.edu |
Ian Darwin’s RE | Simple, but SLOW. Incomplete; didactic. | |
Apache Jakarta RegExp (original by Jonathan Locke) | Apache (BSD-like) license. | |
Apache Jakarta ORO | Apache license. More comprehensive? | |
Daniel Savarese | Unknown. | |
“GNU Java Regexp” | GPL; fairly fast. | http://www.gjt.org (Giant Java Tree) |
The syntax of REs themselves is discussed in Section 4.2, hints on using them in Section 4.3, and the syntax of the Java API for using REs in Section 4.4.
[14] Non-Unix fans rejoice, for you can do this on Win32 using a package alternately called CygWin (after Cygnus Software) or GnuWin32 (http://sources.redhat.com/cygwin/). Or you can use my Grep program in Section 4.9 if you don’t have grep on your system. Incidentally, the name grep comes from an ancient Unix line editor command g/RE/p, the command to globally find the RE (regular expression) in all lines in the edit buffer and print the lines that match: just what the grep program does to lines in files.
[15] A JSR is a Java Specification Request, the process by which new standards are submitted by the Java Community and discussed in public prior to adoption. See Sun’s Java Community web site (http://developer.java.sun.com/developer/community/).
[16] Apache has, in fact, two regular expressions packages. The second, Oro, provides full Perl5-style regular expressions, AWK-like regular expressions, glob expressions, and utility classes for performing substitutions, splits, filtering filenames, etc. This library is the successor to the OROMatcher, AwkTools, PerlTools, and TextTools libraries from ORO, Inc. (http://www.oroinc.com).
[17] Java perhaps got the idea from the UCSD P-system, which used portable bytecodes in the early 1980s and ran on all the popular microcomputers of the day.