Regular Expression Pocket Reference

By Tony Stubblebine

Chapter 1: Regular Expression Pocket Reference
Regular expressions (known as regexps or regexes) are a way to describe text through pattern matching. You might want to use regular expressions to validate data, to pull pieces of text out of larger blocks, or to substitute new text for old text.
Regular expression syntax defines a language you use to describe text. Today, regular expressions are included in most programming languages as well as many scripting languages, editors, applications, databases, and command-line tools. This book aims to give quick access to the syntax and pattern-matching operations of the most popular of these languages.
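The three uses named above (validating data, pulling text out of a larger block, and substituting new text for old) can be sketched with Python's re module. This is a minimal illustration, not an example from the book; the sample string and the variable names are hypothetical.

```python
import re

# Hypothetical sample text, invented for this sketch.
text = "Order 1234 shipped on 2003-08-15 to alice@example.com"

# Validate: does the entire string look like a YYYY-MM-DD date?
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2003-08-15") is not None

# Extract: pull the first run of digits out of the larger block of text.
order_number = re.search(r"\d+", text).group()

# Substitute: replace every digit with a '#' character.
masked = re.sub(r"\d", "#", text)

print(is_date, order_number, masked)
```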
About This Book
This book starts with a general introduction to regular expressions. The first section of this book describes and defines the constructs used in regular expressions and establishes the common principles of pattern matching. The remaining sections of the book are devoted to the syntax, features, and usage of regular expressions in various implementations.
The implementations covered in this book are Perl, Java, .NET and C#, Python, PCRE, PHP, the vi editor, JavaScript, and shell tools.
The following typographical conventions are used in this book:
Italic
    Used for emphasis, new terms, program names, and URLs
Constant width
    Used for options, values, code fragments, and any text that should be typed literally
Constant width italic
    Used for text that should be replaced with user-supplied values
The world of regular expressions is complex and filled with nuance. Jeffrey Friedl has written the definitive work on the subject, Mastering Regular Expressions (O'Reilly), a work on which I relied heavily when writing this book. As a convenience, this book provides page references to Mastering Regular Expressions, Second Edition (MRE) for expanded discussion of regular expression syntax and concepts.
This book simply would not have been written if Jeffrey Friedl had not blazed a trail ahead of me. Additionally, I owe him many thanks for allowing me to reuse the structure of his book and for his suggestions on improving this book. Nat Torkington's early guidance raised the bar for this book. Philip Hazel, Ron Hitchens, A.M. Kuchling, and Brad Merrill reviewed individual chapters. Linda Mui saved my sanity and this book. Tim Allwine's constant regex questions helped me solidify my knowledge of this topic. Thanks to Schuyler Erle and David Lents for letting me bounce ideas off of them. Lastly, many thanks to Sarah Burcham for her contributions to Section 1.11 and for providing the inspiration and opportunity to work and write for O'Reilly.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Introduction to Regexes and Pattern Matching
A regular expression is a string containing a combination of normal characters and special metacharacters or metasequences. The normal characters match themselves. Metacharacters and metasequences are characters or sequences of characters that represent ideas such as quantity, locations, or types of characters. The list in Section 1.2.1 shows the most common metacharacters and metasequences in the regular expression world. Later sections list the availability of and syntax for supported metacharacters for particular implementations of regular expressions.
Pattern matching consists of finding a section of text that is described (matched) by a regular expression. The underlying code that searches the text is the regular expression engine. You can guess the results of most matches by keeping two rules in mind:
  1. The earliest (leftmost) match wins
    Regular expressions are applied to the input starting at the first character and proceeding toward the last. As soon as the regular expression engine finds a match, it returns. (See MRE 148-149, 177-179.)
  2. Standard quantifiers are greedy
    Quantifiers specify how many times something can be repeated. The standard quantifiers attempt to match as many times as possible. They settle for less than the maximum only if this is necessary for the success of the match. The process of giving up characters and trying less-greedy matches is called backtracking. (See MRE 151-153.)
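Both rules can be observed with a short sketch in Python, whose re module uses a Traditional NFA engine. The sample strings and variable names are chosen only for illustration.

```python
import re

# Rule 1: the earliest (leftmost) match wins. "belly" appears first in the
# text, so it is returned even though "fat" is listed first in the pattern.
leftmost = re.search(
    r"fat|cat|belly|your",
    "The dragging belly indicates your cat is too fat",
).group()

# Rule 2: standard quantifiers are greedy. The .+ runs to the last '>'
# it can reach rather than stopping at the first one.
greedy = re.search(r"<.+>", "<b>bold</b> and <i>italic</i>").group()

# Backtracking: \d+ first takes all five digits, then gives one back so
# that the final \d has something to match.
backtrack = re.search(r"(\d+)(\d)", "12345").groups()

print(leftmost)   # belly
print(greedy)     # <b>bold</b> and <i>italic</i>
print(backtrack)  # ('1234', '5')
```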
Regular expression engines have subtle differences based on their type. There are two classes of engines: Deterministic Finite Automaton (DFA) and Nondeterministic Finite Automaton (NFA). DFAs are faster but lack many of the features of an NFA, such as capturing, lookaround, and non-greedy quantifiers. In the NFA world there are two types: Traditional and POSIX.
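The three NFA-only features mentioned above can be sketched in Python, assuming its Traditional NFA engine is representative. The inputs are invented for the example.

```python
import re

# Capturing: parentheses remember the text matched by each subpattern.
captured = re.search(r"(\w+)@(\w+)", "user@host").groups()

# Lookaround: (?=...) requires " dollars" to follow, without consuming it.
amounts = re.findall(r"\d+(?= dollars)", "10 dollars, 20 euros")

# Non-greedy quantifier: +? stops at the first '>' instead of the last.
lazy = re.search(r"<.+?>", "<b>bold</b>").group()

print(captured)  # ('user', 'host')
print(amounts)   # ['10']
print(lazy)      # <b>
```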
DFA engines
DFAs compare each character of the input string to the regular expression, keeping track of all matches in progress. Since each character is examined at most once, the DFA engine is the fastest. One additional rule to remember with DFAs is that the alternation metasequence is greedy. When more than one option in an alternation (
Additional content appearing in this section has been removed.
Perl 5.8
Perl provides a rich set of regular-expression operators, constructs, and features, with more being added in each new release. Perl uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2.
This reference covers Perl Version 5.8. Unicode features were introduced in 5.6, but did not stabilize until 5.8. Most other features work in Versions 5.004 and later.
Perl supports the metacharacters and metasequences listed in Table 1-3 through Table 1-7. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-3: Character representations
Sequence     Meaning
\a           Alert (bell).
\b           Backspace; supported only in character class.
\e           ESC character, x1B.
\n           Newline; x0A on Unix and Windows, x0D on Mac OS 9.
\r           Carriage return; x0D on Unix and Windows, x0A on Mac OS 9.
\f           Form feed, x0C.
\t           Horizontal tab, x09.
\octal       Character specified by a two- or three-digit octal code.
\xhex        Character specified by a one- or two-digit hexadecimal code.
\x{hex}      Character specified by any hexadecimal code.
\cchar       Named control character.
\N{name}     A named character specified in the Unicode standard or listed in PATH_TO_PERLLIB/unicode/Names.txt. Requires use charnames ':full'.
Table 1-4: Character classes and class-like constructs
Class        Meaning
[...]        A single character listed or contained in a listed range.
[^...]       A single character not listed and not contained within a listed range.
[:class:]    POSIX-style character class valid only within a regex character class.
.            Any character except newline (unless single-line mode, /s).
\C           One byte; however, this may corrupt a Unicode character stream.
\X           Base character followed by any number of Unicode combining characters.
\w           Word character, \p{IsWord}.
\W           Non-word character, \P{IsWord}.
\d           Digit character, \p{IsDigit}.
\D           Non-digit character, \P{IsDigit}.
\s           Whitespace character, \p{IsSpace}.
\S           Non-whitespace character,
Additional content appearing in this section has been removed.
Java (java.util.regex)
Java 1.4 supports regular expressions with Sun's java.util.regex package. Although there are competing packages available for previous versions of Java, Sun's package is poised to become the standard. Sun's package uses a Traditional NFA match engine. For an explanation of the rules behind a Traditional NFA engine, see Section 1.2.
java.util.regex supports the metacharacters and metasequences listed in Table 1-10 through Table 1-14. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-10: Character representations
Sequence     Meaning
\a           Alert (bell).
\b           Backspace, x08, supported only in character class.
\e           ESC character, x1B.
\n           Newline, x0A.
\r           Carriage return, x0D.
\f           Form feed, x0C.
\t           Horizontal tab, x09.
\0octal      Character specified by a one-, two-, or three-digit octal code.
\xhex        Character specified by a two-digit hexadecimal code.
\uhex        Unicode character specified by a four-digit hexadecimal code.
\cchar       Named control character.
Table 1-11: Character classes and class-like constructs
Class        Meaning
[...]        A single character listed or contained in a listed range.
[^...]       A single character not listed and not contained within a listed range.
.            Any character, except a line terminator (unless DOTALL mode).
\w           Word character, [a-zA-Z0-9_].
\W           Non-word character, [^a-zA-Z0-9_].
\d           Digit, [0-9].
\D           Non-digit, [^0-9].
\s           Whitespace character, [ \t\n\f\r\x0B].
\S           Non-whitespace character, [^ \t\n\f\r\x0B].
\p{prop}     Character contained by given POSIX character class, Unicode property, or Unicode block.
\P{prop}     Character not contained by given POSIX character class, Unicode property, or Unicode block.
Table 1-12: Anchors and other zero-width tests
Sequence     Meaning
^            Start of string, or after any newline if in MULTILINE mode.
\A           Beginning of string, in any match mode.
$            End of string, or before any newline if in
Additional content appearing in this section has been removed.
.NET and C#
Microsoft's .NET framework provides a consistent and powerful set of regular expression classes for all .NET implementations. The following sections list the .NET regular expression syntax, the core .NET classes, and C# examples. Microsoft's .NET uses a Traditional NFA match engine. For an explanation of the rules behind a Traditional NFA engine, see Section 1.2.
.NET supports the metacharacters and metasequences listed in Table 1-15 through Table 1-19. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-15: Character representations
Sequence     Meaning
\a           Alert (bell), x07.
\b           Backspace, x08, supported only in character class.
\e           ESC character, x1B.
\n           Newline, x0A.
\r           Carriage return, x0D.
\f           Form feed, x0C.
\t           Horizontal tab, x09.
\v           Vertical tab, x0B.
\0octal      Character specified by a two-digit octal code.
\xhex        Character specified by a two-digit hexadecimal code.
\uhex        Character specified by a four-digit hexadecimal code.
\cchar       Named control character.
Table 1-16: Character classes and class-like constructs
Class        Meaning
[...]        A single character listed or contained within a listed range.
[^...]       A single character not listed and not contained within a listed range.
.            Any character, except a line terminator (unless single-line mode, s).
\w           Word character, [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}] or [a-zA-Z_0-9] in ECMAScript mode.
\W           Non-word character, [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}] or [^a-zA-Z_0-9] in ECMAScript mode.
\d           Digit, \p{Nd} or [0-9] in ECMAScript mode.
\D           Non-digit, \P{Nd} or [^0-9] in ECMAScript mode.
\s           Whitespace character, [ \f\n\r\t\v\x85\p{Z}] or [ \f\n\r\t\v] in ECMAScript mode.
\S           Non-whitespace character, [^ \f\n\r\t\v\x85\p{Z}] or [^ \f\n\r\t\v] in ECMAScript mode.
\p{prop}     Character contained by given Unicode block or property.
\P{prop}     Character not contained by given Unicode block or property.
Table 1-17:
Additional content appearing in this section has been removed.
Python
Python provides a rich, Perl-like regular expression syntax in the re module. The re module uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2.
This chapter covers the version of re included with Python 2.2, although the module has been available in similar form since Python 1.5.
The re module supports the metacharacters and metasequences listed in Table 1-21 through Table 1-25. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-21: Character representations
Sequence       Meaning
\a             Alert (bell), x07.
\b             Backspace, x08, supported only in character class.
\n             Newline, x0A.
\r             Carriage return, x0D.
\f             Form feed, x0C.
\t             Horizontal tab, x09.
\v             Vertical tab, x0B.
\octal         Character specified by up to three octal digits.
\xhh           Character specified by a two-digit hexadecimal code.
\uhhhh         Character specified by a four-digit hexadecimal code.
\Uhhhhhhhh     Character specified by an eight-digit hexadecimal code.
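A few of these character representations can be tried directly. The sketch below assumes a modern Python 3 interpreter (3.3 or later, where the re module itself interprets \u in string patterns) rather than the Python 2.2 release this chapter covers; the variable names are illustrative.

```python
import re

# Raw strings pass the backslash sequences through to the re module,
# which then interprets them as character representations.
tab_found = re.search(r"a\tb", "a\tb") is not None        # horizontal tab
hex_found = re.fullmatch(r"\x41", "A") is not None        # two-digit hex code
octal_found = re.fullmatch(r"\101", "A") is not None      # three octal digits
unicode_found = re.fullmatch(r"\u00e9", "\u00e9") is not None  # four-digit hex code

# \b means backspace only inside a character class.
backspace_found = re.fullmatch(r"[\b]", "\x08") is not None

print(tab_found, hex_found, octal_found, unicode_found, backspace_found)
```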
Table 1-22: Character classes and class-like constructs
Class        Meaning
[...]        Any character listed or contained within a listed range.
[^...]       Any character that is not listed and is not contained within a listed range.
.            Any character, except a newline (unless DOTALL mode).
\w           Word character, [a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\W           Non-word character, [^a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\d           Digit character, [0-9].
\D           Non-digit character, [^0-9].
\s           Whitespace character, [ \t\n\r\f\v].
\S           Non-whitespace character, [^ \t\n\r\f\v].
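The shorthand classes can be exercised as follows. One assumption to note: in Python 3, string patterns are Unicode-aware by default, so the sketch passes re.ASCII to get the ASCII-only sets listed in the table. The sample string is hypothetical.

```python
import re

sample = "id_42 = 3.14\tok"

# \w+ collects runs of word characters; '=' and '.' are not word characters.
words = re.findall(r"\w+", sample, re.ASCII)

# \d+ collects runs of digits.
digits = re.findall(r"\d+", sample, re.ASCII)

# \s+ treats the spaces and the tab alike as separators.
pieces = re.split(r"\s+", sample)

# With re.ASCII an accented letter is not a word character; without it, it is.
ascii_only = re.findall(r"\w", "\u00e9", re.ASCII)
unicode_aware = re.findall(r"\w", "\u00e9")

print(words)   # ['id_42', '3', '14', 'ok']
print(digits)  # ['42', '3', '14']
print(pieces)  # ['id_42', '=', '3.14', 'ok']
```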
Table 1-23: Anchors and zero-width tests
Sequence     Meaning
^            Start of string, or after any newline if in MULTILINE match mode.
\A           Start of search string, in all match modes.
$            End of search string or before a string-ending newline, or before any newline in MULTILINE match mode.
\Z           End of string, in any match mode.
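The effect of MULTILINE mode on ^ and $, and the mode-independence of \A, can be seen in a small sketch. The two-line sample text is invented for the example.

```python
import re

text = "alpha one\nbeta two\n"

# Without MULTILINE, ^ matches only at the start of the string.
start_default = re.findall(r"^\w+", text)

# With MULTILINE, ^ also matches after each newline.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores the match mode and anchors to the start of the string only.
start_absolute = re.findall(r"\A\w+", text, re.MULTILINE)

# Without MULTILINE, $ matches before the string-ending newline.
end_default = re.findall(r"\w+$", text)

# With MULTILINE, $ matches before every newline.
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(start_default, start_multiline, start_absolute)
print(end_default, end_multiline)
```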
Additional content appearing in this section has been removed.
PCRE Lib
The Perl Compatible Regular Expression (PCRE) library is a free-for-any-use, open source regular expression library developed by Philip Hazel. PCRE has been incorporated into PHP, Apache 2.0, KDE, Exim MTA, Analog, and Postfix. Users of those programs can use the supported metacharacters listed in Table 1-26 through Table 1-30.
The PCRE library uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2.
This reference covers PCRE Version 4.0, which aims to emulate Perl 5.8-style regular expressions.
PCRE supports the metacharacters and metasequences listed in Table 1-26 through Table 1-30. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-26: Character representations
Sequence     Meaning
\a           Alert (bell), x07.
\b           Backspace, x08, supported only in character class.
\e           ESC character, x1B.
\n           Newline, x0A.
\r           Carriage return, x0D.
\f           Form feed, x0C.
\t           Horizontal tab, x09.
\octal       Character specified by a three-digit octal code.
\xhex        Character specified by a one- or two-digit hexadecimal code.
\x{hex}      Character specified by any hexadecimal code.
\cchar       Named control character.
Table 1-27: Character classes and class-like constructs
Class        Meaning
[...]        A single character listed or contained in a listed range.
[^...]       A single character not listed and not contained within a listed range.
[:class:]    POSIX-style character class valid only within a regex character class.
.            Any character except newline (unless single-line mode, /s).
\C           One byte; however, this may corrupt a Unicode character stream.
\w           Word character, [a-zA-Z0-9_].
\W           Non-word character, [^a-zA-Z0-9_].
\d           Digit character, [0-9].
\D           Non-digit character, [^0-9].
\s           Whitespace character, [\n\r\f\t ].
\S           Non-whitespace character, [^\n\r\f\t ].
Table 1-28: Anchors and zero-width tests
Sequence     Meaning
Additional content appearing in this section has been removed.
PHP
This reference covers PHP 4.3's Perl-style regular expression support contained within the preg routines. PHP also provides POSIX-style regular expressions, but these do not offer additional benefit in power or speed. The preg routines use a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2.
PHP supports the metacharacters and metasequences listed in Table 1-31 through Table 1-35. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-31: Character representations
Sequence     Meaning
\a           Alert (bell), x07.
\b           Backspace, x08, supported only in character class.
\e           ESC character, x1B.
\n           Newline, x0A.
\r           Carriage return, x0D.
\f           Form feed, x0C.
\t           Horizontal tab, x09.
\octal       Character specified by a three-digit octal code.
\xhex        Character specified by a one- or two-digit hexadecimal code.
\x{hex}      Character specified by any hexadecimal code.
\cchar       Named control character.
Table 1-32: Character classes and class-like constructs
Class        Meaning
[...]        A single character listed or contained within a listed range.
[^...]       A single character not listed and not contained within a listed range.
[:class:]    POSIX-style character class valid only within a regex character class.
.            Any character except newline (unless single-line mode, /s).
\C           One byte; however, this may corrupt a Unicode character stream.
\w           Word character, [a-zA-Z0-9_].
\W           Non-word character, [^a-zA-Z0-9_].
\d           Digit character, [0-9].
\D           Non-digit character, [^0-9].
\s           Whitespace character, [\n\r\f\t ].
\S           Non-whitespace character, [^\n\r\f\t ].
Table 1-33: Anchors and zero-width tests
Sequence     Meaning
^            Start of string, or after any newline if in multiline match mode, /m.
\A           Start of search string, in all match modes.
$            End of search string or before a string-ending newline, or before any newline if in multiline match mode,
Additional content appearing in this section has been removed.
vi Editor
The vi program is a popular text editor on all Unix systems, and Vim is a popular vi clone with expanded regular expression support. Both use a DFA match engine. For an explanation of the rules behind a DFA engine, see Section 1.2.
Table 1-36 through Table 1-40 list the metacharacters and metasequences supported by vi. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-36: Character representation
Sequence     Meaning
Vim only
\b           Backspace, x08.
\e           ESC character, x1B.
\n           Newline, x0A.
\r           Carriage return, x0D.
\t           Horizontal tab, x09.
Table 1-37: Character classes and class-like constructs
Class        Meaning
[...]        Any character listed or contained within a listed range.
[^...]       Any character that is not listed or contained within a listed range.
[:class:]    POSIX-style character class valid only within a character class.
.            Any character except newline (unless /s mode).
Vim only
\w           Word character, [a-zA-Z0-9_].
\W           Non-word character, [^a-zA-Z0-9_].
\a           Letter character, [a-zA-Z].
\A           Non-letter character, [^a-zA-Z].
\h           Head of word character, [a-zA-Z_].
\H           Not the head of a word character, [^a-zA-Z_].
\d           Digit character, [0-9].
\D           Non-digit character, [^0-9].
\s           Whitespace character, [ \t].
\S           Non-whitespace character, [^ \t].
\x           Hex digit, [a-fA-F0-9].
\X           Non-hex digit, [^a-fA-F0-9].
\o           Octal digit, [0-7].
\O           Non-octal digit, [^0-7].
\l           Lowercase letter, [a-z].
\L           Non-lowercase letter, [^a-z].
\u           Uppercase letter, [A-Z].
\U           Non-uppercase letter, [^A-Z].
\i           Identifier character defined by isident.
\I           Any non-digit identifier character.
\k           Keyword character defined by iskeyword, often set by language modes.
\K           Any non-digit keyword character.
\f           Filename character defined by isfname. Operating system dependent.
\F           Any non-digit filename character.
\p           Printable character defined by isprint, usually x20-x7E.
\P           Any non-digit printable character.
Additional content appearing in this section has been removed.
JavaScript
JavaScript introduced Perl-like regular expression support with Version 1.2. This reference covers Version 1.5 as defined by the ECMA standard. Supporting implementations include Microsoft Internet Explorer 5.5+ and Netscape Navigator 6+. JavaScript uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2.
JavaScript supports the metacharacters and metasequences listed in Table 1-41 through Table 1-45. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-41: Character representations
Sequence     Meaning
\0           Null character, \x00.
\b           Backspace, \x08, supported only in character class.
\n           Newline, \x0A.
\r           Carriage return, \x0D.
\f           Form feed, \x0C.
\t           Horizontal tab, \x09.
\v           Vertical tab, \x0B.
\xhh         Character specified by a two-digit hexadecimal code.
\uhhhh       Character specified by a four-digit hexadecimal code.
\cchar       Named control character.
Table 1-42: Character classes and class-like constructs
Class        Meaning
[...]        A single character listed or contained within a listed range.
[^...]       A single character not listed and not contained within a listed range.
.            Any character except a line terminator, [^\x0A\x0D\u2028\u2029].
\w           Word character, [a-zA-Z0-9_].
\W           Non-word character, [^a-zA-Z0-9_].
\d           Digit character, [0-9].
\D           Non-digit character, [^0-9].
\s           Whitespace character.
\S           Non-whitespace character.
Table 1-43: Anchors and other zero-width tests
Sequence     Meaning
^            Start of string, or after any newline if in multiline match mode, /m.
$            End of search string, or before any newline if in multiline match mode, /m.
\b           Word boundary.
\B           Not-word-boundary.
(?=...)      Positive lookahead.
(?!...)      Negative lookahead.
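Lookahead tests a condition at the current position without consuming any text. The table above describes JavaScript, but the sketch below is written in Python, whose re module accepts the same (?=...) and (?!...) syntax; treat it as an illustration of the constructs rather than of JavaScript's API. The sample string is invented.

```python
import re

source = "foo(bar) baz(qux) quux"

# Positive lookahead: words immediately followed by an opening parenthesis.
# The parenthesis is checked but not included in the match.
called = re.findall(r"\w+(?=\()", source)

# Negative lookahead: whole words that are NOT followed by an opening
# parenthesis. The \b anchors keep the engine from matching partial words.
not_called = re.findall(r"\b\w+\b(?!\()", source)

print(called)      # ['foo', 'baz']
print(not_called)  # ['bar', 'qux', 'quux']
```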
Table 1-44: Mode modifiers
Modifier     Meaning
Additional content appearing in this section has been removed.
Shell Tools
awk, sed, and egrep are a related set of Unix shell tools for text processing. awk and egrep use a DFA match engine, and sed uses an NFA engine. For an explanation of the rules behind these engines, see Section 1.2.
This reference covers GNU egrep 2.4.2, a program for searching lines of text; GNU sed 3.02, a tool for scripting editing commands; and GNU awk 3.1, a programming language for text processing.
awk, egrep, and sed support the metacharacters and metasequences listed in Table 1-46 through Table 1-50. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-46: Character representations
Sequence        Meaning                                                              Tool
\a              Alert (bell).                                                        awk, sed
\b              Backspace; supported only in character class.                        awk
\f              Form feed.                                                           awk, sed
\n              Newline (line feed).                                                 awk, sed
\r              Carriage return.                                                     awk, sed
\t              Horizontal tab.                                                      awk, sed
\v              Vertical tab.                                                        awk, sed
\ooctal         A character specified by a one-, two-, or three-digit octal code.    sed
\octal          A character specified by a one-, two-, or three-digit octal code.    awk
\xhex           A character specified by a two-digit hexadecimal code.               awk, sed
\ddecimal       A character specified by a one-, two-, or three-digit decimal code.  awk, sed
\cchar          A named control character (e.g., \cC is Control-C).                  awk, sed
\b              Backspace.                                                           awk
\metacharacter  Escape the metacharacter so that it literally represents itself.     awk, sed, egrep
Table 1-47: Character classes and class-like constructs
Class      Meaning                                                                               Tool
[...]      Matches any single character listed or contained within a listed range.               awk, sed, egrep
[^...]     Matches any single character that is not listed or contained within a listed range.   awk, sed, egrep
.          Matches any single character, except newline.                                         awk, sed, egrep
\w         Matches an ASCII word character, [a-zA-Z0-9_].                                        egrep, sed
\W         Matches a character that is not an ASCII word character, [^a-zA-Z0-9_].               egrep, sed
[:prop:]   Matches any character in the POSIX character class.                                   awk, sed
Additional content appearing in this section has been removed.
