In this
chapter, we’re going to pretend
we’re working on a web site that was developed by someone
working in a Windows environment (where filenames are
case-insensitive, and HTML files typically get
.htm
filename extensions). Now, however, that site
has been moved to a Unix web server, and you wish to make the
following changes:
Rename all the .htm files so that the filenames end with .html.
Change the filenames (some of which feature uppercase letters) so that they are uniformly lowercase.
Modify all the HREF attributes contained in those pages so that they match your changes to the filenames.
If you had only a few files to deal with, you could just do all this
manually. Filenames could be changed one at a time using the Unix
mv
(for “move”) command, which has the
effect of renaming the file whose name is given in its first argument
to the name given in its second argument:
[jbc@andros testsite]$ mv Index.HTM index.html
You could then edit the HREF attributes of each file in a text editor, changing <A HREF="Index.HTM"> to <A HREF="index.html">. And so on.
But what if you have a lot of files that you want to manipulate? At a certain point, the effort of manually making all those changes (and policing the errors that will inevitably creep in as you grind your way through this boring task) is going to be greater than the effort of writing a tool to make the changes for you. At this early stage in your education that break-even point will come later (since creating the tool will be a slower process than it will be later on), but you can think of the effort involved as an educational investment that will pay you back many times over in future productivity gains. The plodding, manual approach offers no such promise of future rewards.
In any event, let’s get started.
Here’s
an
ls
listing of a
directory containing some of those mixed-case filenames:
[jbc@andros testsite]$ ls
Clinton.JPG Hello_CGI.htm Sample_Form.htm
Form_to_Email.HTM Hello_Command.HTM guestbook_email.htm
Guestbook.HTM NEXT.HTM index.htm
The first thing to do is figure out how you’re going to feed all those filenames to your Perl script so that it can do its modifying. There are many ways to do that, but the method you’re going to use here is a feature of the Unix shell called pathname expansion, or, more colloquially, globbing.
If you come from a Windows background, you’re probably familiar with the use of an asterisk (*, pronounced “star”) as a wildcard character when specifying a filename. The asterisk stands for “any number of characters, including no characters.” This legacy of the DOS command-line environment shows up in dialog boxes’ Filename fields, where you sometimes see things like *.doc to represent “all filenames ending with the characters .doc”, or *.* to represent “all filenames whatsoever.”
DOS/Windows filenames are divided into two parts: the filename itself
and a three-character extension, with the period character
(.
, pronounced “dot”) serving as the
separator between the two parts. Unix filenames don’t feature
the notion of a filename extension, at least not in the formal sense
that DOS filenames do. You’re free to stick a
.plx
or .txt
or
.walnuts
on the end of your Unix filenames, but
the operating system doesn’t care that you’ve done so.
You can also stick multiple periods in a filename, so you could have
a file called this.filename.has.lots.of.dots
.
Why am I carrying on about this? For only one reason: in DOS or Windows, the wildcard sequence that allows you to specify “every filename whatsoever” is *.*. In Unix, it’s just *. Let’s try it out now.
Tip
The statement that *
stands for “every
filename whatsoever” in Unix and Unix-like systems isn’t
quite accurate. Files whose names begin with a single period are
“hidden,” and won’t be matched by a
*
wildcard sequence. To match those names, you
would need something like .*
(“dot
star”).
You may not have realized it before this, but you can supply a
filename as an argument to the ls
command, in
which case ls
will list information about that
file only:
[jbc@andros testsite]$ ls index.htm
index.htm
In fact, you can give a whole bunch of filenames, and
ls
will dutifully list just those filenames (or,
more precisely, just those of the names that correspond to files in
the current directory):
[jbc@andros testsite]$ ls index.htm Hello_Command.HTM guestbook_email.htm
Hello_Command.HTM guestbook_email.htm index.htm
Now, the tricky and cool part is that you can use an asterisk as a
wildcard character, and it will be interpreted as “any
character, or any number of characters, including no characters, that
will match a filename in the current directory.” So, as
mentioned before, a *
all by itself means
“match any filename in the current directory at all”
(except those starting with dots):
[jbc@andros testsite]$ ls *
Clinton.JPG Hello_CGI.htm Sample_Form.htm
Form_to_Email.HTM Hello_Command.HTM guestbook_email.htm
Guestbook.HTM NEXT.HTM index.htm
That’s the same output we got with ls all by itself because ls’s default behavior is to list all the files in the current directory, and that’s (almost) the same thing as saying ls *.
Tip
I said “almost” in the preceding sentence because of
subdirectories. In this example no subdirectories were contained
within the current directory. If there had been one or more
subdirectories, and if those subdirectories’ names did not
begin with a dot (.
), the output of ls *
would have included the contents of those directories,
which invoking ls
by itself would
not have done. That happens because the
*
matches the names of those subdirectories, which
means the ls
command would have received those
subdirectory names as explicit command-line arguments, and when
ls
gets a subdirectory name as an argument, it
displays the contents of that directory. If that sounded confusing,
just ignore it until later, when it will make more sense.
But let’s say we want to list only the filenames ending in
.htm
. We can just do this:
[jbc@andros testsite]$ ls *.htm
Hello_CGI.htm Sample_Form.htm guestbook_email.htm index.htm
Hmm. There’s that Unix
case-sensitivity thing again: we
only got the files with lowercase .htm
filename
endings. But what about those uppercase .HTM
files? How can we list those along with the .htm
ones? Well, one easy way to do it is by just adding
*.HTM
to the command’s arguments, like this:
[jbc@andros testsite]$ ls *.htm *.HTM
Form_to_Email.HTM Hello_CGI.htm NEXT.HTM guestbook_email.htm
Guestbook.HTM Hello_Command.HTM Sample_Form.htm index.htm
If you think this wildcard-expansion thing is fun, see “More Fun with Shell Expansion” for even niftier tricks.
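If you'd like to poke at these wildcard rules without touching any real files, Perl's built-in glob function follows much the same rules as the shell. What follows is only a sketch, not part of this chapter's script: the scratch directory, the sample filenames, and the names_matching helper are all made up for illustration, and it assumes a Unix-like system where filenames are case-sensitive.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Temp qw(tempdir);

# Build a scratch directory so the experiment never touches real files.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(index.htm NEXT.HTM Clinton.JPG .hidden.htm)) {
    open my $fh, '>', "$dir/$name" or die "can't create $name: $!";
    close $fh;
}

# Return the filenames (minus the directory part) that match a wildcard pattern.
sub names_matching {
    my ($pattern) = @_;
    my @names = map { basename($_) } glob("$dir/$pattern");
    return sort @names;
}

for my $pattern ('*.htm', '*.HTM', '*', '.*.htm') {
    my @names = names_matching($pattern);
    printf "%-8s matches: %s\n", $pattern, "@names";
}
```

Notice that the hidden file shows up only when the pattern itself starts with a dot, and that *.htm and *.HTM each pick up a different file.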
Now, a subtle but potentially very powerful point about all this is
that it isn’t actually the ls
command that
is expanding that *
into a list of matching
filenames. In fact, it’s the shell that is doing so, and it is
only after the shell has done the expansion that it hands off that
list of filenames as the argument to ls
. In other
words, the ls
command never sees the literal star
(*
) character. It only sees the list of filenames
that are the result of the shell’s expansion of
*
. This is powerful because you are not limited to
using filename expansion only in the arguments to the
ls
command. You can also use it in the arguments
to any command, including your own custom Perl
programs.
In still other words, you can use the shell’s wildcard expansion as a convenient, flexible way to hand off a list of specific filenames to your Perl program for processing.
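If you want to convince yourself that it really is the shell doing the expanding, here is a rough sketch that runs the same tiny program twice: once directly, with no shell involved, and once through the shell. It assumes a Unix-like system with a standard shell, and the scratch directory, the one-line child program, and the two helper names are all invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory with two sample files, so nothing real is touched.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.htm b.htm)) {
    open my $fh, '>', "$dir/$name" or die "can't create $name: $!";
    close $fh;
}

my $perl  = $^X;                                # path to the perl running this script
my $child = 'print join(" ", @ARGV), "\n"';     # child program: echo its arguments

# Run the child directly (list form of open), so no shell gets involved.
sub args_without_shell {
    my ($pattern) = @_;
    open my $pipe, '-|', $perl, '-e', $child, $pattern
        or die "can't run $perl: $!";
    my $seen = <$pipe>;
    close $pipe;
    chomp $seen;
    return $seen;
}

# Run the child through the shell (backticks), which expands the wildcard first.
sub args_via_shell {
    my ($pattern) = @_;
    my $seen = `"$perl" -e '$child' $pattern`;
    chomp $seen;
    return $seen;
}

print "without shell: ", args_without_shell("$dir/*.htm"), "\n";
print "via shell:     ", args_via_shell("$dir/*.htm"), "\n";
```

Run directly, the child sees the literal star in its argument. Run through the shell, it sees the two matching filenames instead.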
Let’s see a Perl program that renames all the files for you. We’ll build the script from scratch, modifying things as we go along; see Example 4-1 for the final script:
#!/usr/bin/perl -w

# rename.plx - rename files so they end in '.html'

foreach $file (@ARGV) {
    print "got $file\n";
}
Let’s look at this line by line. The first line is just the
usual shebang line, with
warnings turned on via the
-w
switch. As mentioned before, if you are using a
Perl version equal to or later than 5.6.0, you can do the same thing
with a use warnings
statement at the beginning of
your script. Next is a comment giving the name of the script and a
brief description of what it does (or will do, once we’re done
creating it).
The next three lines contain a
foreach
loop. A foreach
loop, you will recall, processes each element of an array variable or
list, sticking the current item into the scalar variable whose name
is given between the foreach
keyword and the list,
so that item can be accessed during the current trip through the
loop.
In this case, the array being processed by the
foreach
loop is the special array
@ARGV
. What is the
@ARGV
array variable,
you ask? Well, it turns out to be something special: every script
gets it automatically every time it runs, and it contains a list of
whatever words came after the script’s name on the command
line. We call these additional words arguments
.
So this foreach
loop will run once for each of the
script’s arguments, storing the current argument in the
variable $file
and printing out each element in
@ARGV
via the print "got $file\n"
statement. (Later, we’ll stick the code for
renaming the files inside this loop. But for right now, we’ll
print a message just to inform us that the foreach
loop works.)
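Here is a small variation on that loop, offered as a sketch rather than as a replacement for the script above. It assumes you want to run under use strict (so the loop variable is declared with my), and it numbers the arguments as it goes. The describe_args helper and the fallback filenames are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one message per argument, numbering the arguments from 1.
sub describe_args {
    my @args = @_;
    my @lines;
    my $n = 0;
    foreach my $file (@args) {
        $n++;
        push @lines, "argument $n: got $file";
    }
    return @lines;
}

# Use the real command-line arguments if there are any;
# otherwise fall back to a couple of sample names.
my @args = @ARGV ? @ARGV : qw(Hello_CGI.htm index.htm);
print "$_\n" for describe_args(@args);
```

Run it with *.htm as its argument and you get one numbered line per filename that the shell handed over.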
Now, remember that the argument list in @ARGV
will
reflect the result of wildcard expansion by the shell. So, running
rename.plx
in the directory from our earlier
example and giving it an argument of *.htm
results
in the following output:
[jbc@andros testsite]$ rename.plx *.htm
got Hello_CGI.htm
got Sample_Form.htm
got guestbook_email.htm
got index.htm
Likewise, running it with an argument of *.htm *.HTM
gives this:
[jbc@andros testsite]$ rename.plx *.htm *.HTM
got Hello_CGI.htm
got Sample_Form.htm
got guestbook_email.htm
got index.htm
got Form_to_Email.HTM
got Guestbook.HTM
got Hello_Command.HTM
got NEXT.HTM
Now that we’ve seen how to feed a list of filenames to a script
and run a foreach
loop that processes each
filename, here’s a modified version of
rename.plx
that actually renames files. (If you
are creating this script yourself, though, please don’t run it
yet; we still need to add some safety features to it.)
#!/usr/bin/perl -w

# rename.plx - rename files so they end in '.html'

foreach $file (@ARGV) {
    $new = lc $file;
    $new = $new . 'l';
    rename $file, $new or die "couldn't rename $file to $new: $!";
}
The script is the same, except for the part inside the
foreach
loop’s block. Let’s look at
that line by line.
First comes this:
$new = lc $file;
This takes the name of the current file being processed through the
foreach
loop, makes a lowercase version of it
using Perl’s lc
function, and assigns that
lowercase filename to a new scalar variable called
$new
.
$new = $new . 'l';
The next line takes that new, lowercase version of the filename and
adds a lowercase letter l
to the end of it, using
the
.
(“dot”)
operator, which is also called the string concatenation
operator
because it joins, or
concatenates
, the string on its left with the
string on its right, returning the concatenated string.
Because all we’re doing with that concatenated string is
storing it back into the $new
variable, we can
actually write this line a little more concisely using the special
operator .=
(which I guess you could pronounce
“dot equals”). The .=
operator has the
effect of appending a string to the string currently stored in a
variable, and then sticking the concatenated string back into that
variable:
$new .= 'l';
Shortening $new = $new . 'l'
into $new .= 'l'
is a bit like using a contraction when speaking (e.g.,
saying “won’t” instead of “will not”),
and is one of those natural-language-inspired shortcuts in Perl.
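To see that the long and short forms really do the same thing, here is a quick sketch that computes new names without renaming anything. The two helper subroutines and the sample filenames are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Long form: concatenate with the dot operator, then store the result back.
sub new_name_long {
    my ($file) = @_;
    my $new = lc $file;
    $new = $new . 'l';
    return $new;
}

# Short form: append in place with the .= operator.
sub new_name_short {
    my ($file) = @_;
    my $new = lc $file;
    $new .= 'l';
    return $new;
}

for my $file (qw(Index.HTM NEXT.HTM guestbook_email.htm)) {
    printf "%-20s -> %-22s (long form gives %s)\n",
        $file, new_name_short($file), new_name_long($file);
}
```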
Next comes the line that does the actual work:
rename $file, $new or die "couldn't rename $file to $new: $!";
Here, Perl’s rename
function is used to take
the file named by $file
and rename it to
$new
. If that rename
operation
fails, the or die
part of the line kicks in,
terminating the script and printing an error message that includes
$!
, the special variable containing the error
message returned by the system when the operation failed. If the
rename
function succeeds, everything after the
or
gets skipped, so the script continues happily
to the next pass through the foreach
block.
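If you'd like to watch the or die part kick in without risking any real files, the following sketch does its renaming inside a throwaway directory. The scratch directory, the rename_or_die helper, and the use of eval to catch the failure are all assumptions made for this illustration; the chapter's script simply lets die end the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway directory containing a single test file.
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/Index.HTM" or die "can't create test file: $!";
close $fh;

# Rename one file, dying with the system's error message if that fails.
sub rename_or_die {
    my ($old, $new) = @_;
    rename $old, $new or die "couldn't rename $old to $new: $!\n";
    return 1;
}

# The first attempt should succeed.
rename_or_die("$dir/Index.HTM", "$dir/index.html");
print "first attempt worked\n";

# The second attempt should fail, because the old name no longer exists.
# The eval block catches the die so we can look at the message.
my $ok = eval { rename_or_die("$dir/Index.HTM", "$dir/index.html"); 1 };
print "second attempt failed: $@" unless $ok;
```

The exact wording that $! contributes to the message comes from the operating system, so it may differ from one machine to the next.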
Your assembly line appears to be ready to go. If you’re impatient, you’re probably anxious to run your script right now. Resist that impulse. This is the time to look things over carefully with a pessimistic eye, asking yourself what could possibly go wrong and trying to prevent any nasty accidents. Measure twice, cut once, and all that. Swiss Army chainsaw. Hole Hawg.
One potential concern with this script is that the assembly line
doesn’t have a quality control inspector. Every item mentioned
in @ARGV
gets an l
appended to
it, and then the script tries to rename a file from the old name to
the new one. Now, we already discussed how you were planning to run
this script with a carefully crafted argument of *.htm *.HTM
, which would be expanded by the shell into a list of
just the files you wanted, but consider what would happen if you
accidentally invoked the script like this:
[jbc@andros testsite]$ rename.plx *
You could accidentally glob up files other than the ones you wanted, appending l’s to their names, too. Bad idea.
The answer (one answer, at least) is to put some new code in the
foreach
block that skips to the next file without
doing the renaming if something doesn’t look right. This is a
simple example of a common programming practice called a
sanity check
.
For example, you might want to add a sanity check that excludes files from being renamed if an existing file already has the new name. You could do that by adding the following code just before the line where you rename the file:
if (-e $new) {
    warn "$new already exists. Skipping...\n";
    next;
}
This uses Perl’s -e file test operator, which returns true if the filename given after it corresponds to an existing file. In this case, that means it returns true if $new (which contains the version of the filename with the l added to the end) already exists.
If that happens, this if
block will execute,
causing a warning to be printed via the
warn
function. The
warn
function is similar to the
die
function, in that it causes your
script to complain to
standard
error (printing a message to your screen in the case of a script run
manually, or to the web server’s error log for a CGI script).
Unlike the die
function, though,
warn
lets your script continue running after that
point.
After issuing the warning, the if
block uses the
next
function to make your script jump immediately
to the next item in the foreach
loop, without
executing the rest of the statements in the loop. In other words, it
causes the script to skip the rename
operation for
this file.
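Here is a sketch of that check in action, run against a scratch directory in which one of the new names is already taken. The directory, the sample filenames, and the @renamed and @skipped bookkeeping arrays are made up for illustration, and the sketch assumes a case-sensitive filesystem.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory; note that index.html already exists.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(NEXT.HTM index.htm index.html)) {
    open my $fh, '>', "$dir/$name" or die "can't create $name: $!";
    close $fh;
}

my (@renamed, @skipped);
foreach my $file (qw(NEXT.HTM index.htm)) {
    my $new = lc $file;
    $new .= 'l';

    # Sanity check: leave the file alone if the new name is already in use.
    if (-e "$dir/$new") {
        warn "$new already exists. Skipping...\n";
        push @skipped, $file;
        next;
    }

    rename "$dir/$file", "$dir/$new"
        or die "couldn't rename $file to $new: $!";
    push @renamed, $new;
}

print "renamed: @renamed\n";
print "skipped: @skipped\n";
```

Because of the next, the rename line is never reached for index.htm, so the existing index.html is left untouched.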
Another sanity check would skip files that were anything other than “plain” files. This would prevent the script from renaming a file that actually was a directory, for example. (Be aware that the test used here follows symbolic links, so it won’t catch a symbolic link that points to a plain file. In Unix, a symbolic link is a special file that actually just points to some other file; Perl’s -l file test operator checks for the link itself.) Here’s how you could implement that sanity check:
unless (-f $file) {
    warn "$file is not a plain file. Skipping...\n";
    next;
}
This check uses Perl’s -f
file test operator, which returns true
if the filename given in its argument corresponds to a plain file. We
used unless
here instead of if
because we wanted to reverse the sense of the logical test. In other
words, we wanted to execute the statements in the block only if the
conditional test returned false rather than true. This is precisely
what you get with unless
.
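To get a feel for what the plain-file test accepts and rejects, here is a sketch that sets up a scratch directory containing an ordinary file and a subdirectory with a misleading name. The directory, the names in it, and the kind_of helper are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory: one ordinary file, and one directory whose name ends in .htm.
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/index.htm" or die "can't create test file: $!";
close $fh;
mkdir "$dir/images.htm" or die "can't create test directory: $!";

# Describe a path using Perl's file test operators.
sub kind_of {
    my ($path) = @_;
    return 'missing'    unless -e $path;
    return 'plain file' if -f $path;
    return 'directory'  if -d $path;
    return 'something else';
}

foreach my $name (qw(index.htm images.htm no_such_file.htm)) {
    printf "%-18s %s\n", $name, kind_of("$dir/$name");
}
```

Only the first of those three names would get past an unless (-f ...) check.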
We’re on quite a roll with these sanity checks, but let’s
add two more before we stop. First, let’s prevent files from
being renamed if the original name doesn’t end in
.htm
(or .HTM
, or any other
case-insensitive variation of that three-letter filename extension).
Also, let’s prevent files from being renamed if their names
contain forward slashes. That way, at least on a Unix system (where
forward slashes are used to separate the directory names in a path),
the renaming will be confined to the current directory. To add these
features, insert the following before the rename:
unless ($file =~ /\.htm$/i) {
    warn "$file doesn't end in .htm or .HTM. Skipping...\n";
    next;
}

if ($file =~ /\//) {
    warn "$file contains a slash. Skipping...\n";
    next;
}
These logical tests are very interesting. Both of them use the
=~
operator to tie the $file
variable to a pattern
matching operator based on Perl’s regular expressions feature.
In each case, the pattern matching operator checks the name stored in
the $file
variable to see if it matches a
particular search pattern, and returns a true value if it does or a
false value if it doesn’t.
Before continuing, let’s talk about regular expressions and the associated pattern matching operators a bit more.
Regular expressions
(which you’ll sometimes hear me
refer to as regexes
) are extremely powerful. For
a beginning programmer, though, they’re almost
too powerful; they can seem weird and scary and
needlessly complicated. Still, you need to stick with them because
they’re important.
As I’ve said, regular expressions are a tool for matching (and, potentially, replacing) specific patterns in a string of text. If you’ve used the “Find” or “Search and Replace” function in a word processor, you have some idea of what regular expressions do, but Perl’s regular expressions are much more powerful than that. Their rich (that is to say, confusingly complex) syntax allows you to specify with astonishing precision exactly what patterns you are looking for and what you want done to them.
In later chapters I’ll be explaining more about regular expressions. For now, let’s just look at a few examples to get an idea of how they work.
We’ll
start with the one that looks for filename
extensions like .htm
(or .HTM
,
etc.). The whole expression looks like this:
/\.htm$/i
. The first thing you need to be able to
do is break the expression down into its component parts. As Figure 4-1 shows, there are four different parts to this
expression: the opening delimiter (/
), the search
pattern (\.htm$
), the closing delimiter
(/
), and an optional modifier (i
).
The
delimiters are pretty straightforward: a
slash to mark the beginning of the expression, and a slash to mark
the end. The trailing modifier is easy to understand, too: this
particular modifier (which you’ll typically see referred to as
the /i
modifier) simply makes the expression match
case-insensitively.
It’s the regular expression pattern itself, the part between
the delimiters, where the powerful magic hangs out. Regular
expression patterns use their own specialized language, with lots of
special rules and symbols. This pattern is actually fairly simple:
\.htm$
. Let’s go through it piece by piece,
from left to right.
First, the leading backslash-plus-a-period (\.
)
matches a literal period character. That should give you a hint: a
period without a leading backslash does
something special in a regular expression. I’m not going to
tell you what that something special is until later, because
I’d rather you used that part of your brain to remember the
following helpful rule about regular expression patterns. An
alphanumeric
character (the characters A
through
Z
, a
through
z
, and 0
through
9
) always just stands for itself. A
nonalphanumeric character, though, can sometimes mean something
special. About a dozen of these nonalphanumeric characters have
special meanings inside a regex; I’ll be introducing them as we
go along.
Stick a backslash (\
) in front of a
nonalphanumeric character in your regex, though, and that special
character will always revert to having its ordinary, literal meaning
for matching purposes. That’s what we’ve done in this
pattern: we wanted to match a literal period, so we put a backslash
in front of it.
The next three characters (htm
) just match
themselves. That is, they will match the literal characters
h
, and t
, and
m
, one right after the other, in that order. Also,
because of that trailing /i
modifier, each will
also match the uppercase version of itself, such that
HTM
(and hTM
, and
Htm
, etc.) would all match, too.
Alphanumeric characters work in the opposite way from nonalphanumeric
characters. What I mean is, an alphanumeric character always stands
for itself, unless you put a backslash in front of it, in which case
it gets some special meaning (like \n
, which gives
you a newline in a regex pattern, just like it does in a
double-quoted string).
All of which brings us to the last thing in this pattern: the
trailing $
. It’s not an alphanumeric
character, and it doesn’t have a leading backslash, so that
should give you a hint that it might be doing something special. And
in fact it is: when a
dollar sign ($
) is used at
the very end of a regular expression pattern, it means that the
pattern that precedes it can match only if it occurs at the end of
the string. In other words, the $
doesn’t
match anything itself, but it makes it so that the rest of the
pattern can match only if it comes at the very end of the string
being matched against.
So, in this particular example, our pattern will match a string only if that string ends with the literal sequence .htm (or .HTM, .HtM, or whatever). A string like this: “this string has an .htm, but not at the end” would not produce a match with this particular pattern (but take out the $ at the end of the pattern, and it would).
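A quick way to build confidence in a pattern like this is to throw a handful of strings at it and see which ones match. The following is a sketch along those lines; the ends_in_htm helper and the sample strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the name ends in .htm (in any mix of upper- and lowercase), else 0.
sub ends_in_htm {
    my ($name) = @_;
    return $name =~ /\.htm$/i ? 1 : 0;
}

my @samples = (
    'index.htm',             # lowercase extension
    'NEXT.HTM',              # uppercase extension, allowed by the i modifier
    'page.HtM',              # mixed case, also allowed
    'index.html',            # .htm is there, but not at the very end
    'an.htm.in.the.middle',  # same problem
    'xhtm',                  # no literal period before htm
    'Clinton.JPG',           # different extension entirely
);

for my $name (@samples) {
    printf "%-22s %s\n", $name, ends_in_htm($name) ? 'matches' : 'no match';
}
```

Only the first three samples match; the others fail for the reasons given in their comments.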
Now let’s look at the regex in the don’t-allow-any-slashes
sanity check: /\//
. This expression is actually a
good deal simpler than the first one. There’s just an opening
delimiter (/
), the pattern itself
(\/
), and the closing delimiter
(/
). The pattern itself just matches a literal
slash character, courtesy of the backslash in front of it. Without
that backslash, Perl would think the slash in the pattern was
actually the closing delimiter.
But for a simple pattern it sure looks
confusing. The slash character doesn’t have any special meaning
in the regex pattern itself; it only has to be backslashed because of
its role as the pattern’s delimiter. It would be really nice if
there was a way to use some other character to delimit the search
pattern in this case, so we didn’t have to backslash the slash.
And, as it turns out, there is a way to do that: put an
m
(for “matching operator”) in front
of the expression, and then choose whatever we want for the
delimiter. So, for example, that same regex could have been written
as m#/#
, or m|/|
, either of
which is arguably more readable than the original version. We also
could choose a paired delimiter, like parentheses or braces, in which
case the closing delimiter would be the closing member of the pair:
m{/}
. That one’s my personal favorite, so let’s update the code in rename.plx to use that version.
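If you want to check that these different spellings really are interchangeable, here is a sketch that tries three of them on the same strings. The three helper subroutines and the sample paths are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Three spellings of the same test: does the name contain a slash?
sub slash_classic { my ($name) = @_; return $name =~ /\// ? 1 : 0; }   # backslashed slash
sub slash_hash    { my ($name) = @_; return $name =~ m#/# ? 1 : 0; }   # m with # delimiters
sub slash_braces  { my ($name) = @_; return $name =~ m{/} ? 1 : 0; }   # m with paired braces

for my $name ('index.htm', 'subdir/index.htm', '/etc/passwd') {
    printf "%-18s classic=%d hash=%d braces=%d\n",
        $name, slash_classic($name), slash_hash($name), slash_braces($name);
}
```

All three columns agree on every line, so which spelling to use comes down to which one you find easiest to read.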
Summing up, the first of our regex-using sanity checks, which begins with this line:
unless ($file =~ /\.htm$/i) {
will fire off only if the filename in $file
fails
to end in the literal string .htm
(or
.HTM
, etc.). The second of our regex-using sanity
checks, which now begins with this line:
if ($file =~ m{/}) {
will fire off only if the filename in $file
contains a slash character.
We
could go on adding sanity checks all day,
but I think we’ve been sufficiently paranoid for now. Now that
the rename.plx
script is finished, it should look
like Example 4-1 (which you can download from this
book’s script repository, at http://www.elanus.net/book/, if you want to
play around with it).
Example 4-1. A script for renaming listed files to have lowercase filenames ending in .html
#!/usr/bin/perl -w

# rename.plx - rename files so they end in '.html'

foreach $file (@ARGV) {
    $new = lc $file;
    $new .= 'l';

    if (-e $new) {
        warn "$new already exists. Skipping...\n";
        next;
    }

    unless (-f $file) {
        warn "$file is not a plain file. Skipping...\n";
        next;
    }

    unless ($file =~ /\.htm$/i) {
        warn "$file doesn't end in .htm or .HTM. Skipping...\n";
        next;
    }

    if ($file =~ m{/}) {
        warn "$file contains a slash. Skipping...\n";
        next;
    }

    rename $file, $new or die "couldn't rename $file to $new: $!";
}
Running it in our directory full of wackily named files, and using
ls
to look at the filenames before and after,
results in the following:
[jbc@andros testsite]$ ls
Clinton.JPG Hello_CGI.htm Sample_Form.htm rename.plx
Form_to_Email.HTM Hello_Command.HTM guestbook_email.htm
Guestbook.HTM NEXT.HTM index.htm
[jbc@andros testsite]$ rename.plx *.htm *.HTM
[jbc@andros testsite]$ ls
Clinton.JPG guestbook_email.html index.html sample_form.html
form_to_email.html hello_cgi.html next.html
guestbook.html hello_command.html rename.plx
All the files whose names ended in .htm
or
.HTM
have been renamed so that their filenames are
uniformly lowercase and have .html
extensions.