Perl is a high-level language, so I don’t have to play with bits and bytes to get my job done. The trade-off, however, is that I have to let Perl manage how it stores everything. What if I want to control that? And what about the rest of the world that packs a lot of information into single bytes, such as Unix file permissions? Or what if my array of tens of thousands of numbers takes up too much memory? Falling back to working with the bits can help that.
Almost all of us deal with binary computers, even to the point that it seems redundant to say “binary.” When it gets down to the lowest levels, these deal with on or off, or what we’ve come to call 1 or 0. String enough of those 1s and 0s together, and I have the instructions to tell a computer to do something or the physical representation of data on a disk. And, although most of us don’t have to deal with computers at this level, some of this thinking has reached into high-level programming because we have to deal with lower levels at some point.
Consider, for instance, the arguments that I use with mkdir
, chmod
, or dbmopen
to set the file mode (also known as
permissions, but actually more than just that). Although I write the mode
as a single base-8 number, its meaning depends on its particular bit
pattern:
mkdir $dir, 0755; chmod 0644, @files; dbmopen %HASH, $db_file, 0644;
I also get the file mode as one of the return values from stat
:
my $file_mode = ( stat( $file ) )[2];
On Unix and Unix-like systems, that file mode packs in a lot of information, including the file permissions for the owner, group, and others, as well as bits for setuid, setgid, and some other stuff. Once I have it, I need to pick it apart. Perl has all of the necessary operators to do this, but I’ll get to those later.
In some cases I might want to use a sequence of bits to represent a series of values. By giving every bit (or group of bits) meaning, I can use a single scalar to store multiple values while only incurring the scalar variable memory overhead once. Since computers can be very fast at bit operations, my operations on strings of bits won’t be that slow, although the rest of the programming around this technique may slow things down. In Chapter 17, I’ll use a bit string to store a DNA strand. While the memory requirements of my program drop dramatically, I don’t see impressive speeds. Oh well; I’m always trading one benefit for another.
Since Perl 5.6, I can specify values directly in binary using
the 0b
notation. We
partially covered this in Chapter 2 of Learning
Perl:
my $value = 0b1; # same as 1, 0x01, 01 my $value = 0b10; # 2, 0x02, 02 my $value = 0b1000; # 8, 0x08, 010
I can use embedded underscores to make long binary values easier for me to read; Perl simply ignores them. A byte is a sequence of eight bits, and a nybble[54]is half of a byte:
my $value = 0b1010_0101 # by nybbles; my $value = 0b11110000_00001111 # by bytes my $value = 0b1111_0000___0000_1111 # by bytes and nybbles
Currently (and without hope for future inclusion), Perl does not
have a bin
built-in that acts like
oct
or hex
to interpret a number in a particular
base. I could write my own, and initially I did before Randal reminded
me that the oct
built-in handles
binary, octal, and hexidecimal conversions:
my $number = oct( "0b110" ); # 6
Of course, once I assign the value to the variable, Perl just
thinks of it as a number, which doesn’t have an inherent representation,
although Perl shows me values in the decimal representation unless I
specifically tell it to do something else. I can output values in binary
format with the %b
format specifier
to printf
or sprintf
. In this example, I preface the value
with the literal sequence 0b
just to
remind myself that I formatted the value in binary. All the ones and
zeros give me a hint, but other bases use those digits, too:
#!/usr/bin/perl my $value = 0b0011; printf "The value is 0b%b\n", $value;
In that example I had to prefix the value with 0b
myself. I can use a different sprintf
sequence to get it for free. By using
a hash symbol, #
, after the %
that starts the placeholder, Perl prefixes
the number with a string to indicate the base:[55]
my $number_string = printf '%#b', 12; # prints "0b1100"
I can get more fancy by specifying a width for the format. To
always get 32 places in my binary representation, I put that width
before the format specifier. I also add a leading 0 so that Perl fills
the empty columns with zeros. The literal 0b
adds two characters to the value, so the
total column width is 34:
printf "The value is 0b%034b\n", $value; printf "The value is %#034b\n", $value;
I use this sort of code quite a bit since I often need to convert
between bases, so I have some Perl one-liners to help me. I alias the
following one-liners to commands I can use in the bash shell (your shell
might do it differently). The d2h
converts from decimal to hexadecimal, the o2b
converts from octal to binary, and so on. These tiny scripts might come
in handy as you go through this chapter:
# for bash. your shell is probably different. alias d2h="perl -e 'printf qq|%X\n|, int( shift )'" alias d2o="perl -e 'printf qq|%o\n|, int( shift )'" alias d2b="perl -e 'printf qq|%b\n|, int( shift )'" alias h2d="perl -e 'printf qq|%d\n|, hex( shift )'" alias h2o="perl -e 'printf qq|%o\n|, hex( shift )'" alias h2b="perl -e 'printf qq|%b\n|, hex( shift )'" alias o2h="perl -e 'printf qq|%X\n|, oct( shift )'" alias o2d="perl -e 'printf qq|%d\n|, oct( shift )'" alias o2b="perl -e 'printf qq|%b\n|, oct( shift )'"
Perl’s binary operators do the same things they do in C and for the most part act like they do in the particular version of the C library with which your Perl was compiled. Whenever I need to work in binary and look something up I usually reach for my C book,[56]but mostly because that’s where I first learned it.
The unary NOT operator (sometimes called the complement operator), ~
, returns the bitwise negation, or 1’s
complement, of the value, based on integer size of the
architecture.[57]This means it doesn’t care what the sign of the numeric
value is; it just flips all the bits:
my $value = 0b1111_1111; my $complement = ~ $value; printf "Complement of\n\t%b\nis\n\t%b\n", $value, $complement;
I see that even though I gave it an 8-bit value, it comes back as a 32-bit value (because my MacBook has 32-bit integers):
Complement of 11111111 is 11111111111111111111111100000000
That’s not very nice output. I’d like the values to line up properly. To do that, I need the integer size. That’s easy enough to get from the Perl configuration, though (see Chapter 11). The integer size is in bytes, so I multiply by eight the value I get from Perl’s configuration:
#!/usr/bin/perl # complement.pl use Config; my $int_size = $Config{intsize} * 8; print "Int size is $int_size\n"; my $value = 0b1111_1111; my $complement = ~ $value; printf "Complement of\n\t%${int_size}b\nis\n\t%${int_size}b\n", $value, $complement;
Now my values line up properly, although I’d like it even more if I could see the leading zeros. You can figure that one out on your own:
Int size is 32 Complement of 11111111 is 11111111111111111111111100000000
I also have to be careful how I use the result I get from a unary
NOT. Depending on how I use it, I can get back different values. In the
next example I put the bitwise NOT value in $negated
. When I print $negated
with printf
, I see that I flipped all the bits, and
that the negative value is one greater in magnitude than the positive
one. That’s two’s complement thinking, and I won’t go into that here.
However, when I print the number with a plain ol’ print
, Perl treats it as an unsigned value, so
that bit flipping doesn’t do anything to the sign for the numbers that
started positive, and it makes negative numbers positive:
#!/usr/bin/perl # unary-not.pl foreach my $value ( 255, 128, 5, 65534 ) { my $negated = ~ $value; printf " value is %#034b %d\n", $value, $value; printf "~ value is %#034b %d\n", $negated, $negated; print " value is ", $negated, "\n\n"; }
This gives me output that can be confusing to those who don’t know what’s happening (which means that I shouldn’t use this liberally if I want the next programmer to be able to figure out what’s going on):
value is 0b00000000000000000000000011111111 255 ~ value is 0b11111111111111111111111100000000 -256 value is 4294967040 value is 0b00000000000000000000000010000000 128 ~ value is 0b11111111111111111111111101111111 -129 value is 4294967167 value is 0b00000000000000000000000000000101 5 ~ value is 0b11111111111111111111111111111010 -6 value is 4294967290 value is 0b00000000000000001111111111111110 65534 ~ value is 0b11111111111111110000000000000001 -65535 value is 4294901761
The ~
operator also lives near
the top of the precedence chart, so it’s going to do its work before
most other operators have a chance to do theirs. Be careful with
that.
What if I don’t want all of those bits in my previous examples? I’m
stuck with Perl’s integer size, but I can use a bit mask to get rid of
the excess, and that brings me to the next operator, bitwise AND,
&
.
The bitwise AND operator returns the bits set in both first and second arguments. If either value has a 0 in that position, the result has a zero in that position, too. Or, the result has a 1 in the same position only where both arguments have a 1. Usually the second argument is called a mask since its 0s hide those positions in the first argument:
1010 value & 1101 mask ------ 1000
This lets me select the parts of a bit field that interest me. In
the previous section, I used the ~
to
take the complement of an 8-bit value but got back a 32-bit value. If I
wanted only the last eight bits, I could use &
with a value that has the bits set in
only the lowest byte:
my $eight_bits_only = $complement & 0b1111_1111;
I can do this with the hexadecimal representation to make it
easier to read. The value 0xFF
represents a byte with all bits set, so I can use that as the mask to
hide everything but the lowest byte:
my $eight_bits_only = $complement & 0xFF;
This is also useful to select just the bits I need from a number.
For instance, the Unix file mode that I get back from stat
contains the owner, group, and other
permissions encoded into two bytes. Each of the permissions gets a
nybble, and the high nybble has various other information. To get the
permissions, I just have to know (and use) the right bit masks. In this
case, I specify them in octal, which corresponds to the representation I
use for chmod
and mkdir
(either in Perl or on the command
line):
my $mode = ( stat($file) )[2]; my $is_group_readable = $mode & 040; my $is_group_writable = $mode & 020; my $is_group_executable = $mode & 010;
I don’t like all of those magic number bit masks, though, so I can make them into constants (again, see Chapter 11):
use constant GROUP_READABLE => 040; use constant GROUP_WRITABLE => 020; use constant GROUP_EXECUTABLE => 010; my $mode = ( stat($file) )[2]; my $is_group_readable = $mode & GROUP_READABLE; my $is_group_writable = $mode & GROUP_WRITABLE; my $is_group_executable = $mode & GROUP_EXECUTABLE;
I don’t even have to do that much work, though, because these
already have well-known constants in the POSIX
module. The fcntl_h
export tag gives
me the POSIX constants for file permission masks. Can you tell which one
does what just by looking at them?
#!/usr/bin/perl # posix-mode-constants.pl use POSIX qw(:fcntl_h); # S_IRGRP S_IROTH S_IRUSR # S_IWGRP S_IWOTH S_IWUSR # S_IXGRP S_IXOTH S_IXUSR # S_IRWXG S_IRWXO S_IRWXU # S_ISGID S_ISUID my $mode = ( stat( $ARGV[0] ) )[2]; print "Group readable\n" if $mode & S_IRGRP; print "Group writable\n" if $mode & S_IWGRP; print "Group executable\n" if $mode & S_IXGRP;
The bitwise OR operator, |
,
returns the bits set in either (or both) operand. If a position in
either argument has the bit set, the result has that bit set.
1010 | 1110 ------ 1110
I often use these to combine values, and you may have already been
using them with operators such as sysopen
and flock
. Those built-ins take an argument that
constrains (or allows) their action, and I build up those values by
OR-ing values. Each of the values specifies a setting. The result is the
combination of all of the settings.
The third argument to sysopen
is its mode. If I knew the bit values for the mode settings, I could use
them directly, but they might vary from system to system. I use the
values from Fcntl
instead. I used this in Chapter 3 to limit what my file open can do:
#!/usr/bin/perl -T use Fcntl (:DEFAULT); my( $file ) = $ARGV[0] =~ m/([A-Z0-9_.-]+)/gi; sysopen( my( $fh ), $file, O_APPEND | O_CREAT ) or die "Could not open file: $!\n";
For file locking, I OR the settings I want to get the right
effect. The Fcntl
module supplies the values as
constants. In this example, I open a file in read/write mode and
immediately try to get a lock on the file. I pass the combination on
exclusive lock, LOCK_EX
, and
nonblocking lock, LOCK_NB
, so if I
can’t get the lock right away it dies. By OR-ing those constants, I form
the right bit pattern to send to flock
:
use Fcntl qw(:flock); open my($fh), '<+', $file or die "Connot open: $!"; flock( $fh, LOCK_EX | LOCK_NB ) or die "Cannot lock: $!"; ...; close $fh; # don't unlock, just close!
Without the LOCK_NB
, my program
would sit at the flock
line waiting
to get the lock. Although I simply exited the program in this example, I
might want to sleep
for a bit and try
again, or do something else until I can get the lock.
The bitwise XOR operator, ^
,
returns the bits set in either, but not both, operands.
That’s the part that makes it exclusive. If a position in either
argument has the bit set, the result has the bit set, but only if the
same position in the other argument doesn’t have the bit set. That is,
that bit can only be set in one of the arguments for it to be set in the
result:
1010 ^ 1110 ------ 0100
The bitwise operators also work on strings, although I’m not sure why anyone would ever want to do that outside of an Obfuscated Perl Contest. I’ll show one interesting example, good for quiz shows and contests, but leave the rest up to you. It’s all in perlop.
So, knowing that, what’s the difference between “perl” and “Perl”?
$ perl -e 'printf "[%s]\n", ("perl" ^ "Perl")' [ ]
Okay, that’s a bit hard to see so I’ll use ord
to translate that into its ASCII
value:
$ perl -e 'printf "[%d]\n", ord("perl" ^ "Perl")' [32]
It’s the space character! The ^
masks all of the positions where the bits are set in both strings, and
only the first character is different. It turns out that they differ in
exactly one bit.
I want to see the bit patterns that led to this. The ord
built-in returns the numeric value that I
format with %b
:
$ perl -e 'printf "[%#10b]\n", ord("perl" ^ "Perl")' [0b00100000]
How do I get that value? First, I get the values for the upper- and lowercase versions of the letter P:
$ perl -e 'printf "[%#10b]\n", ord( shift )' P [0b01010000] $ perl -e 'printf "[%#10b]\n", ord( shift )' p [0b01110000]
When I XOR those, I get the bits that are set in only one of the characters. The lowercase characters in ASCII have the same bit values except for the bit 5, putting all the lowercase characters 32 positions above the uppercase ones:
0101_0000 ^ 0111_0000 ---------- 0010_0000
This leads to the perlfaq1 answer that there is only one bit of difference between “perl” and “Perl”.[58]
The bit-shift operators move the entire bit field either to the left,
using <<
, or to the right,
using >>
, and fill in the
vacancies with zeros. The arrows point in the direction I’m shifting,
and the most significant bit (the one that represents the greatest
value) is on the left:
my $high_bit_set = 1 << 8; # 0b1000_0000 my $second_byte = 0xFF << 8; # 0x00_00_FF_00
The bit-shift operators do not wrap values to the other end, although I could write my own subroutine to do that. I’ll just have to remember the parts I’m about to push off the end and add them to the other side. The length of my values depends on my particular architecture, just as I discussed earlier.
I use the bit-shift operator with the return value from system
, which is two bytes (or whatever the libc version of
wait
returns). The low byte has
signal and core information, but it’s the high byte that I actually want
if I need to see the exit value of the external command. I simply shift
everything to the right eight positions. I don’t need to mask the value
since the low byte disappears during the shift:
my $rc = system( 'echo', 'Just another perl hacker, ' ); my $exit_status = $rc >> 8;
I don’t need to save the return value from system
because Perl puts it in the special
variable $?
:
system( 'echo', 'Just another perl hacker, ' ); my $exit_status = $? >> 8;
I can also inspect $?
to see
what went wrong in case of an error. I have to know the proper
masks:
my $signal_id = $? & 0b01111111; # or 0177, 127, 0x7F my $dumped_core = $? & 0b10000000; # or 0200, 128, 0x80
Bit vectors can save memory by using a single scalar to hold many
values. I can use a long string of bits to store the values instead of
using an array of scalars. Even the empty scalar takes up some memory; I
have to pay for that scalar overhead with every scalar I create.
Using Devel::Size
, I can look at the size
of a scalar:
#!/usr/bin/perl # devel-size.pl use Devel::Size qw(size); my $scalar; print "Size of scalar is " . size( $scalar ) . " bytes\n";
On my MacBook running Perl 5.8.8, this scalar takes up 12 bytes, and it doesn’t even have a value yet!
Size of scalar is 12 bytes.
I could use Devel::Peek
to see some of
this:
#!/usr/bin/perl # devel-peek.pl use Devel::Peek; my $scalar; print Dump( $scalar );
The output shows me that Perl has already set up some infrastructure to handle the scalar value:
SV = NULL(0x0) at 0x1807058 REFCNT = 1 FLAGS = (PADBUSY,PADMY)
Even with nothing in it, the scalar variable has a reference count and the scalar flags. Now, imagine an array of several hundred or thousand scalar values, each with their own scalar overhead. That’s a lot of memory before I even think about the values.
I don’t need to use Perl’s arrays to store my data. If I have enough data and another way to store it and then access it, I can save a lot of memory by avoiding the Perl variable overhead.
The easiest thing I can do is use a long string where each character
(or other number of characters) represents an element. I’ll pretend that
I’m working with DNA (the biological sort, although you should probably
use BioPerl
for this sort of thing), and
I’ll use the letters T, A, C, and G to represent the base pairs that make
up the DNA strand (I do this in Chapter 17 where I talk
about tied variables). Instead of storing the sequence as an array of
scalars each holding one character (or even objects representing that
base), I store them as sequential characters in a single string where I
only get the scalar overhead once:
my $strand = 'TGACTTTAGCATGACAGATACAGGTACA';
I can then access the string with substr()
, which I give a starting position and a
length:
my $codon = substr( $strand, 3, 3 );
I can even change values since I can use substr()
as an lvalue:
substr( $strand, 2, 3 ) = 'GAC';
Of course, I can hide these operations behind functions, or I can even make an object out of the string and call methods on it to get or change the parts I want.
One step up the sophistication ladder is pack()
(see Chapter 14), which
does much of the same thing but with much more flexibility. I can shove
several different types into a string and pull them out again. I’ll skip
the example and refer you to the Tie::Array::PackedC
module, which stores a series of integers (or doubles) as a packed
string instead of their numerical and possibly string values in separate
scalar variables.
A bit vector does the same thing as the single string or the packed
string. In one scalar value, it stores several values. Just like in my DNA
example, or the stuff that pack()
does,
it’s up to me how I partition that bit vector and then represent the
values.
The built-in vec()
function treats a string as a bit vector. It divides the string into
elements according to the bit size I specify, although that number has to
be a power of two. It works in the same sense that substr()
works on a string by pulling out part
of it, although vec
only works with one
“element” at a time.
I can use any string that I like. In this example I use 8
for the bit size, which corresponds to
(single-byte) characters:
#!/usr/bin/perl # vec-string.pl my $extract = vec "Just another Perl hacker,", 3, 8; printf "I extracted %s, which is the character '%s'\n", $extract, chr($extract);
From the output, I see that $extract
is the number, and I need to use
chr
to turn it back into its character
representation:
I extracted 116, which is the character 't'
I can also start from scratch to build up the string. The vec
function is an lvalue so I can assign to it.
As with other things in Perl, the first element has the index 0. Since
vec
is dealing with bit fields, to
replace the lowercase p in the string with its
uppercase version, I need to use ord
to
get the numeric version I’ll assign to vec
:
my $bit_field = "Just another perl hacker,"; vec( $bit_field, 13, 8 ) = ord('P'); print "$bit_field\n"; # "Just another Perl hacker,"
I showed earlier that there is only one bit of difference between “perl” and “Perl.” I don’t need to change the entire character; I could just assign to the right bit:[59]
my $bit_field = "Just another perl hacker,"; vec( $bit_field, 109, 1 ) = 0; print "$bit_field\n"; # "Just another Perl hacker,"
When using vec
on a string, Perl
treats it as a byte string, tossing away any other encoding that the
string may have had. That is, vec
can
operate on any string, but it turns it into a byte string. That’s a good
reason not use vec
to play with strings
that I want to use as strings:
#!/usr/bin/perl # vec-drops-encoding.pl use Devel::Peek; # set the UTF-8 flag by including unicode sequence my $string = "Has a unicode smiley --> \x{263a}\n"; Dump( $string ); # keeps the UTF-8 flag on access print STDERR "-" x 50, "\n"; my $first_char = vec( $string, 0, 8 ); Dump( $string ); # loses the UTF-8 flag on assignment print STDERR "-" x 50, "\n"; vec( $string, 0, 8 ) = ord('W'); Dump( $string );
The progression of the Devel::Peek
output shows that I can create a string with the UTF8
flag. As raw bytes, I get the three bytes
\342\230\272
but Perl knows that is a
Unicode code point because of the encoding:
SV = PV(0x1801460) at 0x1800fb8 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8) PV = 0x401b10 "Has a unicode smiley --> \342\230\272\n"\0 [UTF8 "Has a unicode smiley --> \x{263a}\n"] CUR = 29 LEN = 32
I can use vec
to extract part of
the string without affecting the UTF8
flag. Simply accessing the string through vec
does set some magic on the variable, but
it’s still UTF8
:
-------------------------------------------------- SV = PVMG(0x180aca0) at 0x1800fb8 REFCNT = 1 FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8) IV = 0 NV = 0 PV = 0x401b10 "Has a unicode smiley --> \342\230\272\n"\0 [UTF8 "Has a unicode smiley --> \x{263a}\n"] CUR = 29 LEN = 32 MAGIC = 0x4059d0 MG_VIRTUAL = &PL_vtbl_utf8 MG_TYPE = PERL_MAGIC_utf8(w) MG_LEN = 27
Finally, once I change the string through vec
, Perl treats it as a simple series of bytes.
When I change the initial H
to a
W
, Perl forgets all about the encoding.
It’s up to me to provide the context and meaning of the bits once I use it
as a bit vector. If I want to keep the string value, I should do something
else:
-------------------------------------------------- SV = PVMG(0x180aca0) at 0x1800fb8 REFCNT = 2 FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK) IV = 0 NV = 0 PV = 0x401b10 "Was a unicode smiley --> \342\230\272\n"\0 CUR = 29 LEN = 32 MAGIC = 0x4059d0 MG_VIRTUAL = &PL_vtbl_utf8 MG_TYPE = PERL_MAGIC_utf8(w) MG_LEN = -1
The actual storage gets a bit tricky, so making a change and then inspecting the scalar I use to store everything, it may seem like the wrong thing is happening. Perl actually stores the bit vector as a string, so on inspection, I most likely see a lot of nonsense:
#!/usr/bin/perl # vec-wacky-order.pl { my @chars = qw( a b c d 1 2 3 ); my $string = ''; for( my $i = 0; $i < @chars; $i++ ) { vec( $string, $i, 8 ) = ord( $chars[$i] ); } print "\@chars string is ---> [$string]\n"; } #------ { my @nums = qw( 9 2 3 12 15 ); my $string = ''; for( my $i = 0; $i < @nums; $i++ ) { vec( $string, $i, 4 ) = 0 + $nums[$i]; } print "\@nums string is ---> [$string]\n"; my $bit_string = unpack( 'B*', $string ); $bit_string =~ s/(....)(?=.)/${1}_/g; print "\$bit_string is ---> [ $bit_string ]\n"; }
With eight bits per element, Perl uses one byte for each element. That’s pretty easy to understand, and nothing tricky happens. The first element in the bit vector is the first byte, the second is the second byte, and so on. The first part of my program creates a string I can recognize, and I see the characters in the order I added them:
@chars string is ---> [abcd123]
The second part of the program is different. I set the bit size to 4 and add several numbers to it. As a string it doesn’t look anything like its elements, but when I look at the bit pattern I can make out my four-bit numbers, although not in the order I added them, and with an apparent extra one:
@nums string is ---> [)√] $bit_string is ---> [ 0010_1001_1100_0011_0000_1111 ] 2 9 12 3 0 15
If I use one, two, or four bits per element, Perl still treats the
string as bytes, but then orders the bits in a little-endian fashion.
I’ll use the alphabet to illustrate the sequence again, this time for
two bytes each. The proper order of the elements is A, B, C, D, but
vec
starts with the lower part of
each byte, which is to the right, and fills the byte up towards the left
before moving to the next byte:
4 bits: B A 2 bits: D C B A 1 bit: H G F E D C B A
I wrote a little program to illustrate the ordering of the
elements. For each of the bit lengths, I get the index of the last
element (counting from zero) as well as the bit pattern of all the bits
on for that bit length by using the oct
function (although I have to remember to
tack on the “0b” to the front). When I run this program, I’ll see a line
that shows the bit field and a line right under it to show the actual
storage:
#!/usr/bin/perl # vec-4bits.pl foreach my $bit_length ( qw( 4 2 1 ) ) { print "Bit length is $bit_length\n"; my $last = (16 / $bit_length) - 1; my $on_bits = oct( "0b" . "1" x $bit_length ); foreach my $index ( 0 .. $last ) { my $string = "\000\000"; vec( $string, $index, $bit_length ) = $on_bits; printf "%2d: ", $index; print show_string( $string ), "\n ", show_ord( $string ), "\n"; } print "\n"; } sub show_string { unpack( "b*", $_[0] ); } sub show_ord { my $result = ''; foreach my $byte ( split //, $_[0] ) { $result .= sprintf "%08b", ord($byte); } $result; }
If I really need to see the bit vector in ones and zeros, I can
use unpack
with the b
format. This orders the bits in the way I
would naturally expect, instead of the tricky order I showed when using
a bit length less than 8 with vec
:
$bit_string = unpack( "b*" , $bit_vector);
I really don’t need to worry about this, though, as long as I use
vec
to both access and store the
values and use the same number of bits each time.
In my earlier DNA example, I had four things to store ( T, A, C, G ). Instead of using a whole character (eight bits) to store each one of those as I did previously, I can use just two bits. In this example, I turn a 12-character string into a bit vector that is only 3 bytes long:
my %bit_codes = ( T => 0b00, A => 0b11, C => 0b10, G => 0b01, ); # add the reverse mapping too @bit_codes{values %bit_codes} = keys %bit_codes; use constant WIDTH => 2; my $bits = ''; my @bases = split //, 'CCGGAGAGATTA'; foreach my $i ( 0 .. $#bases ) { vec( $bits, $i, WIDTH ) = $bit_codes{ $bases[$i] }; } print "Length of string is " . length( $bits ) . "\n";
That’s my bit vector of 12 elements, and now I want to pull out
the third element. I give vec()
three arguments:
the bit vector, the number of the element, and the width in bits of each
element. I use the value that vec()
returns to look up the base symbol in the hash, which maps both
ways:
my $base = vec $bits, 2, WIDTH; printf "The third element is %s\n", $bit_codes{ $base };
I could get more fancy by using four bits per element and using each bit to represent a base. That might seem like a waste of the other three bits, which should be turned off if I know the base already, but sometimes I don’t know the base. I might, for instance, only know that it’s not A, so it might be any of the others. Bioinformaticists have other letters to represent these cases (in this case, B, meaning “not A”), but I don’t need that right now.
In “Generating Sudoku” in The Perl Review, Eric Maki uses bit vectors to represent possible solution states to a Sudoku puzzle. He represents each puzzle row with nine bits, one for each square, and turns on a bit when that square has a value. A row might look like:
0 0 0 1 0 1 1 0 0
For each of the 9 rows in the puzzle, he adds another 9 bits, ending up with a bit string 81 bits long for all of the squares. His solution is a bit more complicated than that, but I’m just interested in the bit operations right now.
It’s very easy for him to check a candidate solution. Once any square has a value, he can eliminate all of the other solutions that also have a value in that square. He doesn’t have to do a lot of work to do that, though, because he just uses bit operations.
He knows which solutions to eliminate since a bitwise AND of the candidate row and the current solution have at least one bit in common. The pivot row is the one from the current solution that he compares to the same row in other candidate solutions. In this example, the rows have a bit in common. The result is a true value, and as before, I don’t need to do any shifting because I only need to know that the result is true, so the actual value is unimportant to me. Let me get to that in a minute:
0 0 1 0 0 0 1 0 0 # candidate row & 0 0 0 1 0 1 1 0 0 # pivot row -------------------- 0 0 0 0 0 0 1 0 0 # bit set, eliminate row
In another case, the candidate row has no bits in common with the same row from the current solution, so an AND gives back all zeros:
0 1 0 0 1 0 0 0 1 # still a candidate row & 0 0 0 1 0 1 1 0 0 # pivot row -------------------- 0 0 0 0 0 0 0 0 0 # false, still okay
I have to be careful here! Since vec()
uses strings, and all strings except “0”
are true (including “00” and so on), I can’t immediately decide based on
the string value if it’s all zeros.
Eric uses bit operations for more than just puzzle solving, though. He also keeps track of all the rows he’s no longer considering. In all, there are 93 placement possibilities, and he stores that as a bit vector. Each bit is a candidate row, although if he sets a bit, that row is no longer a candidate. The index of that bit maps into an array he keeps elsewhere. By turning off rows in his bit mask, he doesn’t have to remove elements from the middle of his data structure, saving him a lot of time Perl would otherwise spend dealing with data structure maintenance. In this case, he uses a bit vector to save on speed, but uses more memory.
Once he knows that he’s going to skip a row, he sets that bit in the
$removed
bit vector:
vec( $removed, $row, 1 ) = 1;
When he needs to know all of the candidate rows still left, that’s just the bitwise negation of the removed rows. Be careful here! You don’t want the binding operator by mistake:
$live_rows = ( ~ $removed );
Although Perl mostly insulates me from the physical details of computers, sometimes I still have to deal with them when the data comes to me packed into bytes. Or, if Perl’s data structures take up too much memory for my problem, I might want to pack my data into bit strings to escape the Perl memory penalty. Once I have the bits, I work with them in mostly the same way I would in other languages.
The perlop documentation shows the bitwise
operators. The perlfunc documentation covers the
built-in function vec
.
Mark Jason Dominus demonstrates proper file locking and the
Fcntl
module in the slides to his “File Locking Tricks
and Traps” talk. There’s plenty of the bitwise OR operator in the
discussion (http://perl.plover.com/yak/flock/).
Eric Maki wrote “Generating Sudoku” for The Perl
Review 2.2 (Spring 2006) and used vec
to keep track of the information without
taking up much memory.
I wrote “Working with Bit Vectors” for The Perl Review 2.2 (Spring 2006) to complement Eric’s article on Sudoku. That article formed the basis of this chapter, although I greatly expanded it here.
Maciej Ceglowski writes about “Bloom Filters” for Perl.com. Bloom filters hash data to store its keys without storing the values, which makes heavy use of bit operations (http://www.perl.com/lpt/a/2004/04/08/bloom_filters.html).
If vec
and Perl’s bit operations
aren’t enough for you, take a look at Stephen Breyer’s
Bit::Vector
module. It allows for bit vectors with
arbitrary element size.
Randal Schwartz wrote “Bit Operations” for Unix Review, January 1998: http://www.stonehenge.com/merlyn/UnixReview/col18.html.
[54] Isn’t that cute how they misspelled both “bite” and “nibble”?
[55] This works for the other bases, too, so a %#x
gets 0x
and %#o
gets 0
. If the number is 0, however, it doesn’t
really matter which base its in so Perl doesn’t give it any
prefix.
[56] That would be the Waite Group’s New C Primer Plus. They had a C++ book, too, and I called it the “New C Plus Plus Primer Plus.” Last time I looked you could still buy these used on Amazon for under a dollar.
[57] This is one of the few places in Perl where the underlying architecture shows through. This depends on the integer size of your processor.
[58] Although it also explains that we typically use “perl” to refer to the actual binary program and “Perl” for everything else.
[59] How did I know the right bit? I’m lazy. I used
foreach
my
$bit
(
100
..
116
)
and
chose the one that worked.
Get Mastering Perl now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.