Chapter 4. Web-Oriented Data Encoding
In the field of observation, chance favors only the prepared mind.
Even though web applications have all sorts of different purposes, requirements, and expected behaviors, there are some basic technologies and building blocks that show up time and again. If we learn about those building blocks and master them, then we will have versatile tools that can apply to a variety of web applications, regardless of the application’s specific purpose or the technologies that implement it.
One of these fundamental building blocks is data encoding. Web applications ship data back and forth from the browser to the server in myriad ways. Depending on the type of data, the requirements of the system, and the programmer’s particular preferences, that data might be encoded or packaged in any number of different formats. To make useful test cases, we often have to decode the data, manipulate it, and reencode it. In particularly complicated situations, you may have to recompute a valid integrity check value, like a checksum or hash. The vast majority of our tests in the web world involve manipulating the parameters that pass back and forth between a server and a browser, but we have to understand how they are packed and shipped before we can manipulate them.
In this chapter, we’ll talk about recognizing, decoding, and encoding several different formats: Base 64, Base 36, Unix time, URL encoding, HTML encoding, and others. This is not so much meant to be a reference for these formats (there are plenty of good references). Instead, we will help you know it when you see it and manipulate the basic formats. Then you will be able to design test data carefully, knowing that the application will interpret your input in the way you expect.
The kinds of parameters we’re looking at appear in lots of independent places in our interaction with a web application. They might be hidden form field values, GET parameters in the URL, or values in the cookie. They might be small, like a 6-character discount code, or they might be large, like hundreds of characters with an internal composite structure. As a tester, you want to do boundary case testing and negative testing that addresses interesting cases, but you cannot figure out what is interesting if you don’t understand the format and use of the data. It is difficult to methodically generate boundary values and test data if you do not understand how the input is structured. For example, if you see dGVzdHVzZXI6dGVzdHB3MTIz in an HTTP header, you might be tempted to just change characters at random. Decoding this with a Base-64 decoder, however, reveals the string testuser:testpw123. Now you have a much better idea of the data, and you know how to modify it in ways that are relevant to its usage. You can make test cases that are valid and carefully targeted at the application’s behavior.
4.1. Recognizing Binary Data Representations
Problem
You have decoded some data in a parameter, input field, or data file and you want to create appropriate test cases for it. You have to determine what kind of data it is so that you can design good test cases that manipulate it in interesting ways.
We will consider these kinds of data:
Hexadecimal (Base 16)
Octal (Base 8)
Base 36
Solution
Hexadecimal data
Hexadecimal characters, or Base-16 digits, are the numerical digits 0–9 and the letters A–F. You might see them in all uppercase or all lowercase, but you will rarely see the letters in mixed case. If you have any letters beyond F in the alphabet, you’re not dealing with Base 16.
Although this is fundamental computer science material, it bears repeating in the context of testing. Each individual byte of data is represented by two characters in the output. A few things to note that will be important: 00 is 0 is NULL, etc. That’s one of our favorite boundary values for testing. Likewise, FF is 255, or −1, depending on whether it’s an unsigned or signed value. It’s our other favorite boundary value. Other interesting values include 20, which is the ASCII space character, and 41, which is ASCII for uppercase A. There are no common, printable ASCII characters above 7F.
In most programming languages, hexadecimal values can be distinguished by the prefix 0x in front of them. If you see 0x24, your first instinct should be to treat it as a hexadecimal number. Another common way of representing hexadecimal values is with colons between individual bytes. Network MAC addresses, SNMP MIB values, X.509 certificates, and other protocols and data structures that use ASN.1 encoding frequently do this. For example, a MAC address might be represented: 00:16:00:89:0a:cf. Note that some programmers will omit unnecessary leading zeros. So the above MAC address could be represented: 0:16:0:89:a:cf. Don’t let the fact that some of the data are single digits persuade you that it isn’t a series of hexadecimal bytes.
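The points above can be checked with a few lines of code. What follows is a minimal Python sketch (Python is our choice for illustration, not one of the tools this book relies on). The helper name parse_hex_bytes is hypothetical. It accepts either colon-separated bytes, where leading zeros may be missing, or an unbroken hexadecimal string with an optional 0x prefix.

```python
def parse_hex_bytes(text):
    """Parse hex bytes written as '0:16:0:89:a:cf' or as one unbroken string."""
    if ":" in text:
        # Each colon-separated field is one byte; leading zeros may be omitted.
        return bytes(int(field, 16) for field in text.split(":"))
    if text.lower().startswith("0x"):
        text = text[2:]
    return bytes.fromhex(text)


mac_full = parse_hex_bytes("00:16:00:89:0a:cf")
mac_short = parse_hex_bytes("0:16:0:89:a:cf")
print(mac_full == mac_short)  # both spellings are the same six bytes

# The same byte, FF, read as an unsigned and as a signed value.
unsigned_ff = int.from_bytes(b"\xff", "big", signed=False)
signed_ff = int.from_bytes(b"\xff", "big", signed=True)
print(unsigned_ff, signed_ff)
```

Seeing both spellings of the MAC address collapse to the same bytes is a useful reminder that the single digits are still whole bytes.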
Octal data
Octal encoding (Base 8) is somewhat rare, but it comes up from time to time. Unlike some of the other Bases (16, 64, 36), this one uses fewer than all 10 digits and uses no letters at all. The digits 0 to 7 are all that are used. In programming, octal numbers are frequently represented by a leading zero, e.g., 017 is the same as 15 decimal or 0F hexadecimal. Don’t assume octal, however, if you see leading zeroes. Octal is too rare to assume just on that evidence alone. Leading zeroes typically indicate a fixed field size and little else. The key distinguishing feature of octal data is that the digits are all numeric with none greater than 7. Of course, 00000001 fits that description but is probably not octal. In fact, this decoding could be anything, and it doesn’t matter. 1 is 1 is 1 in any of these encodings!
Base 36
Base 36 is rather an unusual hybrid between Base 16 and Base 64. Like Base 16, it begins at 0 and carries on into the alphabet after reaching 9. It does not stop at F, however. It includes all 26 letters up to Z. Unlike Base 64, however, it does not distinguish between uppercase and lowercase letters and it does not include any punctuation. So, if you see a mixture of letters and numbers, and all the letters are the same case (either all upper or all lower), and there are letters in the alphabet beyond F, you’re probably looking at a Base-36 number.
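The recognition rules in this recipe boil down to checking which character set a value fits. Here is a rough Python sketch of that idea. The function name candidate_bases and the decision to reject mixed-case values are our own assumptions for illustration. The result is only a list of possibilities, because a value such as 017 fits all three bases.

```python
import string

HEX_DIGITS = set(string.hexdigits)      # 0-9, a-f, A-F
OCTAL_DIGITS = set(string.octdigits)    # 0-7
ALPHANUMERIC = set(string.ascii_letters + string.digits)


def candidate_bases(value):
    """Return the bases (8, 16, 36) whose character set fits the string."""
    chars = set(value)
    # Mixed upper- and lowercase letters point away from Base 16 and Base 36.
    mixed_case = value != value.upper() and value != value.lower()
    candidates = []
    if chars and chars <= OCTAL_DIGITS:
        candidates.append(8)
    if chars and chars <= HEX_DIGITS and not mixed_case:
        candidates.append(16)
    if chars and chars <= ALPHANUMERIC and not mixed_case:
        candidates.append(36)
    return candidates


for sample in ["017", "0a9f", "9FFGK4H", "dGVzdA"]:
    print(sample, candidate_bases(sample))
```

An empty list for a mixed-case value like dGVzdA is a hint to look at Base 64 instead (see Recipe 4.2).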
4.2. Working with Base 64
Problem
Base 64 fills a very specific niche: it encodes binary data that is not printable or safe for the channel in which it is transmitted. It encodes that data into something relatively opaque and safe for transmission using just alphanumeric characters and some punctuation. You will encounter Base 64 wrapping most complex parameters that you might need to manipulate, so you will have to decode, modify, and then reencode them.
Solution
Install OpenSSL in Cygwin (if you’re using Windows) or make sure you have the openssl command if you’re using another operating system. Virtually all distributions of Linux and Mac OS X include OpenSSL.
Discussion
You will see Base 64 a lot. It shows up in many HTTP headers (e.g., the Authorization: header), and many cookie values are Base 64-encoded. Many applications encode complex parameters with Base 64 as well. If you see encoded data, especially with equals characters at the end, think Base 64.
Notice the -n after the echo command. This prevents echo from appending a newline character on the end of the string that it is provided. If that newline character is not suppressed, then it will become part of the output. Example 4-1 shows the two different commands and their respective output.
    % echo -n '&a=1&b=2&c=3' | openssl base64 -e    # Right.
    JmE9MSZiPTImYz0z
    % echo '&a=1&b=2&c=3' | openssl base64 -e       # Wrong.
    JmE9MSZiPTImYz0zCg==
This is also a danger if you insert your binary data or raw data in a file and then use the -in option to encode the entire file. Virtually all editors will put a newline on the end of the last line of a file. If that is not what you want (because your file contains binary data), then you will have to take extra care to create your input.
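If you would rather avoid the shell’s newline handling entirely, the same encoding can be done in a script. The following is a small Python sketch using the standard base64 module. It is offered as an alternative to the OpenSSL commands, not a replacement for them. It encodes the same string with and without a trailing newline so you can see the difference.

```python
import base64

data = b"&a=1&b=2&c=3"

encoded = base64.b64encode(data).decode("ascii")
encoded_with_newline = base64.b64encode(data + b"\n").decode("ascii")

print(encoded)               # JmE9MSZiPTImYz0z
print(encoded_with_newline)  # JmE9MSZiPTImYz0zCg==

# Decoding shows exactly which bytes were encoded, newline included.
print(base64.b64decode(encoded_with_newline))
```

Because the input is an explicit byte string, nothing is appended unless you add it yourself.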
You may be surprised to see us using OpenSSL for this, when clearly there is no SSL or other encryption going on. The openssl command is a bit of a Swiss Army knife. It can perform many operations, not just cryptography.
Recognizing Base 64
Base-64 characters include the entire alphabet, upper- and lowercase, as well as the ten digits 0–9. That gives us 62 characters. Add in plus (+) and solidus (/) and we have 64 characters. The equals sign is also part of the set, but it will only appear at the end. Base-64 encoding will always contain a number of characters that is a multiple of 4. Every 3 bytes of input become 4 characters of output, so if the input length is not a multiple of 3 bytes, one or two equals (=) will be added to the end to pad the output to a multiple of 4. Thus, you will see at most 2 equals: none, 1, or 2. The hallmark of Base 64 is the trailing equals. Failing that, it is also the only encoding discussed in this chapter that uses a mixture of both upper- and lowercase letters.
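These recognition rules can be turned into a quick check. The Python sketch below is a heuristic only. The name looks_like_base64 is hypothetical, and the check will also accept some strings that merely happen to fit (an 8-character hexadecimal value such as deadbeef, for example). It tests the alphabet, the multiple-of-4 length, and whether the value actually decodes.

```python
import base64
import binascii
import re

BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def looks_like_base64(value):
    """Heuristic: right alphabet, length a multiple of 4, and decodable."""
    if not value or len(value) % 4 != 0:
        return False
    if not BASE64_PATTERN.fullmatch(value):
        return False
    try:
        base64.b64decode(value, validate=True)
    except binascii.Error:
        return False
    return True


# Padding depends on how many bytes are left over after groups of three.
for raw in [b"a", b"ab", b"abc"]:
    print(raw, base64.b64encode(raw).decode("ascii"))

print(looks_like_base64("dGVzdHVzZXI6dGVzdHB3MTIz"))
print(looks_like_base64("%3cscript%3e"))
```

Treat a positive answer as a reason to try decoding, not as proof of the format.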
Warning
It is important to realize that Base 64 is an encoding. It is not encryption (since it can be trivially reversed with no special secret necessary). If you see important data (e.g., confidential data, security data, program control data) Base-64-encoded, just treat it as if it were totally exposed and in the clear—because it is. Given that, put on your hacker’s black hat and ask yourself what you gain by knowing the data that is encoded.
Note also that there is no compression in Base 64. In fact, the encoded data is guaranteed to be larger than the unencoded input. This can be an issue in your database design, for example. If your program changes from storing raw user IDs (that, say, have a maximum size of 8 characters) to storing Base-64-encoded user IDs, you will need 12 characters to store the result. This might have ripple effects throughout the design of the system—a good place to test for security issues!
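The size increase is easy to predict: every 3 bytes of input, rounded up, become 4 characters of output. As a small Python sketch (the function name base64_length is our own), the arithmetic for the 8-character user ID above looks like this:

```python
import base64
import math


def base64_length(raw_length):
    """Number of characters Base 64 produces for raw_length input bytes."""
    return 4 * math.ceil(raw_length / 3)


print(base64_length(8))                      # storage needed for an 8-byte ID
print(len(base64.b64encode(b"x" * 8)))       # the real encoder agrees
```

Running the same calculation over every field width in a schema is a cheap way to find columns that will truncate encoded values.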
Other tools
We showed OpenSSL in this example because it is quick, lightweight, and easily accessible. If you have CAL9000 installed, it will also do Base-64 encoding and decoding easily. Follow the instructions in Recipe 4.5, but select “Base 64” as your encoding or decoding type. You still have to watch out for accidentally pasting newlines into the input boxes.
There is a MIME::Base64 module for Perl. It has shipped with the core Perl distribution since Perl 5.8, and you’ll certainly have it if you use the LibWWWPerl module we discuss in Chapter 8.
4.3. Converting Base-36 Numbers in a Web Page
Problem
You need to encode and decode Base-36 numbers and you don’t want to write a script or program to do that. This is probably the easiest way if you just need to convert occasionally.
Solution
Brian Risk has created a demonstration website at http://www.geneffects.com/briarskin/programming/newJSMathFuncs.html that performs arbitrary conversions from one base to another. You can go back and forth from Base 10 to Base 36 by specifying the two bases in the page. Figure 4-1 shows an example of converting a large Base-10 number to Base 36. To convert from Base 36 to Base 10, simply swap the 10 and the 36 in the web page.
Discussion
Just because this is being done in your web browser does not mean you have to be online and connected to the Internet to do this. In fact, like CAL9000 (see Recipe 4.5), you can save a copy of this page to your local hard drive and then load it in your web browser whenever you need to do these conversions.
4.4. Working with Base 36 in Perl
Problem
You need to encode or decode Base-36 numbers a lot. Perhaps you have many numbers to convert or you have to make this a programmatic part of your testing.
Solution
Of the tools we use in this book, Perl is the tool of choice. It has a library, Math::Base36, that you can install using the standard CPAN or ActiveState method for installing modules (see Chapter 2). Example 4-2 shows both encoding and decoding of Base-36 numbers.
    #!/usr/bin/perl
    use Math::Base36 qw(:all);

    my $base10num = 67325649178;    # should convert to UXFYBDM
    my $base36num = "9FFGK4H";      # should convert to 20524000481

    my $newb36 = encode_base36( $base10num );
    my $newb10 = decode_base36( $base36num );

    print "b10 $base10num\t= b36 $newb36\n";
    print "b36 $base36num\t= b10 $newb10\n";
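If Perl is not available, the same conversion needs no library at all in Python. The sketch below is an illustrative equivalent, not part of the Math::Base36 module. The function names simply mirror the Perl ones, and the digit alphabet is assumed to be 0-9 followed by uppercase A-Z.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode_base36(number):
    """Convert a non-negative integer to an uppercase Base-36 string."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))


def decode_base36(text):
    """Convert a Base-36 string (either case) to an integer."""
    return int(text, 36)


print(encode_base36(67325649178))   # UXFYBDM
print(decode_base36("9FFGK4H"))     # 20524000481
```

Decoding leans on Python’s built-in int(), which accepts any base up to 36 and ignores letter case.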
4.5. Working with URL-Encoded Data
Problem
URL-encoded data uses the % character and hexadecimal digits to transmit characters that are not allowed in URLs directly. The space, angle brackets (< and >), and slash (solidus, /) are a few common examples. If you see URL-encoded data in a web application (perhaps in a parameter, input, or some source code) and you need to either understand it or manipulate it, you will have to decode it or encode it.
Solution
The easiest way is to use CAL9000 from OWASP. It is a series of HTML web pages that use JavaScript to perform the basic calculations. It gives you an interactive way to copy and paste data in and out and encode or decode it at will.
Encode
Enter your decoded data into the “Plain Text” box, then click on the “Url (%XX)” button to the left under “Select Encoding Type.” Figure 4-2 shows the screen and the results.
Decode
Enter your encoded data into the box labeled “Encoded Text,” then click on the “Url (%XX)” option to the left, under “Select Decoding Type.” Figure 4-3 shows the screen and the results.
Discussion
URL-encoded data is familiar to anyone who has looked at HTML source code or any behind-the-scenes data being sent from a web browser to a web server. RFC 1738 (ftp://ftp.isi.edu/in-notes/rfc1738.txt) defines URL encoding, but it does not require encoding of certain ASCII characters. Notice that, although it isn’t required, there is nothing wrong with unnecessarily encoding these characters. The encoded data in Figure 4-3 shows an example of this. In fact, redundant encoding is one of the ways that attackers mask their malicious input. Naïve blacklists that check for <script> or even %3cscript%3e might not check for %3c%73%63%72%69%70%74%3e, even though all of them are essentially the same.
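To generate the redundant form for your own test strings, a script is handy. The following Python sketch uses the standard urllib.parse module for normal encoding and decoding. The helper encode_every_character is our own illustration of redundant encoding, not a library function.

```python
from urllib.parse import quote, unquote


def encode_every_character(text):
    """Percent-encode every byte, including ones that do not need it."""
    return "".join(f"%{byte:02x}" for byte in text.encode("utf-8"))


minimal = quote("<script>", safe="")           # only what must be encoded
redundant = encode_every_character("<script>")

print(minimal)
print(redundant)

# Both forms decode back to the same string.
print(unquote(minimal) == unquote(redundant) == "<script>")
```

Feeding both forms to the same input field shows whether a filter decodes before it checks.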
One of the great things about CAL9000 is that it is not really software. It is a collection of web pages that have JavaScript embedded in them. Even if your IT policies are super-draconian and you cannot install anything at all on your workstation, you can open these web pages in your browser from a local hard disk and they will work for you. You can easily load them onto a USB drive and load them straight from that drive, so that you never install anything at all.
4.6. Working with HTML Entity Data
Problem
The HTML specification provides a way to encode special characters so that they are not interpreted as HTML, JavaScript, or another kind of command. In order to generate test cases and potential attacks, you will need to be able to perform this kind of encoding and decoding.
Solution
The easiest choice for this kind of encoding and decoding is CAL9000. We won’t repeat the detailed instructions on how to use CAL9000 because it is pretty straightforward to use. See Recipe 4.5 for detailed instructions.
To encode special characters, you enter the special characters in the box labeled “Plain Text” and choose your encoding. You will want to enter a semicolon (;) in the “Trailing Characters” box in CAL9000.
Decoding HTML Entity-encoded characters is the same process in reverse. Type or paste the entity-encoded characters into the “Encoded Text” box and then click on the “HTML Entity” entry under “Select Decoding Type.”
Discussion
HTML entity encoding is an area rich with potential mistakes. The authors have seen many web applications perform multiple rounds of entity encoding (e.g., the ampersand is encoded as &amp;amp;) in one part of the display and perform no entity encoding in other parts of the display. Not only is it important to do correctly, it turns out that since there are so many variations on HTML entity encoding, it is very challenging to write a web application that does handle encoding correctly.
Variations on a theme
There are at least five or six legitimate, relatively well-known methods to encode the same character using HTML entity encoding. Table 4-1 shows a few possible encodings for a single character: the less-than sign (<).
Encoding variation | Encoded character |
Named entity | &lt; |
Decimal value (ASCII or ISO-8859-1) | &#60; |
Hexadecimal value (ASCII or ISO-8859-1) | &#x3c; |
Hexadecimal value (long integer) | &#x003c; |
Hexadecimal value (64-bit integer) | &#x0000003c; |
There are even a few more encoding methods that are specific to Internet Explorer. Clearly, from a testing point of view, if you have boundary values or special values you want to test, you have at least six to eight permutations of them: two or three URL-encoded versions and four or five entity-encoded versions.
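A quick way to confirm that several spellings really are the same character is to decode them all. The Python sketch below uses the standard html module. The list of variants is our own: the named entity, the decimal reference, and hexadecimal references with increasing zero padding. The last two lines show how a second round of encoding turns a single ampersand into something quite different.

```python
import html

variants = ["&lt;", "&#60;", "&#x3c;", "&#x003c;", "&#x0000003c;"]
decoded = [html.unescape(variant) for variant in variants]
print(decoded)  # every entry is the less-than sign

# One round of encoding versus two rounds of encoding.
encoded_once = html.escape("&")
encoded_twice = html.escape(encoded_once)
print(encoded_once, encoded_twice)
```

A parser that decodes all five variants is behaving like a browser; a filter that recognizes only the first is the one to test.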
4.7. Calculating Hashes
Problem
When your application uses hashes, checksums, or other integrity checks over its data, you need to recognize them and possibly calculate them on test data. If you are unfamiliar with hashes, see the upcoming sidebar “What Are Hashes?”
Solution
As with other encoding tasks, you have at least three good choices: OpenSSL, CAL9000, and Perl.
Discussion
The MD5 case is shown using OpenSSL on Unix or on Windows. OpenSSL has an equivalent sha1 command. Note that the -n is required on the Unix echo command to prevent the newline character from being added on the end of your data. Although Windows has an echo command, you can’t use it the same way because there is no way to suppress the carriage-return/linefeed set of characters on the end of the message you give it.
The SHA-1 case is shown as a Perl script, using the Digest::SHA1 module. There is an equivalent Digest::MD5 module that works the same way for MD5 hashes.
Note that there is no way to decode a hash. Hashes are mathematical digests that are one-way. No matter how much data goes in, the hash produces exactly the same size output.
MD5 hashes
MD5 hashes produce exactly 128 bits (16 bytes) of data. You might see this represented in a few different ways:
- 32 hexadecimal characters: df02589a2e826924a5c0b94ae4335329.
- 24 Base 64 characters: PlnPFeQx5Jj+uwRfh//RSw==. You will see it this way if they take the binary output of MD5 (the raw 128 binary bits) and then Base-64 encode it.
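Both representations can be produced from the same digest. Here is a minimal Python sketch using the standard hashlib and base64 modules. The sample message is arbitrary and chosen only for illustration. It also shows that a stray trailing newline produces a completely different hash, which is why the newline warnings in this chapter matter.

```python
import base64
import hashlib

message = b"&a=1&b=2&c=3"   # arbitrary sample data

digest = hashlib.md5(message).digest()                 # raw 16 bytes
as_hex = hashlib.md5(message).hexdigest()              # hexadecimal form
as_base64 = base64.b64encode(digest).decode("ascii")   # Base-64 form

print(len(digest), len(as_hex), len(as_base64))

# One extra newline byte changes the whole digest.
with_newline = hashlib.md5(message + b"\n").hexdigest()
print(as_hex != with_newline)

# SHA-1 works the same way but yields 20 bytes (40 hexadecimal characters).
sha1_hex = hashlib.sha1(message).hexdigest()
print(len(sha1_hex))
```

The fixed lengths are the recognition clue: 32 or 40 hexadecimal characters in a parameter are worth testing as a possible hash.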
4.8. Recognizing Time Formats
Problem
You are likely to see time represented in a lot of different ways. Recognizing a representation of time for what it is will help you build better test cases. Not only knowing that it is time, but knowing what the programmer’s fundamental assumptions might have been when the code was written makes it easier to write targeted test cases.
Solution
Obvious time formats encode the year, month, and day in familiar arrangements, providing either two or four digits for the year. Some include hours, minutes, and seconds, possibly with a decimal and milliseconds. Table 4-2 shows several representations of June 1, 2008, 5:32:11 p.m. and 844 milliseconds. Some of the formats do not represent certain parts of the date or time. The unrepresentable parts are omitted as appropriate.
Discussion
You may think that recognizing time is pretty obvious and not important to someone testing web applications, especially for security. We would argue that it is actually very important. The authors have seen many applications where time was considered to be unpredictable by the developers. Time has been used in session IDs, temporary filenames, temporary passwords, and account numbers. As a simulated attacker, you know that time is not unpredictable. As we plan “interesting” test cases on a given input field, we can narrow down the set of possible test values dramatically if we know it corresponds to a time value from the recent past or recent future.
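To see how small the search space becomes, consider a value that you suspect is a Unix time (whole seconds since January 1, 1970) generated within a minute of a moment you know. The Python sketch below enumerates the candidates. The function name, the reference value, and the 60-second window are all assumptions chosen for illustration.

```python
def candidate_timestamps(reference, window_seconds):
    """Every whole-second Unix time within window_seconds of reference."""
    return range(reference - window_seconds, reference + window_seconds + 1)


# A value issued within one minute either side of a known moment.
candidates = candidate_timestamps(1212355931, 60)
print(len(candidates))   # how many values an attacker would have to try
```

A field that looked like ten unpredictable digits turns out to have only a few hundred plausible values.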
4.9. Encoding Time Values Programmatically
Problem
You have determined that your application uses time in some interesting way, and now you want to generate specific values in specific formats.
Solution
Perl is a great tool for this job. You will need the Time::Local module for some manipulations of Unix time and the POSIX module for strftime. Both are standard modules. The code in Example 4-3 shows you four different formats and how to calculate them.
    #!/usr/bin/perl
    use Time::Local;
    use POSIX qw(strftime);

    # June 1, 2008, 5:32:11pm and 844 milliseconds
    $year  = 2008;
    $month = 5;     # months are numbered starting at 0!
    $day   = 1;
    $hour  = 17;    # use 24-hour clock for clarity
    $min   = 32;
    $sec   = 11;
    $msec  = 844;

    # UNIX Time (Seconds since Jan 1, 1970)    1212355931
    $unixtime = timelocal( $sec, $min, $hour, $day, $month, $year );
    print "UNIX\t\t\t$unixtime\n";

    # populate a few values (wday, yday, isdst) that we'll need for strftime
    ($sec,$min,$hour,$mday,$mon,$year,
     $wday,$yday,$isdst) = localtime($unixtime);

    # YYYYMMDDhhmmss.sss    20080601173211.844
    # We use strftime() because it accounts for Perl's zero-based month numbering
    $timestring = strftime( "%Y%m%d%H%M%S", $sec, $min, $hour, $mday,
                            $mon, $year, $wday, $yday, $isdst );
    $timestring .= ".$msec";
    print "YYYYMMDDhhmmss.sss\t$timestring\n";

    # YYMMDDhhmm    0806011732
    $timestring = strftime( "%y%m%d%H%M", $sec,$min,$hour,$mday,
                            $mon,$year,$wday,$yday,$isdst );
    print "YYMMDDhhmm\t\t$timestring\n";

    # POSIX in "C" Locale    Sun Jun 1 17:32:11 2008
    $gmtime = localtime($unixtime);
    print "POSIX\t\t\t$gmtime\n";
Discussion
You can use perldoc Time::Local or man strftime to find out more about possible ways to format time.
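For comparison, here is a rough Python sketch that produces the same four formats with the standard datetime module. One assumption needs stating loudly: the sample Unix time in the Perl comments depends on the local time zone, and the value 1212355931 corresponds to a zone four hours behind UTC, so the sketch fixes that offset explicitly.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is UTC-4, which matches the sample Unix time 1212355931.
local_zone = timezone(timedelta(hours=-4))

# June 1, 2008, 5:32:11 p.m. and 844 milliseconds (months are 1-based here).
moment = datetime(2008, 6, 1, 17, 32, 11, 844000, tzinfo=local_zone)

unix_time = int(moment.timestamp())
long_form = moment.strftime("%Y%m%d%H%M%S") + f".{moment.microsecond // 1000:03d}"
short_form = moment.strftime("%y%m%d%H%M")
posix_form = moment.ctime()

print("UNIX              ", unix_time)
print("YYYYMMDDhhmmss.sss", long_form)
print("YYMMDDhhmm        ", short_form)
print("POSIX             ", posix_form)
```

Changing the offset in local_zone changes only the Unix time; the formatted strings follow the wall-clock values you supplied.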
Perl’s Time Idiosyncrasies
Although Perl is very flexible and is definitely a good tool for this job, it has its idiosyncrasies. Be careful of the month values when writing code like this. For some inexplicable reason, Perl’s time functions begin counting months with 0. That is, January is 0, and February is 1, instead of January being 1. Days are not done this way. The first day of the month is 1. Furthermore, you need to be aware of how the year is encoded. It is the number of years since 1900. Thus, 1999 is 99 and 2008 is 108. To get a correct value for the year, you must add 1900. Despite all the year 2000 histrionics, there are websites to this day that show the date as 6/28/108.
4.10. Decoding ASP.NET’s ViewState
Problem
ASP.NET provides a mechanism by which the client can store state, rather than the server. Even relatively large state objects (several kilobytes) can be sent as form fields and posted back by the web browser with every request. This is called the ViewState and is stored in an input called __VIEWSTATE on the form. If your application uses this ViewState, you will want to investigate how the business logic relies on it and develop tests around corrupt ViewStates. Before you can build tests with corrupt ViewStates, you have to understand the use of ViewState in the application.
Solution
Get the ViewState Decoder from Fritz Onion (http://www.pluralsight.com/tools.aspx). The simplest use case is to copy and paste the URL of your application (or a specific page) into the URL field. Figure 4-4 shows version 2.1 of the ViewState decoder and a small snapshot of its output.
Discussion
Sometimes the program fails to fetch the ViewState from the web page. That’s really no problem. Just view the source of the web page (see Recipe 3.2) and search for <input type="hidden" name="__VIEWSTATE"...>. Take the value of that input and paste it into the decoder.
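If you have many pages to check, pulling the field out by hand gets tedious. The Python sketch below finds the __VIEWSTATE input in saved page source using the standard html.parser module. The class and function names are hypothetical, and the sample page is invented for illustration. The sketch only extracts the value and undoes its Base 64 layer (see Recipe 4.2). Interpreting the structure inside is still a job for the ViewState Decoder.

```python
import base64
from html.parser import HTMLParser


class ViewStateFinder(HTMLParser):
    """Collect the value of every <input name="__VIEWSTATE"> in a page."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "__VIEWSTATE":
            self.values.append(attributes.get("value") or "")


def extract_viewstate(page_source):
    """Return a list of __VIEWSTATE values found in the page source."""
    finder = ViewStateFinder()
    finder.feed(page_source)
    return finder.values


# Hypothetical page: the value is just Base 64 of b"sample state".
sample_value = base64.b64encode(b"sample state").decode("ascii")
sample_page = (
    '<form><input type="hidden" name="__VIEWSTATE" '
    f'value="{sample_value}" /></form>'
)

for value in extract_viewstate(sample_page):
    print(value, base64.b64decode(value))
```

The extracted string is what you would paste into the decoder.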
If the example in Figure 4-4 were your application, it would suggest several potential avenues for testing. There are URLs in the ViewState. Can they contain JavaScript or direct a user to another, malicious website? What about the various integer values?
There are several questions you should ask yourself about your application, if it is using ASP.NET and the ViewState:
Is any of the data in the ViewState inserted into the URL or HTML of the subsequent page when the server processes it?
Consider that Figure 4-4 shows several URLs. What if page navigation links were derived from the ViewState in this application? Could a hacker trick someone into visiting a malicious site by sending them a poisoned ViewState?
Is the ViewState protected against tampering?
ASP.NET provides several ways to protect the ViewState. One of them is a simple hash code that will allow the server to trap an exception if the ViewState is modified unexpectedly. Another is an encryption mechanism that makes the ViewState opaque to the client and a potential attacker.
Does any of the program logic depend blindly on values from the ViewState?
Imagine an application where the user type (normal versus administrator) was stored in the ViewState. An attacker merely needs to modify it to change his effective permissions.
When it comes time to create tests for corrupted ViewStates, you will probably use tools like TamperData (see Recipe 3.6) or WebScarab (see Recipe 3.4) to inject new values.
4.11. Decoding Multiple Encodings
Problem
Sometimes data is encoded multiple times, either intentionally or as a side effect of passing through some middleware. For example, it is common to see the nonalphanumeric characters (=, /, +) in a Base 64-encoded string (see Recipe 4.2) encoded with URL encoding (see Recipe 4.5). A value such as V+P//z== might be displayed as V%2bP%2f%2f%3d%3d. You’ll need to be aware of this so that when you’ve completed one round of successful decoding, you treat the result as potentially more encoded data.
Solution
Sometimes a single parameter is actually a specially structured payload that carries many parameters. For example, if we see AUTH=dGVzdHVzZXI6dGVzdHB3MTIz, then we might be tempted to consider AUTH to be one parameter. When we realize that the value decodes to testuser:testpw123, then we realize that it is actually a composite parameter containing a user ID and a password, with a colon as a delimiter. Thus, our tests will have to manipulate the two pieces of this composite differently. The rules and processing in the web application are almost certainly different for user IDs and passwords.
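Once you know the structure, you can script the decode, modify, and reencode cycle. The following Python sketch handles the AUTH value above. The function names split_auth and join_auth and the two sample test cases are hypothetical, chosen only to show each half being varied independently.

```python
import base64


def split_auth(encoded):
    """Decode a Base-64 'user:password' value into its two parts."""
    decoded = base64.b64decode(encoded).decode("utf-8")
    user, _, password = decoded.partition(":")
    return user, password


def join_auth(user, password):
    """Re-encode a user ID and password the same way."""
    return base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")


user, password = split_auth("dGVzdHVzZXI6dGVzdHB3MTIz")
print(user, password)

# Vary one part at a time so each test targets a single rule.
long_user_case = join_auth("A" * 300, password)
empty_password_case = join_auth(user, "")
print(empty_password_case)
```

Each generated value is still well-formed Base 64, so a failure points at the user ID or password handling rather than at the decoder.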
Discussion
We don’t usually include quizzes as a follow-up to a recipe, but in this case it might be worthwhile. Recognizing data encodings is a pretty important skill, and an exercise here may help reinforce what we’ve just explained. Remember that some of them may be encoded more than once. See if you can determine the kind of data for each of the following (answers in the footnotes):