Web Security, Privacy & Commerce, 2nd Edition
By Simson Garfinkel with Gene Spafford
2nd Edition November 2001
0-596-00045-6, Order Number: 0456
786 pages, $44.95
The Web's War on Your Privacy
You watch the Web, and the Web watches you. With a few notable exceptions, every time you look at a page on the World Wide Web, somewhere there is a computer that makes note of this fact. Visit a web site designed for parents of small children, then visit another site that is devoted to consumer electronics, and somewhere a computer slowly builds a profile of your interests. Take a few minutes to "register" for an account with your email address, and you'll soon start receiving a stream of emails in your inbox hawking "special offers."
As the Web has created unprecedented opportunities for consumers, it has also created heretofore unimaginable possibilities for marketers, sales organizations, hucksters, tricksters, and outright criminals. A marketing company that puts a billboard up by a highway is content knowing how many cars per day drive by its sign. That same company putting a banner advertisement up on a popular web site would like to know far more information about the people seeing its message--where they live, whether they get their Internet access from a business or through a university, what other web sites the person has visited, and sometimes, even their email addresses. It can be exceedingly difficult to determine the effectiveness of billboards and magazine advertisements. Web advertisements, by contrast, can be metered, examined, and analyzed. All of this power comes at a price to individual privacy, because detailed statistics require more detailed data collection.
The underlying technology of the Internet and the Web was designed to transfer information, not protect the privacy of people who use that information. There are now a whole host of web technologies that make it possible for web sites and third-party services to collect information on web users.
This chapter introduces the broad issue of privacy on the Web and the growing privacy threats. In Chapter 9 we'll describe some straightforward approaches that you can use to protect your privacy. Then in Chapter 10 we'll take a look at some additional software that you can run on your computer to further protect your privacy while enjoying the benefits of the Internet and the Web.
As with most big concepts, people have different definitions for the word privacy. The Merriam-Webster dictionary dates the word privacy back to the 15th century and defines it as "the quality or state of being apart from company or observation" and "freedom from unauthorized intrusion."
The Tort of Privacy
In a famous 1890 article in the Harvard Law Review, Samuel Warren and Louis Brandeis argued that there should be a right to privacy, and that right should "protect those persons with whose affairs the community has no legitimate concern, from being dragged into an undesirable and undesired publicity" and "protect all persons, whatsoever; their position or station, from having matters which they may properly prefer to keep private, made public against their will."
Interestingly, Warren and Brandeis wrote that "truth of the matter published does not afford a defense." They held that a person's privacy is violated by a portrayal of that person's private life whether the portrayal is accurate or inaccurate. Finally, they wrote that: "The absence of `malice' in the publisher does not afford a defense. Personal ill-will is not an ingredient of the offense, any more than in an ordinary case of trespass to person or to property." Over the past 110 years, the privacy violations described in the Warren/Brandeis paper have been reduced to four torts in American law:
- Privacy intrusion
- For example, intruding into a person's private sphere.
- Disclosure of private facts
- For example, the publication of private information about an individual for which the public has no compelling interest to have this information known.
- Portrayal of information in false light
- For example, publishing lurid details of a person's private life that aren't actually true, or information that is strictly true but easily misinterpreted. This tort is similar to defamation, but it is not the same: works that do not defame can nevertheless portray a subject in false light. The false light tort is most common in works that fictionalize real people.
- Appropriation
- For example, using a person's name or likeness for a commercial purpose without that person's permission.
The Harvard Law Review article was the basis for much legislation and litigation in the following years. But despite their vision, Warren and Brandeis didn't create a framework that extended to the computer age, where personal information for millions is now routinely collected, tabulated, indexed, used, and sold. Although similar to the tort of appropriation, the intrusions we face in the computer age have a distinctly different flavor.
In 1967, Columbia University professor Alan Westin created a new definition for privacy that seemed more appropriate to the computer age. Westin defined the term informational privacy as "the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others."
All of these types of privacy come into play on the Web today. Stalkers, spammers, and nosy family members routinely intrude into our mailboxes. Gossips and buggy programs alike distribute private facts beyond their intended audience. Some web sites will appropriate the names of their subscribers and use this information in marketing. Distributing information about a demographic and then saying that a particular user is a member of that demographic may constitute false light. But the largest number of violations of personal privacy on the Web today fall into Westin's characterization of informational privacy--that is, many individuals have lost the ability to control how and to what extent information about them is communicated to marketing firms, government agencies, and nosy neighbors in the world's electronic village.
Personal, Private, and Personally Identifiable Information
The first thing that's apparent when you start to pick apart Westin's definition of informational privacy is that there are many different kinds of "information" the definition can be applied to. The word "information" in Westin's definition could apply to a person's name, and it would certainly apply to a piece of paper that had a person's name, his Social Security number, and the list of web sites that the person had visited over the past month. But what if that piece of paper showed only the list of web sites and the first three digits of the person's Social Security number--would that piece of paper be considered personal information?
To deal with questions like this, academics have subdivided the term "information" into many different subcategories. A few of them are:
- Personal information
- Information about a person. Your name, your date of birth, the school you attended, and the names of your parents are all personal information.
- Private information
- Personal information that is not generally known. Some kinds of private information are protected by law. For example, in the United States education records are considered private and cannot be released without the permission of the individual (or the individual's parent or guardian, in the case of a minor). Bank records are protected by law, although banks are allowed to sell the names and addresses of their customers for marketing purposes.
- Most people have a large amount of information that they consider private but that is not protected under the law. For example, you might consider the name of the first person that you kissed to be private. Other information should be treated as private, even though it is widely available. For example, most people regard their Social Security numbers as private, even though they are available in many databases. This ambiguity arises in part because private is not a synonym for secret or confidential.
- Whether or not a particular piece of information is private frequently depends on the context. For example, if your name is in a telephone directory, that information is not private. But if that directory is on the computer of an individual who is engaged in illegal activity, you might wish to keep the fact that your name is in his address book extremely private.
- Personally identifiable information
- Information from which a person's name or identity can be derived. Some personally identifiable information is obvious, such as a person's name or an account number. Some personal information, such as your shoe size, is not generally identifiable.
- Anonymized information
- The reverse of personally identifiable information. This is personal or private information that has been modified in some way so that identities of the individuals from whom the information was collected can no longer be discerned.
- Aggregate information
- Statistical information combined from many individuals to form a single record. One of the best examples of aggregate information is the statistics on census tracts that are released by the U.S. Census Bureau. According to the Bureau, "Census tracts usually have between 2,500 and 8,000 persons and, when first delineated, are designed to be homogeneous with respect to population characteristics, economic status, and living conditions. Census tracts do not cross county boundaries. The spatial size of census tracts varies widely depending on the density of settlement. Census tract boundaries are delineated with the intention of being maintained over a long time so that statistical comparisons can be made from census to census."
In practice, these categories of personal information are far more fluid than it may seem at first. Often, aggregate information and anonymized information can be combined to identify and reveal particular characteristics of an individual. This process is called triangulation. For example, if you have a class with ten students, and you know that nine of the students are men and one of the students is pregnant, you know with some certainty which student in the class is pregnant. If you have a list of the names of the individuals in the class, you probably know the name of the woman who is pregnant, because most names are strongly identified with a particular gender.
Many Internet users are surprised how easy it is to determine identity from the seemingly anonymous information they provide to web sites. For example, some web sites require a person to register with a name and address, while other web sites require only a Zip code and birthday. Yet for many people in the United States, there are only ten or so people who live in the same Zip code and share the same birthday. Consider:
Number of individuals in the U.S. = approximately 284,000,000 (as of April 2001)
Number of birthdays in the U.S. = 365.25
Number of individuals in the U.S. with each birthday = 284,000,000 / 365.25 = approximately 777,549
Number of Zip codes in the U.S. = approximately 100,000
Number of individuals in each Zip code with the same birthday = 777,549 / 100,000 = approximately 8 people
Thus, a web site that asks a visitor for a birthday, a Zip code, and an age is actually asking its visitors for personally identifiable information, even though it appears to be only asking for aggregate information. If that web site is hooked into the credit files of a company such as Equifax or Experian, the web site might, in turn, have access to information that the visitor considers personal and private, but that is, in fact, quite public and frequently shared among business partners.
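The arithmetic above can be reproduced in a few lines of Python. This is only a rough sketch: it reuses the chapter's approximate 2001 figures and assumes, as the calculation in the text does, that people are spread evenly across birthdays and Zip codes. The function name people_sharing is made up for this illustration.

```python
# Back-of-the-envelope estimate using the chapter's approximate figures.
US_POPULATION = 284_000_000   # approximate, as of April 2001 (from the text)
BIRTHDAYS = 365.25            # days per year, averaged over leap years
ZIP_CODES = 100_000           # approximate (from the text)


def people_sharing(population, *bucket_counts):
    """Average group size after splitting a population evenly across each
    set of buckets in turn. Assumes a uniform spread, which real
    populations only approximate."""
    for count in bucket_counts:
        population /= count
    return population


per_birthday = people_sharing(US_POPULATION, BIRTHDAYS)
per_zip_and_birthday = people_sharing(US_POPULATION, BIRTHDAYS, ZIP_CODES)

print(f"People per birthday:            {per_birthday:,.0f}")
print(f"People per Zip code + birthday: {per_zip_and_birthday:.1f}")
```

Passing one more bucket count, such as a rough number of distinct birth years, shows how quickly the average group shrinks below one person.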
Some of the most detailed, revealing, and damaging sources of personal information on the Internet are Internet users themselves. If you want to buy a t-shirt or a compact disc on the Web, you need to give that web merchant a name and address where the merchandise will be shipped. As the vast majority of the purchases made on the Web are made with credit cards, you'll probably also need to give a credit card number. And because there is a lot of fraud on the Internet, you probably won't get your merchandise without a lot of hassle unless you provide the name and the billing address for the credit card used to pay for the order.
Most web merchants go beyond the minimal information needed to satisfy online orders. For example, a merchant might ask for your email address and a few phone numbers, to allow the merchant to contact you in the event of a mishap. Many merchants set up accounts for their customers so that this information doesn't need to be entered time and again. These accounts require usernames and passwords, and they can be used to track a person's purchases over time. Some merchants go further, and ask their customers to provide the city of their birth, or their mother's maiden name, so that if a consumer forgets his password, another question can be asked (see Figures 8-1 and 8-2).
Figure 8-1. By far, the greatest kind of personal information on the Web today is the information provided by consumers when they register at web sites.
Figure 8-2. Disney's registration page for adults asks for name, email address, gender, and birthday, in addition to mailing address. Many people are surprised how identifying even simple demographic information can be. For example, in many cases a person can be uniquely identified by day of birth (without the year) and Zip code.
At the present time in the United States, there are few restrictions on what web sites can do with personal information once it is collected. While some merchants have posted so-called privacy policies, which may outline some of their rules and restrictions on personal information, posting of privacy policies is voluntary. (For more information on privacy policies, see Chapter 24.) Many other countries have more restrictive rules about what can and cannot be done with personal information, although these countries have, for the most part, been slow to extend their legal regimes to the world of the Internet.
While information provided by users may be the most detailed information collected, by far the most pervasive information collection comes from the operation of the network itself. This data is stored in log files created by network programs and devices.
Log files are ubiquitous. Programmers add log files to their programs to assist in writing and debugging. System operators leave log files enabled so they can verify that software is working correctly, and so they can diagnose the cause of problems when things do not operate properly. Governments and marketers use this information because it is an excellent source of data.
Computers are extraordinarily complicated systems; few system operators are aware of all the log files that their computers create. Many times, a system operator will firmly assert that a particular piece of information is not being retained by their computer system, only to discover that in fact the information is being retained, somewhere in a log file.
There is fundamentally no way for the user of a computer system to know with certainty if a log file is being created of the user's activities. Many organizations that have assured users that records were not being kept of user actions have later discovered that activities were in fact logged. Likewise, many organizations that assumed activities were logged have later discovered problems with the logging system.
Retention and Rotation
Some computer systems automatically age and discard old log files, a process that is called rotation. On other computer systems, there is no formal system for discarding old log file information: these systems retain log files until their disks fill up and somebody manually deletes the log file entries.
For the same reasons it is impossible for the user of a computer system to know if her actions are being logged, it is also impossible to know how long log files are actually retained. Here is an example of a few log files from a moderately busy web server:
% ls -l access*
-rw-r----- 1 root www 312714072 Apr 19 13:42 access_log
-rw-r----- 1 root www 401536508 Apr 15 00:00 access_log.1
-rw-r----- 1 root www 32408676 Apr 8 00:00 access_log.2.gz
-rw-r----- 1 root www 31062796 Apr 1 00:00 access_log.3.gz
This computer appears to retain log files for one month. The file access_log contains a record for each web page downloaded since the beginning of April 15. The file access_log.1 contains a list of all web pages downloaded from the start of April 8 to the end of April 14. The files access_log.2.gz and access_log.3.gz are for the two preceding weeks. These files are smaller than the first two files because they were compressed.
Despite appearances, the organization that operates this web server actually maintains log files for a significantly longer period of time. This is because the organization backs up the directory that contains the log files to magnetic tape. These tapes are stored off-site in a safe deposit box. Although there are no specific records of which log files are backed up and which are not, in an emergency (or under a court order), it might be possible for this organization to retrieve log file records that are a year old or even older.
Practically every time a web browser downloads a page on the Web, a record of this event is routinely recorded in the log files of the remote web server. If the web page is assembled using a database server, the database server may create log files of its own. Finally, web logs are also routinely kept on network firewalls, web proxies, and web caches. As a result, simple web browsing can result in a plethora of records being created on machines in locations that are controlled by multiple organizations.
Log files are under the control of the person or organization that controls the web server. Log files are frequently subpoenaed and used in lawsuits or criminal investigations. Log files can be used by employers to determine what employees are doing when they are at work. Log files can be used by a nosy system administrator to spy on others. But in the vast majority of cases, the information in log files is never looked at by anybody. Because most log files are never consulted, and because the contents of most log files are never revealed, most users of the Internet do not know the full extent to which their activities are recorded.
What's in a web log?
The following information is either stored directly in most web log files or can be readily inferred from other information in web logs:
- The name and IP address of the computer that downloaded the web page.
- The time of the request.
- The URL that was requested.
- The time it took to download the file (this is an indication of the user's Internet connection).
- If HTTP authentication was used, the log file contains the username of the person who downloaded the file.
- Any errors that occurred.
- The previous web page that was downloaded by the web browser (called the refer link).
- The kind of web browser that was used.
This information can be combined with other log files--such as login/logout information from Internet service providers, or logs from mail servers--to discover the actual identity of the person who was doing the downloading. Normally this kind of cross-correlation requires the assistance of another organization, but that is not always the case.
For example, many ISPs dynamically assign IP addresses to computers each time they call up. A web server may know that a user accessed a page from the host free-dial-77.freeport.mwci.net ; someone would then have to go to mwci.net 's log files to find out who the actual user was. On the other hand, sometimes computers are assigned permanent IP addresses; for several years, Simson used a computer named pc-slg.vineyard.net and Spaf would routinely check his email while on the road dialed in from shire-ppp.cs.purdue.edu.
A typical web server log is shown in Example 8-1.
Example 8-1: A sample web server log
free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:04:11 -0500] "GET /awa/issue2/Woodstock.gif HTTP/1.0" 200 26385(compatible; MSIE 3.01; Windows 95)" ""
free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:04:27 -0500] "GET /awa/issue2/WoodstockWoodcut.gif HTTP/1.0" 200 54467
"http://www.vineyard.net/awa/issue2/Wood.html" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
crawl4.atext.com - - [09/Mar/1997:00:04:30 -0500] "GET /org/mvcc/ HTTP/1.0" 200 10768 "-" "ArchitextSpider" ""
www-as6.proxy.aol.com - - [09/Mar/1997:00:04:34 -0500] "GET /cgi-bin/imagemap/mvol/cat2.map?31,39 HTTP/1.0" 302 - "http://www.mvol.com/" "Mozilla/2.0 (Compatible; AOL-IWENG 3.0; Win16)" ""
www-as6.proxy.aol.com - - [09/Mar/1997:00:04:40 -0500] "GET /mvol/photo.html HTTP/1.0" 200 6801
"http://www.mvol.com/" "Mozilla/2.0 (Compatible; AOL-IWENG 3.0; Win16)" ""
www-as6.proxy.aol.com - - [09/Mar/1997:00:04:48 -0500] "GET /mvol/photo2.gif HTTP/1.0" 200 12748
"http://www.mvol.com/" "Mozilla/2.0 (Compatible; AOL-IWENG 3.0; Win16)" ""
free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:05:07 -0500] "GET /awa/issue2/Wood.html HTTP/1.0" 200 37016
"http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=.&q=woodstock" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:05:07 -0500] "GET /awa/issue2/Sprocket1.gif HTTP/1.0" 200 4648
"http://www.vineyard.net/awa/issue2/Wood.html" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:05:08 -0500] "GET /awa/issue2/Sprocket2.gif HTTP/1.0" 200 5506
"http://www.vineyard.net/awa/issue2/Wood.html" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
www-as6.proxy.aol.com - - [09/Mar/1997:00:05:09 -0500] "GET /mvol/peter/index.html HTTP/1.0" 200 891 "http://www.vineyard.net/mvol/photo.html" "Mozilla/2.0 (Compatible; AOL-IWENG 3.0; Win16)" ""
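To make the fields in Example 8-1 concrete, here is a minimal Python sketch that splits one such line into the items listed earlier (host, time, URL, status, size, refer link, and browser). The regular expression and the name parse_log_line are our own illustration and assume the layout shown in the example; real servers can be configured to log other formats.

```python
import re

# Assumed layout: host, two identity fields, [timestamp], "request",
# status, size, then optional quoted refer link and browser fields.
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?:\s+"(?P<referer>[^"]*)"\s+"(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means that no body was sent (for example, a redirect).
    record["size"] = None if record["size"] == "-" else int(record["size"])
    return record


sample = ('crawl4.atext.com - - [09/Mar/1997:00:04:30 -0500] '
          '"GET /org/mvcc/ HTTP/1.0" 200 10768 "-" "ArchitextSpider" ""')
record = parse_log_line(sample)
print(record["host"], record["time"], record["url"], record["agent"])
```

Once lines are parsed this way, grouping the records by host yields the sequence of pages that each computer requested, which is exactly the kind of profile this chapter is concerned with.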
The refer link field
The refer link field is another source of privacy violations. It works like this: whenever you, as a web surfer, look for a new page, one of the pieces of information that is sent along is the URL of the page that you are currently looking at. (The HTTP specification says that sending this information should be an option left up to the user to decide, but we have never seen a web browser where sending the refer information is optional.)
One of the main uses that companies have found for the refer link is to gauge the effectiveness of advertisements they purchase on other web sites. Another use is charting how customers move through a site. The refer link field can also reveal personal information--namely, the URL of the page that a user was looking at before he or she clicked into your site.
Refer links frequently reveal unintended information. When you click the link of a web search engine, for instance, the refer link that is sent to the remote web server encodes the search that you were performing. Consider this entry from the log file of the www.simson.net web server:
pc109240.stofanet.dk - - [21/Mar/2001:16:27:25 -0500] "GET /clips/95.SJMN.AltKeyboards.txt HTTP/1.1" 200 9988 "http://www.google.com/search?hl=da&safe=off&q=%22Building+a+better+keyboard+%22&lr=" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"
This log file entry indicates that the user of the computer pc109240.stofanet.dk was searching on the web search engine Google for the phrase "Building a better keyboard" on March 21, 2001. Sometimes, the results of a refer field can give away far more information than the web user might wish.
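Recovering the search phrase from a refer link takes very little code. The following Python sketch uses the standard urllib.parse module; the function name search_terms is ours, and the assumption that the search phrase lives in a parameter named q holds for the Google URL above but not necessarily for other search engines.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms(referer, param="q"):
    """Return the decoded values of one query parameter in a refer link."""
    query = urlsplit(referer).query
    return parse_qs(query).get(param, [])


referer = ("http://www.google.com/search?hl=da&safe=off"
           "&q=%22Building+a+better+keyboard+%22&lr=")
print(search_terms(referer))
```

The same function reveals the user's interface language if it is asked for the hl parameter instead.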
As is the case with Google, the largest number of privacy violations involving refer fields occur with HTML forms that use the GET method (as opposed to the POST method). This is because the GET method encodes the contents of each field in the URL itself. The big advantage of using the GET method is that it allows people to bookmark filled-in forms, such as searches. For example, opening the URL http://www.google.com/search?q=simson will automatically perform a Google search for the name "Simson." But if the URL of the previous web page contains a credit card number or other personal information that was submitted through a GET form, that information will leak from one web site to another through the refer link.
Obscuring web logs
Proxy servers can render web logs less useful. When a user accesses a web server through a proxy, the web server records the proxy's address, rather than the address of the user's machine. For example, most users who access the Internet through America Online do so through the company's proxy server.
Web proxies do not necessarily give web users anonymity: the user's identity can still be learned by referring to the proxy's logs. Proxies simply make the task more difficult.
RADIUS (Remote Authentication Dial-In User Service) is widely used on the Internet by ISPs and large organizations to validate usernames/passwords for dialup users and to provide for proper accounting. Originally designed by Livingston, RADIUS is now widely implemented by Cisco, Nortel, Lucent, Redback, and most other vendors. Although RADIUS provides functionality that is similar to Cisco's proprietary TACACS and TACACS+ protocols, RADIUS became the dominant protocol because clients and servers were distributed in source-code form, because it was extensible, and because it provided for encryption of passwords sent over the wire (unlike TACACS).
RADIUS log files contain an astonishing amount of information, including usernames, times, IP addresses, and even CALLER-ID information. Here are two example RADIUS records that were created with a Livingston Portmaster 3:
Thu Apr 19 13:54:09 2001
Acct-Session-Id = "0E027BE9"
User-Name = "beth"
NAS-IP-Address = 18.104.22.168
NAS-Port = 43
NAS-Port-Type = Async
Acct-Status-Type = Start
Acct-Authentic = RADIUS
Connect-Info = "50666 LAPM/V42BIS"
Called-Station-Id = "5086292329"
Calling-Station-Id = "5086962222"
Service-Type = Framed
Framed-Protocol = PPP
Framed-IP-Address = 22.214.171.124
Acct-Delay-Time = 0
Thu Apr 19 13:54:52 2001
Acct-Session-Id = "0E027BD6"
User-Name = "simson"
NAS-IP-Address = 126.96.36.199
NAS-Port = 34
NAS-Port-Type = Async
Acct-Status-Type = Stop
Acct-Session-Time = 2350
Acct-Authentic = RADIUS
Connect-Info = "14400 LAPM/V42BIS"
Acct-Input-Octets = 18321
Acct-Output-Octets = 108087
Called-Station-Id = "6173442329"
Calling-Station-Id = "6178761111"
Acct-Terminate-Cause = Idle-Timeout
Vendor-Specific = vLivingston-020e49646c652054696d656f7574
Service-Type = Framed
Framed-Protocol = PPP
Framed-IP-Address = 188.8.131.52
Acct-Delay-Time = 0
The CALLER-ID information in RADIUS logs was instrumental in determining the identity of the author of the Melissa computer worm.
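Records in this layout are easy to process mechanically. The Python sketch below turns one record into a dictionary and reports who called, from which telephone number, and for how long. The function name parse_radius_record is ours, and the sample is an abbreviated copy of the second record above; Acct-Session-Time is measured in seconds.

```python
def parse_radius_record(text):
    """Parse one accounting record: a timestamp line followed by
    'Attribute = value' lines."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    record = {"timestamp": lines[0]}
    for line in lines[1:]:
        name, _, value = line.partition(" = ")
        record[name] = value.strip('"')
    return record


# Abbreviated copy of the second record shown above.
sample = """\
Thu Apr 19 13:54:52 2001
Acct-Session-Id = "0E027BD6"
User-Name = "simson"
Acct-Status-Type = Stop
Acct-Session-Time = 2350
Calling-Station-Id = "6178761111"
"""

record = parse_radius_record(sample)
minutes = int(record["Acct-Session-Time"]) / 60
print(f'{record["User-Name"]} called from {record["Calling-Station-Id"]} '
      f'and was connected for about {minutes:.0f} minutes')
```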
Every time an email message is sent, received, or transported through a mail server, there is a good chance that some program somewhere is making note of that fact in a mail log file. Mail logs usually contain the from: and to: email addresses, the time that the message was sent, and the size of the message. Subject: lines and content are usually not logged, although they certainly could be.
Here is an example of a mail log:
Apr 20 11:43:42 <mail.info> r2 sendmail: f3KGhg150422: from=<email@example.com>, size=2468, class=-60, nrcpts=1, msgid=<firstname.lastname@example.org>, proto=ESMTP, daemon=Daemon0, relay=groucho.ctel.net [184.108.40.206]
Apr 20 11:43:42 <mail.info> r2 sendmail: f3KGhg150422: to=<email@example.com>, delay=00:00:00, xdelay=00:00:00, mailer=local, pri=139359, relay=local, dsn=2.0.0, stat=Sent
Apr 20 11:43:54 <mail.info> r2 sendmail: f3KGhs150426: from=<firstname.lastname@example.org>, size=1138, class=0, nrcpts=1, msgid=<Pine.GSO.email@example.com>, proto=ESMTP, daemon=Daemon0, relay=mc.lcs.mit.edu [220.127.116.11]
Apr 20 11:43:58 <mail.info> r2 sendmail: f3KGhs150426: to=<firstname.lastname@example.org>, delay=00:00:04, xdelay=00:00:04, mailer=local, pri=30456, relay=local, dsn=2.0.0, stat=Sent
Apr 20 11:44:13 <mail.info> r2 sendmail: f3KGiC150432: from=<email@example.com>, size=4303, class=-60, nrcpts=1, msgid=<200104201642.NAA18970@kiln.isn.net>, proto=ESMTP, daemon=Daemon0, relay=groucho.ctel.net [18.104.22.168]
Apr 20 11:44:13 <mail.info> r2 sendmail: f3KGiC150432: to=<firstname.lastname@example.org>, delay=00:00:01, xdelay=00:00:00, mailer=local, pri=141458, relay=local, dsn=2.0.0, stat=Sent
Mail logs are useful for determining people who exchange email and users on mailing lists. (In the example above, the user "beth" is evidently on the mailing list email@example.com.)
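As a rough illustration of how such a log reveals who corresponds with whom, the Python sketch below pairs the from= and to= entries that share a sendmail queue ID. The function name correspondents and the two addresses in the sample are hypothetical; the line layout is assumed to match the sendmail example above.

```python
import re
from collections import defaultdict

# Each sendmail entry carries a queue ID that ties the "from" line of a
# message to its "to" line(s).
ENTRY = re.compile(r'sendmail: (?P<qid>\w+): (?P<fields>.*)')
ADDRESS = re.compile(r'\b(?P<kind>from|to)=<(?P<addr>[^>]*)>')


def correspondents(log_lines):
    """Map each queue ID to the sender and recipients logged for it."""
    messages = defaultdict(lambda: {"from": None, "to": []})
    for line in log_lines:
        entry = ENTRY.search(line)
        if entry is None:
            continue
        message = messages[entry["qid"]]
        for found in ADDRESS.finditer(entry["fields"]):
            if found["kind"] == "from":
                message["from"] = found["addr"]
            else:
                message["to"].append(found["addr"])
    return dict(messages)


# Hypothetical addresses; the layout follows the example log above.
lines = [
    'Apr 20 11:43:42 <mail.info> r2 sendmail: f3KGhg150422: '
    'from=<list@lists.example.org>, size=2468, nrcpts=1, proto=ESMTP',
    'Apr 20 11:43:42 <mail.info> r2 sendmail: f3KGhg150422: '
    'to=<beth@example.net>, delay=00:00:00, mailer=local, stat=Sent',
]
result = correspondents(lines)
for qid, message in result.items():
    print(qid, message["from"], "->", ", ".join(message["to"]))
```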
The bind DNS nameserver produced by the Internet Software Consortium can be configured to log every DNS query that it receives. The bind log file contains the name of the host from which each query was made, the IP address from which the query was made, and the query itself. An example of such a log file is shown here:
Apr 20 13:18:17 <local2.info> r2 named: XX /22.214.171.124/queen.simson.net/A/IN
Apr 20 13:18:20 <local2.info> r2 named: XX+/126.96.36.199/188.8.131.52.in-addr.arpa/PTR/IN
Apr 20 13:18:20 <local2.info> r2 named: XX+/184.108.40.206/220.127.116.11.in-addr.arpa/PTR/IN
Apr 20 13:18:20 <local2.info> r2 named: XX+/18.104.22.168/groucho.ctel.net/A/IN
Apr 20 13:18:20 <local2.info> r2 named: XX+/22.214.171.124/groucho.ctel.net/ANY/IN
Apr 20 13:18:20 <local2.info> r2 named: XX+/126.96.36.199/walden.cambridge.ma.us/ANY/IN
Apr 20 13:18:21 <local2.info> r2 named: XX+/188.8.131.52/ctel.net/ANY/IN
Apr 20 13:18:21 <local2.info> r2 named: XX+/184.108.40.206/earthlink.net/ANY/IN
Apr 20 13:18:36 <local2.info> r2 named: XX /220.127.116.11/queen.simson.net/A/IN
Apr 20 13:18:36 <local2.info> r2 named: XX /18.104.22.168/www.dbz.ex.com/A/IN
Logging DNS queries can be useful for system maintenance and forensic work. It is also a great way to silently monitor the activities of customers or other individuals. Because a computer must look up a hostname in the DNS before it can fetch a URL that contains that hostname, monitoring a user's DNS usage provides an ISP with a detailed report of each web site that the user accesses. Monitoring DNS queries can also give pointers to attackers, as even attackers who launch their attacks from third-party machines frequently perform DNS queries from their home machines first.
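The following Python sketch shows how little work it takes to turn a query log like the one above into a per-client list of hostnames. The function name lookups_by_client is ours, the client address in the sample is a placeholder from the range reserved for documentation, and the regular expression assumes the "XX" layout shown in the example rather than any general bind log format.

```python
import re
from collections import defaultdict

# Assumed entry layout: "named: XX" then "+" or a space, followed by
# /client-address/query-name/query-type/query-class
QUERY = re.compile(
    r'named: XX[+ ]?/(?P<client>[^/]+)/(?P<name>[^/]+)'
    r'/(?P<qtype>\w+)/(?P<qclass>\w+)'
)


def lookups_by_client(log_lines):
    """Group forward lookups (A and ANY queries) by the client that made them."""
    seen = defaultdict(list)
    for line in log_lines:
        found = QUERY.search(line)
        if found and found["qtype"] in ("A", "ANY"):
            seen[found["client"]].append(found["name"])
    return dict(seen)


# Placeholder client address; hostnames taken from the example log.
lines = [
    'Apr 20 13:18:17 <local2.info> r2 named: XX /192.0.2.10/queen.simson.net/A/IN',
    'Apr 20 13:18:20 <local2.info> r2 named: XX+/192.0.2.10/10.2.0.192.in-addr.arpa/PTR/IN',
    'Apr 20 13:18:36 <local2.info> r2 named: XX /192.0.2.10/www.dbz.ex.com/A/IN',
]
result = lookups_by_client(lines)
for client, names in result.items():
    print(client, names)
```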
A cookie is a block of ASCII text that a web server can pass into a user's instance of Netscape Navigator (and many other web browsers). Once received, the web browser sends the cookie every time a new document is requested from the web server. Cookies are transmitted by the underlying HTTP protocol, which means that they can be sent with HTML files, images (GIFs, JPEGs, and PNGs), sounds, or any other data type.
Netscape introduced "cookies" with Navigator Version 2.0. The original purpose of cookies was to make it possible for a web server to track a client through multiple HTTP requests. This sort of tracking is needed for complex web-based applications that need to maintain state between web pages.
Typical applications for cookies include the following:
- A catalog site might use a cookie to implement an electronic "shopping cart."
The preliminary cookie specification can be found at http://www.netscape.com/newsref/std/cookie_spec.html. RFC 2965, dated October 2000, outlines a proposed codification of the cookie specification, but as of August 2001 this standard had still not been adopted by the IETF.
The Cookie Protocol
Here is a sample Set-Cookie header:
Set-Cookie: comics=broomhilda+foxtrot+garfield; path=/comics; domain=.comics.net; [secure]
The Set-Cookie header contains a series of name=value pairs that are encoded according to the HTTP specification for encoding URLs. The previous example contains a single name=value field that sets the name comics to be the value "broomhilda+foxtrot+garfield." There are some special values:
- expires: Specifies the time when the cookie will expire. If no expiration time is provided, then the cookie is not written to the computer's hard disk, and it lasts only as long as the current session.
- domain: Specifies which computers will be sent the cookie. Normally, cookies will only be sent back to the computer that first sent the cookie to the user. In this example, the cookie will be sent to any host in the comics.net domain. If the domain is left blank, the domain is assumed to be the same as the domain for the web server that provided the cookie.
- path: Controls which of the references will trigger the sending of the cookie. If path is not specified, the cookie will be sent for all HTTP transmissions to the web site. If path=/directory, then the cookie will only be sent when the pages underneath /directory are referenced. In this example, the cookies will be sent to any URL that is underneath the /comics/ directory.
- secure: If the word secure is provided as part of the Set-Cookie header, then the cookie can only be transmitted via SSL. (Don't depend on this facility to keep the contents of your cookies private, as they are still stored unencrypted on the hard disk.)
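The structure of the sample header can be seen by taking it apart in code. The following is a minimal sketch using Python's standard http.cookies module; the helper name parse_set_cookie is ours, and we have written the optional secure flag out in full.

```python
from http.cookies import SimpleCookie
from urllib.parse import unquote_plus

# The sample header from the text, with the optional secure flag written out.
HEADER = ("Set-Cookie: comics=broomhilda+foxtrot+garfield; "
          "path=/comics; domain=.comics.net; secure")

def parse_set_cookie(header_line):
    """Return (name, decoded value, attributes) for a single Set-Cookie line."""
    _, _, payload = header_line.partition(":")   # drop the "Set-Cookie" label
    jar = SimpleCookie()
    jar.load(payload.strip())
    name, morsel = next(iter(jar.items()))
    attributes = {
        "expires": morsel["expires"],    # empty string when absent: a session cookie
        "domain": morsel["domain"],
        "path": morsel["path"],
        "secure": bool(morsel["secure"]),
    }
    # unquote_plus undoes the URL encoding, turning "+" back into spaces.
    return name, unquote_plus(morsel.value), attributes

if __name__ == "__main__":
    print(parse_set_cookie(HEADER))
```

Because this sample has no expires attribute, a browser would treat it as a session cookie and never write it to disk.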
Once a browser has a cookie, that cookie is transmitted by the browser with every successive request to the remote web site. For example, if the previous cookie was loaded into a browser and the browser attempted to fetch the URL http://www.comics.net/index.html, the following HTTP headers could be sent to the remote site:
GET /index.html HTTP/1.0
Here is an actual HTTP header sent by the site www.hotbot.com at 8:10 a.m. on April 21, 2001:
HTTP/1.1 200 OK
Date: Sat, 21 Apr 2001 12:05:56 GMT
Set-Cookie: lubid=01000008C73351C5086C3AE177A40000351200000000; expires=Mon, 18-Jan-2038 08:00:00 GMT; domain=.lycos.com; path=/
Set-Cookie: p_uniqid=aD3QMJX/K93Z; expires=Fri, 21-Dec-2012 08:00:00 GMT; domain=; path=/
Set-Cookie: remotehost=secondary=chi%2Emegapath&top=net; expires=Mon, 21-May-2001 07:00:00 GMT; path=/
Set-Cookie: HB%5FSESSION=BT=lowend&BA=false&VE=& PL=Unknown&MI=u&BR=Unknown&MA=0&BC=1; path=/
The HotBot site sends four cookies, shown in Table 8-1.
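Table 8-1: Cookies sent by www.hotbot.com at 8:10 a.m. EST on April 21, 2001
Cookie #  Name          Domain      Path  Expires
1         lubid         .lycos.com  /     18-Jan-2038 08:00:00 GMT
2         p_uniqid      (blank)     /     21-Dec-2012 08:00:00 GMT
3         remotehost    (not set)   /     21-May-2001 07:00:00 GMT
4         HB%5FSESSION  (not set)   /     (none; lasts for the session only)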
Cookie #1 assigns a user tracking identifier to the web browser. Many web sites use such cookies to determine the number of unique visitors that they receive every month. Notice that although this cookie was downloaded from the site www.hotbot.com, its domain is set to .lycos.com. This cookie is what is called a third-party cookie. HotBot is a business unit of Lycos; this cookie allows Lycos to identify which Lycos users are also HotBot users. This type of cross-site cookie is permitted by some browsers but prohibited by others.
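A browser that prohibits cross-site cookies has to compare the host that sent the cookie with the domain the cookie names. The following is a simplified sketch of such a check; the function names are ours, and real browsers apply additional rules that are not modeled here.

```python
def domain_matches(host, cookie_domain):
    """True if host falls inside cookie_domain (tail match on a label boundary)."""
    host = host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    return host == cookie_domain or host.endswith("." + cookie_domain)

def is_cross_site(setting_host, cookie_domain):
    """True if a cookie names a domain that the host setting it does not belong to."""
    return not domain_matches(setting_host, cookie_domain)

if __name__ == "__main__":
    # The earlier sample: www.comics.net setting a cookie for .comics.net.
    print(is_cross_site("www.comics.net", ".comics.net"))   # same domain
    # Cookie #1: www.hotbot.com setting a cookie for .lycos.com.
    print(is_cross_site("www.hotbot.com", ".lycos.com"))    # different domain
```

Matching on a label boundary matters: a host such as evilcomics.net must not be treated as part of comics.net merely because the strings share an ending.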
Cookie #2 is another user tracking cookie, but this one is solely for the HotBot site.
The purposes of Cookie #3 and Cookie #4 cannot immediately be determined from inspection. We contacted Lycos, HotBot's owner, to find out the purpose of these cookies. We were pointed at FAQs about how to disable cookies, but after several months of trying, we were unable to discover their actual purpose.
Broadly speaking, there are two ways that a web site can implement cookies:
- The web site can use the cookie to contain the user's actual data.
- The cookie can simply contain a number of codes that key into a database that resides at the web provider.
Examples of these two approaches are shown in Table 8-2.
Table 8-2: Schematic views of cookies that contain customer data versus those that merely point to a database
Purpose of cookie
Possible contents for an implementation that keeps data on the user's computer
Possible contents for an implementation that keeps data on the provider's computer
Provide customized weather reports and local news for a web site.
Implement a shopping cart
Provide sign-on to a web site
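The difference between the two approaches can also be sketched in code. In this hypothetical example the preference names, the user identifier, and the in-memory DATABASE dictionary are all illustrative stand-ins, not taken from any real site.

```python
from urllib.parse import urlencode, parse_qs

# Approach 1: the cookie itself carries the user's data.
def make_data_cookie(prefs):
    return urlencode(prefs)

def read_data_cookie(cookie_value):
    return {key: values[0] for key, values in parse_qs(cookie_value).items()}

# Approach 2: the cookie carries only a key; the data stays with the provider.
DATABASE = {}  # stand-in for a database on the provider's computer

def make_key_cookie(user_id, prefs):
    DATABASE[user_id] = dict(prefs)
    return "id=" + user_id

def read_key_cookie(cookie_value):
    user_id = cookie_value.partition("=")[2]
    return DATABASE[user_id]

if __name__ == "__main__":
    prefs = {"zip": "02139", "units": "F"}
    print(make_data_cookie(prefs))            # the data travels in the cookie
    print(make_key_cookie("u1001", prefs))    # only an identifier travels
```

With the first approach the server needs no record of the user at all; with the second, everything the site knows about the user accumulates on the provider's side under a single identifier.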
Cookies were originally envisioned as a place on the client where web servers could store user preferences and personal information. This way, no personal information would need to be stored on the server. But as the cookies from the HotBot web site show, today one of the most popular uses of cookies is to give a permanent identification number to each user so that the number of "unique visitors" to a web site can be measured. These numbers can be very important when a company is attempting to sell advertising space on its web site.
Cookies and Privacy
Cookies can be used to improve privacy or to weaken it. Unfortunately, it is very difficult to tell when a cookie is being used for one purpose and when it is used for another.
Cookies can significantly weaken personal privacy when they are used to tie together a whole set of seemingly unconnected facts and pieces of information from different web sites to create an electronic fingerprint of a person's online activities. Cookies like this usually contain a single identifier. This identifier is a key into a database. The cookie for Doubleclick in Example 8-2 is typical of such a cookie.
Cookies can also be used to improve privacy by eliminating the consolidation of personal information. Instead of storing the information in a central location, these cookies store a person's preferences in the cookie itself. For example, a web site might download a cookie into a person's web browser that records whether the person prefers to see web pages with a red background or with a blue background. A web site that offers news, sports, and financial information could use a cookie to store the user's preferred front page.
The cookie from the DigiCrime web site is this sort of privacy-protecting cookie:
www.digicrime.com FALSE FALSE 942189160 DigiCrime virus=1
This cookie tracks the number of times that the user has visited the DigiCrime web site without necessitating the creation of a large user tracking database on the DigiCrime site itself. Each time you visit the DigiCrime web site, the virus cookie is incremented. The web site has different behavior when the "virus" counter reaches different ordinals.
Keeping information about a user in a cookie, rather than in a database on the web server, means that it is not necessary to track sessions: the server can become essentially stateless. And there is no need to worry about expiring the database entries for people who clicked into the web site six months ago and haven't been heard from since. Perhaps most importantly, there is no database of personal information that needs to be protected.
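A stateless counter of this kind might look like the following sketch. It is modeled loosely on the DigiCrime idea rather than on that site's actual code: the cookie name visits and the function name are our own assumptions.

```python
from http.cookies import SimpleCookie

def count_visit(cookie_header):
    """Return (visits, Set-Cookie value), keeping the count only in the cookie."""
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    try:
        visits = int(jar["visits"].value) + 1
    except (KeyError, ValueError):
        visits = 1  # no cookie yet, or one that has been tampered with
    reply = SimpleCookie()
    reply["visits"] = str(visits)
    reply["visits"]["path"] = "/"
    return visits, reply["visits"].OutputString()

if __name__ == "__main__":
    print(count_visit(""))           # first visit
    print(count_visit("visits=1"))   # returning visitor
```

The server stores nothing between requests: everything it needs arrives in the Cookie header and leaves again in the Set-Cookie header.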
Unfortunately, using cookies this way takes a lot of work and thoughtful programming. It's much simpler to hurl a cookie with a unique ID at somebody's browser and then index that number to a relational database on the server. For one thing, this makes it simpler to update the information contained in the database because there is no requirement to be able to read and decode the format of old cookies.
Cookies allow advertisers to have a great deal of control over the advertisements that each user sees, regardless of the actual web site that a person is visiting. For example, using cookies, an advertiser can assure that each person will only see a particular Internet advertisement once (unless the advertiser pays for repeat exposure, of course). Cookies can be used to display a sequence of advertisements to a single user, even if they are jumping around among different pages on different web sites. Cookies allow users to be targeted by area of interest. Advertisers can further tailor advertisements to take into account the query terms that web surfers use.
All cookies are open to examination. Unfortunately, it can be very difficult to determine what cookies are used for by merely examining them, as the cookies in Table 8-1 demonstrate.
Cookies are kept in the web browser's memory. If a cookie is persistent (that is, it has an expiration date), the cookie is also saved by the web browser on the computer's hard drive.
Netscape Navigator and Internet Explorer store cookies in different ways. Navigator stores cookies in a single file called cookies.txt, which can be found in the user's preference directory. (On Unix systems, Navigator stores cookies in the ~/.netscape/cookies file.)
A sample Netscape cookies file is shown in Example 8-2.
Example 8-2: A sample Netscape cookies file
# Netscape HTTP Cookie File
# This is a generated file! Do not edit.
.techweb.com TRUE /wire/news FALSE 942169160 TechWeb 22.214.171.124.852255600 path=/
.hotwired.com TRUE / FALSE 946684799 p_uniqid yQ63oN3ALxO1a73pNB
.talk.com TRUE / FALSE 946684799 p_uniqid y46RXMoBwFwD16ZFTA
.packet.com TRUE / FALSE 946684799 p_uniqid y86ijMoA9MhsGhluvB
.boston.com TRUE / FALSE 946684799 INTERSE stl-mo8-10.ix.netcom.com20748850376179639
.netscape.com TRUE / FALSE 1609372800 MOZILLA MOZ-ID=DFJAKGLKKJRPMNX[-]MOZ_VERS=1.2[-]MOZ_FLAG=2[-]MOZ_TYPE=5[-]MOZ_CK=AJpz085+6OjN_Ao1[-]
.netscape.com TRUE / FALSE 1609372800 NS_IBD IBD_SUBSCRIPTIONS=INC005|INC010|INC017|INC018|INC020|INC021|INC022|INC034|INC046
www.xmission.com FALSE / FALSE 946511999 RoxenUserID 0x7398
ad.doubleclick.net FALSE / FALSE 942191940 IAF 22348bb
.focalink.com TRUE / FALSE 946641600 SB_ID ads01.28425853273216764786
gtplacer.globaltrack.com FALSE / FALSE 942105660 gtzopyid 85317245
.netscape.com TRUE / FALSE 1585744496 REG_DATA C_DATE_REG=13:06:51.304128 01/17/97[-]C_ATP=1[-]C_NUM=0[-]
www.digicrime.com FALSE FALSE 942189160 DigiCrime virus=1
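The entries in such a file can be read with a few lines of code. This sketch assumes the usual Netscape layout of seven tab-separated fields per line (domain, a flag saying whether all hosts in the domain may read the cookie, path, the secure flag, expiration time in seconds since 1970, name, and value); the field names in FIELDS are our own labels.

```python
from datetime import datetime, timezone

FIELDS = ("domain", "all_hosts", "path", "secure", "expires", "name", "value")

def parse_cookie_line(line):
    """Parse one tab-separated line of a Netscape-style cookies file, or return None."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # comments and blank lines
    parts = line.split("\t")
    if len(parts) != len(FIELDS):
        return None  # malformed entry
    record = dict(zip(FIELDS, parts))
    record["all_hosts"] = record["all_hosts"] == "TRUE"
    record["secure"] = record["secure"] == "TRUE"
    # The expiration time is stored as seconds since January 1, 1970 (UTC).
    record["expires"] = datetime.fromtimestamp(int(record["expires"]), tz=timezone.utc)
    return record

if __name__ == "__main__":
    sample = ".hotwired.com\tTRUE\t/\tFALSE\t946684799\tp_uniqid\tyQ63oN3ALxO1a73pNB"
    print(parse_cookie_line(sample))
```

Decoding the expiration field is often revealing: the value 946684799 used by several of the cookies above is the last second of 1999.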
Internet Explorer saves each cookie in an individual file. The files are stored in the directory referenced by the Registry name Cookies, in the key \HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders. This directory is C:\Windows\Cookies on Windows 95/98/ME systems configured for a single user, or C:\Windows\Profiles\username\Cookies on Windows 95/98/ME systems configured for multiple users (see Figure 8-3). A sample Internet Explorer Cookies file is shown in Example 8-3.
RFC 2109 on Cookies
RFC 2109 describes the HTTP state management system (i.e., cookies). According to the RFC, any web browser that implements cookies should provide users with at least the following controls:
- The ability to completely disable the sending and saving of cookies.
- A (preferably visual) indication as to whether cookies are in use.
- A means of specifying a set of domains for which cookies should or should not be saved.
Example 8-3: The contents of an Internet Explorer Cookies file.
Figure 8-3. Internet Explorer stores cookies in files in the Cookies directory. You can delete a cookie by clicking on the cookie with the mouse and hitting the "Delete" key.
Users can modify the contents of their cookies. For this reason, a web site should always regard a cookie's contents as potentially suspect. If the cookie is used to gain access to information that might be considered private, confidential, or sensitive, then measures should be built into the cookie so that a modified cookie will not be accepted by the web application.
Consider the following two hypothetical cookies. Both of these cookies belong to a hypothetical web site that allows a consumer to view stored transactions. The cookies give the consumer access by providing the consumer's identification number to the web application server. The first cookie is not a secure cookie. The second cookie may be secure, as we will explain.
- Cookie #1
- Cookie #2
In the first cookie, the consumer's identification number is simply "4531." Presumably, these identification numbers are being assigned in a sequential order. If the consumer were to edit his or her cookie file and change the number from "4531" to another number, like "4533," it is quite probable that the consumer would then have access to another consumer's order information. Essentially, the first consumer can easily create counterfeit cookies!
A consumer visiting a web site that uses the second cookie can change his identification number as well. However, a consumer changing "34343339336" to another number is likely to be less successful than a consumer changing the number "4531." This second web site almost certainly does not assign its identification numbers sequentially; there are not 34,343,339,336 Internet users (yet)! So a consumer making a change to this second cookie is unlikely to accidentally hit upon a valid identification number belonging to another consumer.
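One way to obtain identifiers that cannot be guessed by counting is to draw them at random from a very large space. The following is a minimal sketch using Python's standard secrets module; the function name and the 16-byte length are our choices, not a recommendation taken from any particular site.

```python
import secrets

def new_customer_token(nbytes=16):
    """Return a random hexadecimal identifier that is impractical to guess."""
    return secrets.token_hex(nbytes)

if __name__ == "__main__":
    # Unlike 4531 and 4533, two tokens issued back to back share no pattern.
    print(new_customer_token())
    print(new_customer_token())
```

Random identifiers only make counterfeiting unlikely; they do nothing for a user whose cookie is copied outright.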
To create the most secure cookies, some web sites use digital signatures or cryptographic message authentication codes (MACs). Such techniques make it exceedingly unlikely that a consumer will be able to create a counterfeit cookie, provided that the MAC actually covers all of the information in the cookie, rather than the data in the fields after they are decoded. More information on creating cookies that are really secure can be found in Chapter 16.
TIP: Some web sites are set up so that if you have a cookie, you are given unrestricted access to your account information. Other web sites are set up so that even if you have a cookie, you must still type a password to gain access to your confidential information. In general, web sites that require a password to be typed are more secure. This is because your cookie can easily end up on somebody else's machine--for example, if you check your account information using a friend's computer. If you are a web developer, you should never make the mistake of thinking that cookies are secure.
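Returning to the MAC technique mentioned above, here is one way it might be sketched. The choice of HMAC-SHA256, the "|" separator, and the key shown are our assumptions for illustration; this is not the scheme from Chapter 16, and a real key must be long, random, and kept only on the server.

```python
import hashlib
import hmac

# Assumption: a placeholder key; a real one is random and never leaves the server.
SECRET_KEY = b"replace-with-a-long-random-server-side-secret"

def sign_cookie(value, key=SECRET_KEY):
    """Append a hex HMAC-SHA256 tag computed over the entire cookie value."""
    tag = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return value + "|" + tag

def verify_cookie(cookie, key=SECRET_KEY):
    """Return the original value if the tag checks out, otherwise None."""
    value, sep, tag = cookie.rpartition("|")
    if not sep:
        return None  # no tag present at all
    expected = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8")):
        return value
    return None

if __name__ == "__main__":
    cookie = sign_cookie("customer=4531")
    print(verify_cookie(cookie))                            # accepted
    print(verify_cookie(cookie.replace("4531", "4533")))    # rejected
```

The tag is computed over the raw cookie string exactly as it will be sent, so any edit to any field invalidates it. A signed cookie can still be copied to another machine, which is why the password advice in the tip above continues to apply.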
Both Netscape Navigator and Internet Explorer have options that will allow you to be notified when a cookie is received. Current versions of these programs allow you to accept all cookies, reject all cookies, or be prompted for each cookie whether you wish to accept it or not. Newer versions of these browsers allow you to control cookie acceptance on a site-by-site basis. Netscape 6.0 allows you to delete cookies on a case-by-case basis, as shown in Figure 8-4.
Unfortunately, neither browser will let you disable the sending of cookies that have already been accepted. To do that, you must toss your cookies.
Figure 8-4. Netscape 6.0's Cookie Manager allows cookies to be controlled on a site-by-site basis
There are additional techniques that you can use to block cookies. These techniques work with all browsers, whether they have cookie control or not.
- Under Unix-based systems, users can delete the cookies file and replace it with a link to /dev/null. On Windows systems, the file can be replaced with a zero-length file with permissions set to prevent reading and writing. On a Macintosh you can replace the file with a locked, zero-length file or folder.
- Alternatively, you can simply accept the cookies you wish and then make the cookies file read-only. This will prevent more cookies from being stored inside.
- You can disable cookies entirely by patching the binary executable for your copy of Netscape Navigator or Internet Explorer. Search for the string Set-Cookie and change it to Set-Fookie. It's unlikely that anyone will be sending you any Fookies, so that should be sufficient.
Filter programs, such as AdSubtract, can also give users control over cookies. For further information, see Chapter 10.
In September 2000, the Colorado-based Privacy Foundation released a report on a new technology for monitoring Internet users, which the foundation called web bugs. Although the use of this technology had been widely known in advertising circles, it had not previously been publicized to the larger community of Internet users.
Web bugs are small graphic images placed on web pages or in email messages to facilitate third-party tracking of users and collection of statistics. A typical web bug consists of a 1-pixel-by-1-pixel transparent GIF, making it invisible to the unassisted eye. To see a web bug, you must view the source of an HTML page or email message.
According to the foundation's Web Bug FAQ:
"The word bug is being used to denote a small, eavesdropping device. It is not a euphemism for a programming error.... Rather than the term 'Web bugs,' the Internet advertising community prefers the more sanitized term 'clear GIF.' Web bugs are also known as '1-by-1 GIFs,' 'invisible GIFs,' and 'beacon GIFs.'"
Web Bugs on Web Pages
Here are two web bugs that the Privacy Foundation found on Intuit's home page for Quicken.COM:
width=1 height=1 border=0>
<IMG WIDTH=1 HEIGHT=1 border=0
SRC="http://media.preferences.com/ping?ML_SD=IntuitTE_Intuit_1x1_RunOfSite_Any& db_afcr=4B31-C2FB-10E2C&event=reghome&group=register&time=19126.96.36.199.5 6.37">
The first web bug causes a single 1 X 1 image to be fetched from the Doubleclick advertising server ad.doubleclick.net. This bug alerts Doubleclick to each individual that views the Quicken.COM home page. Doubleclick has built a sophisticated system for monitoring individuals who view Doubleclick's advertisements; this web bug allows Intuit to use Doubleclick's monitoring system without the need to first show a banner advertisement.
The second web bug fetches a 1 X 1 image from the MatchLogic media.preferences.com server. This bug is slightly more interesting in that it apparently sends to MatchLogic a unique user identification, similar to what might be found in a cookie. This web bug might allow Intuit and MatchLogic to knit together their two disparate user databases.
Using two web bugs allows Intuit to compare Doubleclick's tracking and monitoring results with those of MatchLogic. Both, of course, can also be compared with the results that Intuit gets from analyzing its own log files.
Web bugs do not need to be 1 X 1 pixel graphics. Any image or other content that is pulled from a third-party web server can be used by a web site to monitor its users. Mainly, web bugs are a form of outsourced web site monitoring. They impact privacy by introducing a third party into a consumer web site relationship. Potentially, web bugs also allow movements between multiple web sites to be correlated, although the same can be done through banner advertisements or by the sharing of log files.
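Although web bugs need not be 1 X 1 images, the 1 X 1 kind is easy to look for. The following is a rough heuristic sketch using Python's standard html.parser module; the class name, the sample page, and the hostnames are hypothetical, and a bug of any other size or type would slip past it.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class WebBugFinder(HTMLParser):
    """Collect <img> tags that are 1x1 and served from a host other than the page's."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host.lower()
        self.suspects = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)  # the parser lowercases tag and attribute names
        if attributes.get("width") != "1" or attributes.get("height") != "1":
            return
        src = attributes.get("src") or ""
        host = (urlsplit(src).hostname or "").lower()
        if host and host != self.page_host:
            self.suspects.append(src)

def find_web_bugs(html, page_host):
    finder = WebBugFinder(page_host)
    finder.feed(html)
    return finder.suspects

if __name__ == "__main__":
    page = ('<p>Welcome</p>'
            '<IMG WIDTH=1 HEIGHT=1 border=0 SRC="http://tracker.example/ping?id=42">'
            '<img src="/logo.gif" width=120 height=40>')
    print(find_web_bugs(page, "www.shop.example"))
```

Viewing the page source by hand, as described earlier, amounts to the same search.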
Web bugs can be placed on any piece of HTML. For example, the Privacy Foundation created a Yahoo user called webbug2000 and placed a web bug in the fictitious user's Yahoo Profile (see Figure 8-5).
Figure 8-5. A Yahoo profile that was bugged with a web bug by the Privacy Foundation
Guidelines for Using Web Bugs
The Privacy Foundation presented these guidelines on September 13, 2000 at the Global Privacy Summit. A copy can be found on the Privacy Foundation's web site, at http://www.privacyfoundation.org/privacywatch/report.asp?id=40&action=0#guide.
- A web bug should be a visible icon on the screen. A web bug icon can be located anywhere on the page as long as it can be easily spotted.
- The icon should identify the name of the company that placed the web bug on the page. The company name should be incorporated in the web bug icon and be easily read. In addition, the icon should be labeled to say it is a monitoring device. Common terms are "tracker," "spotlight," or "sensor."
- By clicking on the icon, a user should receive a web bug disclosure, including:
  - What data is collected with the web bug
  - How the data is used after it is collected
  - What company or companies receive the data
  - What other data the web bug is combined with
  - If a cookie is associated with the web bug
- Users should be able to "opt-out" from any data collection done by web bugs. The "opt-out" should be made available to users on the web bug disclosure.
- Web bugs should not be used to collect information from web pages dealing with sensitive information. Examples may include pages:
  - Intended for children
  - About medical issues
  - About financial and job matters
  - About sexual matters
Web Bugs in Email Messages and Word Files
Web bugs can be used in HTML email messages to determine whether a person reads an email. When the email message is viewed, the web bug is fetched from the remote server. If each web bug is given a unique identifier and causes a cookie to be downloaded, then an email-based web bug can also be used to determine if an email message is forwarded from one person to another.
TIP: Email-based web bugs are only active if the email message is read with a mail client that can display HTML messages, and even then, only if that computer is connected to the Internet.
Here's an example of two email-based web bugs the Privacy Foundation discovered:
<img width='1' height='1'
src="http://www.m0.net/m/logopen02.asp?vid=3&catid=370153037& email=SMITHS%40tiac.net" alt=" ">
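Decoding the query string of this URL shows exactly what the remote server learns when the message is opened. This is a minimal sketch using Python's standard urllib.parse module; the function name is ours, and we assume the space after the "&" in the listing is a line-wrapping artifact.

```python
from urllib.parse import urlsplit, parse_qs

# The URL from the example above, assuming the space after "&" is an artifact.
BUG_URL = ("http://www.m0.net/m/logopen02.asp"
           "?vid=3&catid=370153037&email=SMITHS%40tiac.net")

def bug_parameters(url):
    """Return the tracking server's host and the decoded query parameters."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.hostname, params

if __name__ == "__main__":
    host, params = bug_parameters(BUG_URL)
    print(host)
    for key, value in params.items():
        print(key, "=", value)
```

The %40 decodes to an at sign, so the recipient's full email address is handed to the server named in the URL each time the image is fetched.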
Web bugs can be placed in HTML Usenet messages to determine how many times a Usenet message is viewed.
Web bugs can also be placed in Microsoft Word files. This is possible because Microsoft Word allows images downloaded from web pages to be pasted directly into Word documents. Each time the Word document is opened, the image is downloaded anew. This in turn allows the web bug to track the usage and the movement of the Word document.
Uses of Web Bugs
According to the Privacy Foundation, companies use web bugs to accomplish the following tasks:
- Gather viewing and usage statistics for a particular page.
- Correlate usage statistics between multiple web sites.
- Profile users of a web site by gender, age, Zip code, and other demographics.
- Transfer personally identifiable information from the web site directly to an Internet marketing company. This transfer would be accomplished with a web bug URL that contains the personal information that the company wishes to transfer.
- Transfer search strings from a search engine to a marketing company.
- Verify the statistics reported by a banner advertising company, to gauge the effectiveness of different banner advertisements.
- Have third-party providers prepare web usage statistics for web sites that do not have the technical capability to prepare their own statistics.
- "Cookie sync," which is for synchronizing personal information in two different databases.
- Check if email messages are actually read and, if they are read, to see if they are forwarded.
- Detect copyright infringement.
In this chapter, we explored a number of techniques and technologies that have been developed for tracking users on the Internet. In the next chapter, we will examine techniques that you can use for protecting your privacy. Then in Chapter 10, we'll look at technologies you can use for fighting back.
2. Samuel Warren and Louis Brandeis, "The Right of Privacy," Harvard Law Review 4 (1890), 193. It's at http://www.lawrence.edu/fac/boardmaw/Privacy_brand_warr2.html. The right to privacy is not without limit. Warren and Brandeis made clear exceptions for the distribution and publication of court records. They also wrote that the right to privacy ceases once facts about an individual are published by that person or with his consent.
3. Westin, Alan. Privacy and Freedom, Atheneum Press, Boston, 1967.
5. This example assumes an even distribution of birthdays throughout the year and people throughout Zip codes, which is a simplification, but not a very big one.
6. The risk of transferring credit card numbers to third-party sites was reduced somewhat in 1997, when Netscape and Microsoft modified their browsers so that the refer link would no longer be passed from an SSL-enabled site to a non-SSL site.
7. Remember, the HTTP URL encoding mechanism converts spaces to plus signs (+).
© 2001, O'Reilly & Associates, Inc.