Web Caching

By Duane Wessels
Price: $39.95 USD
£28.50 GBP



Table of Contents

Chapter 1: Introduction
The term cache has French roots and means, literally, to store. As a data processing term, caching refers to the storage of recently retrieved computer information for future reference. The stored information may or may not be used again, so caches are beneficial only when the cost of storing the information is less than the cost of retrieving or computing the information again.
The concept of caching has found its way into almost every aspect of computing and networking systems. Computer processors have both data and instruction caches. Computer operating systems have buffer caches for disk drives and filesystems. Distributed (networked) filesystems such as NFS and AFS rely heavily on caching for good performance. Internet routers cache recently used routes. The Domain Name System (DNS) servers cache hostname-to-address and other lookups.
Caches work well because of a principle known as locality of reference. There are two flavors of locality: temporal and spatial. Temporal locality means that some pieces of data are more popular than others. CNN's home page is more popular than mine. Within a given period of time, somebody is more likely to request the CNN page than my page. Spatial locality means that requests for certain pieces of data are likely to occur together. A request for the CNN home page is usually followed by requests for all of the page's embedded graphics. Caches use locality of reference to predict future accesses based on previous ones. When the prediction is correct, there is a significant performance improvement. In practice, this technique works so well that we would find computer systems unbearably slow without memory and disk caches. Almost all data processing tasks exhibit locality of reference and therefore benefit from caching.
When requested data is found in the cache, we call it a hit. Similarly, referenced data that is not cached is known as a miss. The performance improvement that a cache provides is based mostly on the difference in service times for cache hits compared to misses. The percentage of all requests that are hits is called the hit ratio.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Web Architecture
Before we can talk more about caching, we need to agree on some terminology. Whenever possible, I use words and meanings taken from Internet standards documents. Unfortunately, colloquial usage of web caching terminology is often just different enough to be confusing.
The fundamental building blocks of the Web (and indeed most distributed systems) are clients and servers. A web server manages and provides access to a set of resources. The resources might be simple text files and images, or something more complex, such as a relational database. Clients, also known as user agents, initiate a transaction by sending a request to a server. The server then processes the request and sends a response back to the client.
On the Web, most transactions are download operations; the client downloads some information from the server. In these cases, the request itself is quite small (about 200 bytes) and contains the name of the resource, plus a small amount of additional information from the client. The information being downloaded is usually an image or text file with an average size of about 10,000 bytes. This asymmetry is what makes cable- and satellite-based Internet services viable: their data rates for receiving are much higher than their data rates for sending, which suits web users, who mostly receive information.
A small percentage of web transactions are more correctly characterized as upload operations. In these cases, requests are relatively large and responses are very small. Examples of uploads include sending an email message and transferring an image file from your computer to a server.
The most common web clients are called browsers. These are applications such as Netscape Navigator and Microsoft Internet Explorer. The purpose of a browser is to render the web content for us to view and interact with. Because of their myriad features, web browsers are very large and complicated programs. In addition to the GUI-based clients, there are a few simple command-line client programs, such as Lynx and Wget.
Web Transport Protocols
Clients and servers use a number of different transport protocols to exchange information. These protocols, built on top of TCP/IP, comprise the majority of all Internet traffic today. The Hypertext Transfer Protocol (HTTP) is the most common because it was designed specifically for the Web. A number of legacy protocols, such as the File Transfer Protocol (FTP) and Gopher, are still in use today. According to Merit's measurements from the NSFNet, HTTP replaced FTP as the dominant protocol in April of 1995. Some newer protocols, such as Secure Sockets Layer (SSL) and the Real-time Transport Protocol (RTP), are increasing in use.
Tim Berners-Lee and others originally designed HTTP to be a simple and lightweight transfer protocol. Since its inception, HTTP has undergone three major revisions. The very first version, retroactively named HTTP/0.9, is extremely simple and almost trivial to implement. At the same time, however, it lacks any real features. The second version, HTTP/1.0 [Berners-Lee, Fielding and Frystyk, 1996], defines a small set of features and still maintains the original goals of being simple and lightweight. However, at a time when the Web was experiencing phenomenal growth, many developers found that HTTP/1.0 did not provide all the functionality they required for new services.
The HTTP Working Group of the Internet Engineering Task Force (IETF) has worked long and hard on the protocol specification for HTTP/1.1. New features in this version include persistent connections, range requests, content negotiation, and improved cache controls. RFC 2616 is the latest standards track document describing HTTP/1.1. Unlike the earlier versions, HTTP/1.1 is a very complicated protocol.
HTTP transactions use a well-defined message structure. A message, which can be either a request or a response, has two parts: the headers and the body. Headers are always present, but the body is optional. Headers are represented as ASCII strings terminated by carriage return and linefeed characters. An empty line indicates the end of headers and the start of the body. Message bodies are treated as binary data. The headers are where we find information and directives relevant to caching.
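As a rough illustration of the message structure just described, the following Python sketch splits a raw message at the first empty line. The function name `split_http_message` and the sample response are made up for this example, and the sketch ignores details such as repeated header names.

```python
def split_http_message(raw: bytes):
    """Split a raw HTTP message into (first_line, headers, body).

    Hypothetical helper for illustration only.
    """
    # An empty line (CRLF CRLF) marks the end of the headers.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    first_line = lines[0]  # request line or status line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    # The body is left as bytes because it is treated as binary data.
    return first_line, headers, body


# Made-up sample response.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
first_line, headers, body = split_http_message(raw)
print(first_line)
print(headers)
print(body)
```

A message without a body simply yields an empty byte string as its third element.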
Why Cache the Web?
The short answer is that caching saves money. It saves time as well, which is sometimes the same thing if you believe that "time is money." But how does caching save you money?
It does so by providing a more efficient mechanism for distributing information on the Web. Consider an example from our physical world: the distribution of books. Specifically, think about how a book gets from publisher to consumer. Publishers print the books and sell them, in large quantities, to wholesale distributors. The distributors, in turn, sell the books in smaller quantities to bookstores. Consumers visit the stores and purchase individual books. On the Internet, web caches are analogous to the bookstores and wholesale distributors.
The analogy is not perfect, of course. Books cost money; web pages (usually) don't. Books are physical objects, whereas web pages are just electronic and magnetic signals. It's difficult to copy a book, but trivial to copy electronic data.
The point is that both caches and bookstores enable efficient distribution of their respective contents. An Internet without caches is like a world without bookstores. Imagine 100,000 residents of Los Angeles each buying one copy of Harry Potter and the Sorcerer's Stone from the publisher in New York. Now imagine 50,000 Internet users in Australia each downloading the Yahoo! home page every time they access it. It's much more efficient to transfer the page once, cache it, and then serve future requests directly from the cache.
In order for caching to be effective, the following conditions must be met:
  • Client requests must exhibit locality of reference.
  • The cost of caching must be less than the cost of direct retrieval.
We can intuitively conclude that the first requirement is true. Certain web sites are very popular. Classic examples are the starting pages for Netscape and Microsoft browsers. Others include searching and indexing sites such as Yahoo! and Altavista. Event-based sites, such as those for the Olympics, NASA's Mars Pathfinder mission, and World Cup Soccer, become extremely popular for days or weeks at a time. Finally, every individual has a few favorite pages that he or she visits on a regular basis.
Why Not Cache the Web?
By now, you may have the impression that web caching is a wonderful solution without any negative side effects. In fact, there are a number of important issues and consequences to understand about web caching. I'll mention some of them here, with a deeper discussion to follow in Chapter 3.
Unlike more tightly coupled systems, it can be difficult for a web cache to guarantee consistency. This means that a cache might return out-of-date information to a user. Why should this be the case? One important factor is that web servers provide only weak hints about freshness. Many responses don't have any hints at all. On-demand validation is the only way to guarantee a cached response is up-to-date. Given the relatively high latencies involved (compared to other systems), validation can take a significant amount of time. Furthermore, the cache may not even be able to reach the server due to a network or server failure. If a validation request fails, the cache doesn't really know if its response is up-to-date or not. Some caching products can be configured to intentionally return stale responses.
If you've ever set up and maintained a web server, you understand how good it feels to watch the access log file and see people visiting your site. Many content providers feel the same way. They want to know exactly who their users are, which pages they view, and how often. Caches complicate their analysis. Requests served as cache hits are not logged at the origin server. Proxies also tend to hide the identity of users. For example, all users behind a caching proxy come from the same IP address. Furthermore, some products also have features to remove or modify HTTP request headers that can otherwise identify individual users.
Copyright has been controversial with respect to caching for quite some time. Some people feel that caches violate an author's right to control the distribution of her work. The possibility of being sued for copyright infringement prevents some people from providing caching services. HTTP does allow content providers to specify if, and how, their information should be handled and distributed by different types of caches. However, the protocol does not address copyright directly.
Types of Web Caches
Web content can be cached at a number of different locations along the path between a client and an origin server. First, many browsers and other user agents have built-in caches. For simplicity, I'll call these browser caches. Next, a caching proxy (a.k.a. "proxy cache") aggregates all of the requests from a group of clients. Lastly, a surrogate can be located in front of an origin server to cache popular responses. In this book, we'll spend more time talking about caching proxies than the others.
Browsers and other user agents benefit from having a built-in cache. When you press the Back button on your browser, it reads the previous page from its cache. Nongraphical agents, such as web crawlers, cache objects as temporary files on disk rather than keeping them in memory.
Netscape Navigator lets you control exactly how much memory and disk space to use for caching, and it also allows you to flush the cache. Microsoft Internet Explorer lets you control the size of your local disk cache, but in a less flexible way. Both have controls for how often cached responses should be validated. People generally use 10–100MB of disk space for their browser cache.
A browser cache is limited to just one user, or at least one user agent. Thus, it gets hits only when the user revisits a page. As we'll see later, browser caches can store "private" responses, but shared caches cannot.
Caching proxies, unlike browser caches, service many different users at once. Since many different users visit the same popular web sites, caching proxies usually have higher hit ratios than browser caches. As the number of users increases, so does the hit ratio [Duska, Marwood and Feeley, 1997].
Caching proxies are essential services for many organizations, including ISPs, corporations, and schools. They usually run on dedicated hardware, which may be an appliance or a general-purpose server, such as a Unix or Windows NT system. Many organizations use inexpensive PC hardware that costs less than $1,000. At the other end of the spectrum, some organizations pay hundreds of thousands of dollars, or more, for high-performance solutions from one of the many caching vendors. We'll talk more about equipment in Chapter 10 and performance in Chapter 12.
Caching Proxy Features
The key feature of a caching proxy is its ability to store responses for later use. This is what saves you time and bandwidth. Caching proxies actually tend to have a wide range of additional features that many organizations find valuable. Most of these are things you can do only with a proxy but which have relatively little to do with caching. For example, if you want to authenticate your users, but don't care about caching, you might use a caching proxy product anyway. I'll introduce some of the features here, with detailed discussions to follow in later chapters of this book.
Authentication
A proxy can require users to authenticate themselves before it serves any requests. This is particularly useful for firewall proxies. When each user has a unique username and password, only authorized individuals can surf the Web from inside your network. Furthermore, it provides a higher quality audit trail in the event of problems.
Request filtering
Caching proxies are often used to filter requests from users. Corporations usually have policies that prohibit employees from viewing pornography at work. To help enforce the policy, the corporate proxy can be configured to deny requests to known pornographic sites. Request filtering is somewhat controversial. Some people equate it with censorship and correctly point out that filtering schemes are not perfect.
Response filtering
In addition to filtering requests, proxies can also filter responses. This usually involves checking the contents of an object as it is being downloaded. A filter that checks for software viruses is a good example. Some organizations use proxies to filter out Java and JavaScript code, even when it is embedded in an HTML file. I've also heard about software that attempts to prevent access to pornography by searching images for a high percentage of flesh-tone pixels.
Meshes, Clusters, and Hierarchies
There are a number of situations where it's beneficial for caching proxies to talk to each other. Different configurations go by different names. A cluster is a tightly coupled collection of caches, usually designed to appear as a single service. That is, even if there are seven systems in a cluster, to the outside world it looks like just one system. The members of a cluster are normally located together, both physically and topologically. As I explain in Chapter 9, many people like cache clusters because they provide scalability and reliability.
A loosely coupled collection of caches is called a hierarchy or mesh. If the arrangement is tree-like, with a clear distinction between upper- and lower-layer nodes, it is called a hierarchy. If the topology is flat or ill-defined, it is called a mesh. A hierarchy of caches makes sense because the Internet itself is hierarchical. However, when a mesh or hierarchy spans multiple organizations, a number of issues arise. We'll talk more about hierarchies in Chapter 7. Then, in Chapter 8, we'll explore the various protocols and techniques that caches use to communicate with each other.
Products
By now you should have a pretty good view of the web caching landscape. In the rest of this book, we'll explore many of the topics of this chapter in much greater detail, so you can fully comprehend all the issues involved. When you finish this book, you'll be able to design and operate a web cache for your environment. You might even think about writing your own software. But since I'm sure most of you have other responsibilities, you'll probably want to use an existing product. Following is a list of caching products that are currently available, many of which are mentioned throughout this book:
Squid
http://www.squid-cache.org
Squid is an open source software package that runs on a wide range of Unix platforms. There has also been some recent success in porting Squid to Windows NT. As with most free software, users receive technical support from a public mailing list. Squid was originally derived from the Harvest project in 1996.
Netscape Proxy Server
http://home.netscape.com/proxy/v3.5/index.html
The Netscape Proxy Server was the first caching proxy product available. The lead developer, Ari Luotonen, also worked extensively on the CERN HTTP server during the Web's formative years in 1993 and 1994. Netscape's Proxy runs on a handful of Unix systems, as well as Windows NT.
Microsoft Internet Security and Acceleration Server
http://www.microsoft.com/isaserver/
Microsoft currently has two caching proxy products available. The older Proxy Server runs on Windows NT, while the newer ISA product requires Windows 2000.
Volera
http://www.volera.com
Volera is a recent spin-off of Novell. The product formerly known as Internet Caching System (ICS) is now called Excelerator. Volera does not sell this product directly. Rather, it is bundled on hardware appliances available from a number of OEM partners.
Chapter 2: How Web Caching Works
What makes an object cachable or uncachable? How does a cache know when a request is a hit or when to revalidate a cached object? This chapter gives you some background on how web caches work. Many of the topics in this chapter are covered definitively in the HTTP/1.1 draft standard document, RFC 2616. The material here tells you what you need to know; for all the gory details, you'll need to consult the actual RFC document.
To start, we'll see what a typical HTTP request looks like and how it's different when talking to a proxy. Next, we'll see how caches decide if they can store a particular response. After that, we'll talk about cache hits, stale objects, and validation techniques. I'll explain how users can force caches to return an up-to-date response. Finally, we'll see what happens when a cache becomes full and must choose to remove some objects.
HTTP Requests
Clients always use HTTP when talking to a caching proxy. This is true even when the client requests an FTP or Gopher URL, as we'll see shortly. A client issues a slightly different request when it knows it is talking to a proxy server rather than to an origin server. Occasionally, requests to a cache are referred to as proxy HTTP requests.
First, let's examine a request sent to an origin server. In this example, the user is requesting the URL http://www.nlanr.net/index.html. When the client is not configured to use a proxy, it connects directly to the origin server (www.nlanr.net) and writes this request:
GET /index.html HTTP/1.1
Host: www.nlanr.net
Accept: */*
Connection: Keep-alive
In reality, the request includes many more headers than are shown here. Note how the URL has been split into two parts. The request line (the first line) includes only the pathname component of the URL, while the hostname part appears later in a Host header. The Host header is an HTTP/1.1 feature, primarily intended to support virtual hosting of multiple logical web sites on one physical server (one IP address). If the origin server is not serving virtual domains, the Host header is redundant. Note that we can rebuild a full URL from the request line and the Host header. This is an important feature of HTTP/1.1, especially for interception proxies.
When a client talks to a proxy, the request is slightly different. The request line of a proxy request includes the full URI:
GET http://www.nlanr.net/index.html HTTP/1.1
Host: www.nlanr.net
Accept: */*
Proxy-connection: Keep-alive
The origin server name is in two places: the full URI and the Host header. This may seem redundant, but when HTTP/1.0 and proxying techniques were invented, the Host header did not exist.
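The relationship between the two request forms can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes plain `http` URLs and a request line with exactly three space-separated fields.

```python
def rebuild_url(request_line: str, host: str, scheme: str = "http") -> str:
    """Rebuild the full URL from a request line and the Host header value."""
    method, target, version = request_line.split(" ")
    if "://" in target:
        # Proxy-style request: the request line already carries the full URI.
        return target
    # Origin-style request: combine the Host header with the pathname.
    return f"{scheme}://{host}{target}"


def to_proxy_request_line(request_line: str, host: str) -> str:
    """Rewrite an origin-style request line into the form sent to a proxy."""
    method, target, version = request_line.split(" ")
    return f"{method} {rebuild_url(request_line, host)} {version}"


print(rebuild_url("GET /index.html HTTP/1.1", "www.nlanr.net"))
print(to_proxy_request_line("GET /index.html HTTP/1.1", "www.nlanr.net"))
```

With the example request from this section, both calls produce the URL http://www.nlanr.net/index.html, the second one embedded in a proxy-style request line.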
HTTP provides for the fact that requests and responses can pass through a number of proxies between a client and origin server. Some HTTP headers are defined as
Is It Cachable?
The primary purpose of a cache is to store some of the responses it receives from origin servers. A response is said to be cachable if it can be used to answer a future request. For typical request streams, about 75% of responses are cachable.
A cache decides if a particular response is cachable by looking at different components of the request and response. In particular, it examines the following:
  • The response status code
  • The request method
  • Response Cache-control directives
  • A response validator
  • Request authentication
These different factors interact in a somewhat complicated manner. For example, some request methods are uncachable unless allowed by a Cache-control directive. Some status codes are cachable by default, but authentication and Cache-control take precedence.
Even though a response is cachable, a cache may choose not to store it. Many products include heuristics—or allow the administrator to define rules—that avoid caching certain responses. Some objects are more valuable than others. An object that gets requested frequently (and results in cache hits) is more valuable than an object that is requested only once. Many dynamic responses fall into the latter category. If the cache can identify worthless responses, it saves resources and increases performance by not caching them.
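To make the interaction of these factors a little more concrete, here is a deliberately simplified Python sketch. The rules in it are an assumption loosely modeled on RFC 2616, not the exact logic of any product: the default status code set, the handling of POST, and the function and parameter names are all chosen for illustration, and the sketch ignores validators and most Cache-control directives.

```python
# Status codes treated as cachable by default in this sketch
# (an assumption based on RFC 2616; 206 is omitted for simplicity).
DEFAULT_CACHABLE_STATUS = {200, 203, 300, 301, 410}


def parse_cache_control(value):
    """Turn 'public, max-age=60' into {'public': None, 'max-age': '60'}."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.strip().lower()] = arg.strip('" ') or None
    return directives


def is_cachable(method, status, cache_control="",
                request_authenticated=False, shared_cache=True):
    """Hypothetical, simplified cachability check for one response."""
    cc = parse_cache_control(cache_control)
    # Cache-control directives that forbid storing take precedence.
    if "no-store" in cc:
        return False
    if shared_cache and "private" in cc:
        return False
    # Responses to authenticated requests need explicit permission
    # before a shared cache may store them.
    if shared_cache and request_authenticated:
        if not any(d in cc for d in ("public", "s-maxage", "must-revalidate")):
            return False
    explicitly_allowed = any(d in cc for d in ("public", "max-age", "s-maxage"))
    if method == "GET":
        return status in DEFAULT_CACHABLE_STATUS or explicitly_allowed
    if method == "POST":
        # Uncachable unless a Cache-control directive allows it.
        return explicitly_allowed
    return False


print(is_cachable("GET", 200))
print(is_cachable("GET", 200, "private"))
print(is_cachable("POST", 200, "max-age=60"))
```

A real cache would layer its own heuristics and administrator rules on top of a check like this, as described above.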
One of the most important factors in determining cachability is the HTTP server response code, or status code. The three-digit status code indicates whether the request was successful or if some kind of error occurred. The status codes are divided into the following five groups:
1xx
An informational, intermediate status. The transaction is still being processed.
2xx
Hits, Misses, and Freshness
When a cache receives a request, it checks to see if the response has already been cached. If not, we say the request is a cache miss, and the request is forwarded on to the origin server. Cache misses occur for objects that have never been requested previously, objects that are not cachable, or objects that have been deleted to make room for new ones. It's common for 50–70% of all requests to be cache misses.
If the object is present, then we might have a cache hit. However, the cache must first decide if the stored response is fresh or stale. A cached response is fresh if its expiration time has not been reached yet; otherwise, it's stale. Fresh responses are best because they are given to the client immediately. They experience no latency and consume no bandwidth to the origin server. I'll call them unvalidated hits. Stale responses, on the other hand, require validation with the origin server.
The purpose of a validation request is to ask the origin server if the cached response is still valid. If the resource has changed, we don't want the client to receive a stale response. HTTP also calls these conditional requests. The reply to a conditional request is either a small "Not Modified" message or a whole new response. The Not Modified reply, also known as a validated hit, is preferable because it means the client can receive the cached response, which saves on bandwidth. A validated miss, where the origin server sends an updated response, is really equivalent to a regular cache miss. We'll talk more about validation in Section 2.5.
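The four outcomes described so far (miss, unvalidated hit, validated hit, and validated miss) can be summarized in a short Python sketch. The `CachedResponse` class and the `origin_says_unchanged` callback are hypothetical stand-ins; in a real cache the callback would be a conditional request to the origin server.

```python
from dataclasses import dataclass


@dataclass
class CachedResponse:
    body: bytes
    expires_at: float  # expiration time, in seconds since the epoch


def classify_request(cache, url, now, origin_says_unchanged):
    """Return 'miss', 'unvalidated hit', 'validated hit', or 'validated miss'."""
    entry = cache.get(url)
    if entry is None:
        # Never requested, not cachable, or already removed from the cache.
        return "miss"
    if now < entry.expires_at:
        # Fresh: served immediately without contacting the origin server.
        return "unvalidated hit"
    # Stale: ask the origin server whether the cached copy is still valid.
    if origin_says_unchanged(url):
        return "validated hit"
    return "validated miss"


cache = {"http://example.com/a": CachedResponse(b"data", expires_at=1000.0)}
print(classify_request(cache, "http://example.com/a", 500.0, lambda u: True))
print(classify_request(cache, "http://example.com/a", 2000.0, lambda u: True))
print(classify_request(cache, "http://example.com/a", 2000.0, lambda u: False))
print(classify_request(cache, "http://example.com/b", 500.0, lambda u: True))
```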
How does the cache know whether an object is fresh or stale? HTTP/1.1 provides two ways for servers to specify the freshness lifetime of a response: the Expires header and the max-age cache control directive. The Expires header has been in use since HTTP/1.0. Its value is the date and time at which a response becomes stale, for example:
Expires: Mon, 19 Feb 2001 01:46:17 GMT
Application developers find dates such as this awkward for a number of reasons. The format is difficult to parse and prone to slight variations (e.g.,
Hit Ratios
How can we measure the effectiveness of our caches? One such measurement is called the cache hit ratio. This is the percentage of requests that are satisfied as cache hits. Usually, this includes both validated and unvalidated hits. Validated hits can be tricky because these requests are forwarded to origin servers, incurring slight bandwidth and latency penalties. Note that the cache hit ratio tells you only how many requests are hits—it doesn't tell you how much bandwidth or latency has been saved.
The measurement that does tell you about bandwidth is called the byte hit ratio. Instead of counting only requests, this measure is based on the number of bytes transferred. Cache hits for large objects contribute more to the byte hit ratio than do hits for small objects. The byte hit ratio measures how much bandwidth your cache has saved, but there are different ways to calculate it.
One way is to compare the sum of object sizes for cache hits and cache misses. However, this technique has a couple of shortcomings. For example, it doesn't include request traffic. Counting the request traffic probably doesn't matter much because it's relatively small, and most of the data flows in the other direction (into your network, not out of it). This technique might not count the small 304 (Not Modified) responses either. However, the bigger problem is in accounting for requests aborted by the user. Consider a cache miss for a 100KB object. If the cache downloads the entire response, but the user aborts the transfer at 50KB, we used more bandwidth on the server side than on the client side. If we instead count bytes transferred on the network, we'll get a more accurate figure for bandwidth savings.
What sort of cache hit ratio values can you expect to achieve? Any reasonably-sized cache should be able to reach 30%. Some of the largest and busiest caches deployed today can make it as high as 70%. Byte hit ratios are normally less than cache hit ratios, often by as much as 10 percentage points. That is, a 50% cache hit ratio usually corresponds to a 40% byte hit ratio. We'll see the reason for this in Section A.1. Small objects, such as images and HTML pages, tend to have more cache hits than large objects such as audio and PostScript files.
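The two measurements can be computed side by side from a request log. The Python sketch below uses the simple object-size method (summing sizes for hits and misses), so it shares the shortcomings of that method; the function name and the four-entry log are invented for illustration.

```python
def hit_ratios(log):
    """Compute (cache hit ratio, byte hit ratio) from (is_hit, size) pairs."""
    requests = hits = total_bytes = hit_bytes = 0
    for is_hit, size in log:
        requests += 1
        total_bytes += size
        if is_hit:
            hits += 1
            hit_bytes += size
    hit_ratio = hits / requests if requests else 0.0
    byte_hit_ratio = hit_bytes / total_bytes if total_bytes else 0.0
    return hit_ratio, byte_hit_ratio


# Made-up log: two small hits and two larger misses.
log = [
    (True, 2000),
    (True, 3000),
    (False, 10000),
    (False, 5000),
]
print(hit_ratios(log))  # (0.5, 0.25)
```

In this invented log, half of the requests are hits, but because the hits are for the smaller objects, only a quarter of the bytes come from the cache.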
Validation
I've already discussed cache validation in the context of cache hits versus misses. Upon receiving a request for a cached object, the cache may want to validate the object with the origin server. If the cached object is still valid, the server replies with a short HTTP 304 (Not Modified) message. Otherwise, the entire new object is sent. HTTP/1.1 provides two validation mechanisms: last-modified timestamps and entity tags.
Under HTTP/1.0, timestamps are the only type of validator. Even though HTTP/1.1 provides a new technique, last-modified timestamps remain in widespread use. Most HTTP responses include a Last-modified header that specifies the time when the resource was last changed on the origin server. The Last-modified timestamp is given in Greenwich Mean Time (GMT) with one-second resolution, for example:
HTTP/1.1 200 OK
Date: Sun, 04 Mar 2001 03:57:45 GMT
Last-Modified: Fri, 02 Mar 2001 04:09:20 GMT
For objects that correspond to regular files on the origin server, this timestamp is the filesystem modification time.
When a cache validates an object, this same timestamp is sent in the If-modified-since header of a conditional GET request.
GET http://www.ircache.net/ HTTP/1.1
If-Modified-Since: Wed, 14 Feb 2001 15:35:26 GMT
If the server's response is 304 (Not Modified), the cache's object is still valid. In this case, the cache must update the object to reflect any new HTTP response header values, such as Date and Expires. If the server's response is not 304, the cache treats the server's response as new content, replaces the cached object, and delivers it to the client.
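The validation exchange can be sketched in Python as follows. This is a minimal illustration, not the behavior of any particular cache: the function names and the dictionary layout of a cached object are assumptions, header names are assumed to be spelled consistently, and a Host header is added because HTTP/1.1 requests require one.

```python
from urllib.parse import urlsplit


def build_validation_request(url, last_modified):
    """Build a conditional GET that carries the cached Last-Modified value."""
    host = urlsplit(url).netloc
    return (f"GET {url} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"If-Modified-Since: {last_modified}\r\n"
            "\r\n")


def apply_validation_response(cached, status, headers, body=b""):
    """Return the object to deliver after the origin server has replied.

    `cached` is a dict with 'headers' (a dict) and 'body' (bytes).
    """
    if status == 304:
        # Still valid: keep the body, refresh headers such as Date and Expires.
        cached["headers"].update(headers)
        return cached
    # Anything else is treated as new content that replaces the old object.
    return {"headers": dict(headers), "body": body}


cached = {
    "headers": {"Date": "Wed, 14 Feb 2001 15:35:26 GMT",
                "Last-Modified": "Wed, 14 Feb 2001 15:35:26 GMT"},
    "body": b"<html>old</html>",
}
print(build_validation_request("http://www.ircache.net/",
                               cached["headers"]["Last-Modified"]))
result = apply_validation_response(
    cached, 304, {"Date": "Sun, 04 Mar 2001 03:57:45 GMT"})
print(result["headers"]["Date"])
```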
The use of timestamps as validators has a number of undesirable consequences:
  • A file's timestamp might get updated without any change in the actual content of the file. Consider, for example, moving your entire origin server document tree from one disk partition to another. Depending on the method you use to copy the files, the modification times may not be preserved. Then, any
Additional content appearing in this section has been removed.
Forcing a Cache to Refresh
One of the tradeoffs of caching is that you may occasionally receive stale data. What can you do if you believe (or know) that a cache has given you stale data? You need some way to refresh or validate the data received from the cache. HTTP provides a couple of mechanisms for doing just that. Clients can generate requests with Cache-control directives, the two most common of which are no-cache and max-age. We'll discuss no-cache first because it has been around the longest.
The no-cache directive notifies a cache that it cannot return a cached copy. Even if a fresh copy of the response (one with a specific expiration time) is in the cache, the client's request must be forwarded to the origin server. RFC 2616 calls such a request an "end-to-end reload" (Section 14.9.4). The no-cache directive is sent when you click on the Reload button in your browser. In an HTTP request, it looks like this:
GET /index.html HTTP/1.1
Cache-control: no-cache
Recall that the Cache-control header does not exist in the HTTP/1.0 standard. Instead, HTTP/1.0 clients use a Pragma header for the no-cache directive:
Pragma: no-cache
no-cache is the only directive defined for the Pragma header in RFC 1945. For backwards compatibility, RFC 2616 also defines the Pragma header. In fact, many of the recent HTTP/1.1 browsers still use Pragma for the no-cache directive instead of the newer Cache-control.
Note that the no-cache directive does not necessarily require the cache to purge its copy of the object. The client may generate a conditional request (with If-modified-since or another validator), in which case the origin server's response may be 304 (Not Modified). If, however, the server responds with 200 (OK), then the cache replaces the old object with the new one.
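As a rough illustration of how a cache might recognize such a request, the following Python sketch checks both the HTTP/1.1 and HTTP/1.0 forms of the directive. The function name is hypothetical, and the parsing is deliberately simplified: it only splits directive lists on commas and ignores quoted values.

```python
def client_forbids_cached_copy(request_headers):
    """True if the request carries no-cache in Cache-control or Pragma."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    for name in ("cache-control", "pragma"):
        directives = [d.strip().lower() for d in lowered.get(name, "").split(",")]
        if "no-cache" in directives:
            return True
    return False


print(client_forbids_cached_copy({"Cache-control": "no-cache"}))  # HTTP/1.1 form
print(client_forbids_cached_copy({"Pragma": "no-cache"}))         # HTTP/1.0 form
```

When this returns true, the cache forwards the request instead of answering from its own copy, but it may still keep that copy until the origin server's reply arrives.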
The interaction between no-cache and If-modified-since is tricky and often the source of some confusion. Consider, for example, the following sequence of events:
Additional content appearing in this section has been removed.
Cache Replacement
Cache replacement refers to the process that takes place when the cache becomes full and old objects must be removed to make space for new ones. Usually, a cache assigns some kind of value to each object and removes the least valuable ones. The actual meaning of "valuable" may vary from one cache to another. Typically, an object's value is related to the probability that it will be requested again, thus attempting to maximize the hit ratio. Caching researchers and developers have proposed and evaluated numerous replacement algorithms, some of which are described here.
Least recently used (LRU) is certainly the most popular replacement algorithm used by web caches. The algorithm is quite simple to understand and implement, and it gives very good performance in almost all situations. As the name implies, LRU removes the objects that have not been accessed for the longest time. This algorithm can be implemented with a simple list. Every time an object is accessed, it is moved to the top of the list. The least recently used objects then automatically migrate to the bottom of the list.
A strict interpretation of LRU would consider time-since-reference as the only parameter. In practice, web caches almost always use a variant known as LRU-Threshold, where "threshold" refers to object size. Objects larger than the threshold size are simply not cached. This prevents one very large object from ejecting many smaller ones. This highlights the biggest problem with LRU: it doesn't consider object sizes. Would you rather have one large object in your cache or many smaller ones? Your answer probably depends on what you wish to optimize. If saving bandwidth is important, you want the large object. However, caching numerous small objects results in a higher hit ratio.
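A minimal Python sketch of LRU-Threshold might look like the following. The class name, the byte-based capacity accounting, and the use of an ordered dictionary in place of a linked list are assumptions made for illustration, not a description of any particular product.

```python
from collections import OrderedDict


class LRUThresholdCache:
    def __init__(self, capacity_bytes, threshold_bytes):
        self.capacity_bytes = capacity_bytes
        self.threshold_bytes = threshold_bytes
        self.used_bytes = 0
        self._objects = OrderedDict()  # url -> body; most recently used last

    def get(self, url):
        body = self._objects.get(url)
        if body is not None:
            # An access moves the object to the "top" of the list.
            self._objects.move_to_end(url)
        return body

    def put(self, url, body):
        # Objects above the threshold (or larger than the cache) are not cached.
        if len(body) > min(self.threshold_bytes, self.capacity_bytes):
            return False
        if url in self._objects:
            self.used_bytes -= len(self._objects.pop(url))
        # Evict least recently used objects until the new one fits.
        while self._objects and self.used_bytes + len(body) > self.capacity_bytes:
            _, evicted = self._objects.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._objects[url] = body
        self.used_bytes += len(body)
        return True
```

The size check in put is the only place where object size matters; once an object is admitted, eviction order depends solely on how recently it was accessed.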
A first-in, first-out (FIFO) replacement algorithm is even simpler to implement than LRU. Objects are purged in the same order they were added. This technique does not account for object popularity and gives lower hit ratios than LRU. FIFO is rarely, if ever, used for caching proxies.
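For comparison, a FIFO cache can be sketched in a few lines of Python. The class name and the limit expressed as an object count (assumed to be at least one) are illustrative assumptions.

```python
from collections import OrderedDict


class FIFOCache:
    def __init__(self, max_objects):
        self.max_objects = max_objects
        self._objects = OrderedDict()  # insertion order is the eviction order

    def get(self, url):
        # Unlike LRU, a hit does not change the object's position.
        return self._objects.get(url)

    def put(self, url, body):
        if url not in self._objects and len(self._objects) >= self.max_objects:
            # Purge the object that was added first.
            self._objects.popitem(last=False)
        self._objects[url] = body


cache = FIFOCache(max_objects=2)
cache.put("a", b"1")
cache.put("b", b"2")
cache.get("a")        # a popular object gets no credit for being accessed
cache.put("c", b"3")  # so "a" is still the first one purged
```

The usage lines show why FIFO tends to give lower hit ratios: the object just requested is evicted anyway because it happens to be the oldest.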
Additional content appearing in this section has been removed.
Chapter 3: Politics of Web Caching
In this chapter, we will explore some important and difficult-to-solve issues that surround web caching. The issues pertain to many aspects of caching but are primarily targeted at those of us who operate a caching proxy. For example, as an administrator, you have access to information that, if made available to others, can seriously violate the privacy of your users. If your users cannot trust you to protect their privacy, they will not want to use your cache. Hopefully, both you and your users will perceive the cache as something that protects, rather than violates, their privacy.
A particularly thorny issue with web caching involves the rights of content providers to control the copying and distribution of their works. Some people argue that existing copyright laws cannot be applied to the Internet, but most people look for ways to coerce the two together. By some interpretations, web caches are in gross violation of copyright laws. Various rulings by U.S. courts seem to support this view, although none of them specifically address web caching. Similar issues surround so-called offensive material and the liability of system operators whose facilities are used for its transmission.
Other issues explored here include dynamic pages, content integrity, and cache busting. When properly generated by origin servers, dynamic pages do not present any problems for web caches. Unfortunately, the problem of ensuring content integrity is not as easy to dismiss. Without a general-purpose digital signature framework, web users are forced to trust that they receive the correct content from both proxy and origin servers.
Something that makes caching politically interesting is the fact that many different organizations are involved in even the simplest web transactions. At one end is the user, and at the other is a content provider. In between are various types of Internet service providers. Not surprisingly, different organizations have different goals, which may conflict with one another. Some users prefer to be anonymous, but some content providers want to collect a lot of personal information about their customers or visitors. Some ISPs want to sell more bandwidth, and others want to minimize their bandwidth costs.
Additional content appearing in this section has been removed.
Privacy
"Privacy is the power to control what others can come to know about you" [Lessig, 1999, p. 143]. In the U.S., most people feel they have a right to privacy. Even though the word does not occur in our Constitution, the Fourth Amendment comes close when talking about "the right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures..." In at least one famous case, the Supreme Court ruled that this amendment does provide for an individual's privacy. Also of relevance is Article 12 of the United Nations Universal Declaration of Human Rights, which states:
No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honor and reputation. Everyone has the right to the protection of the law against such interference or attacks.
Privacy is a very important issue on the Internet as a whole and the Web in particular. Almost everywhere we go in cyberspace, we leave behind a little record of our visit. Today's computer and networking technology makes it almost trivial for information providers to amass huge amounts of data about their audience. As users, you and I might have different feelings about the importance of privacy. As cache operators, however, we have a responsibility always to protect the privacy of our cache users.
Privacy concerns are found in almost every aspect of our daily lives, not just while surfing the Net. My telephone company certainly knows which phone numbers I have dialed. The video store where I rent movies knows what kind of movies I like. My bank and credit card company know where I spend my money. Surveillance cameras are commonplace in stores, offices, and even some outdoor, public places.
In the United States, a consumer's privacy is protected by federal laws on a case-by-case basis. Video stores are not allowed to disclose an individual's rental and sales records without that individual's consent or a court order. However, the law does allow video stores to use detailed personal information for marketing purposes if the consumer is given an opportunity to opt out. Similarly, telephone companies must protect their customers' private information, including call records. There are no federal laws, however, that address consumer privacy in the banking industry. In fact, under the Bank Secrecy Act, banks must report suspicious transactions, such as large deposits, to federal agencies. This reporting requirement is intended to aid in the tracking of money laundering, drug trafficking, and other criminal activities. Banks may be subject to state privacy laws, but for the most part, they are self-regulating in this regard.
Additional content appearing in this section has been removed.
Request Blocking
Request blocking refers to the act of denying certain requests based on some part of the request itself (usually the URL). Like it or not, a fair amount of the content available on the Web is generally considered to be offensive, the most obvious example being pornography. Some organizations that connect to the Internet feel it necessary to prevent their users from accessing these sites. A web cache is a logical place to implement per-request blocking.
The issues surrounding request blocking fall mostly into the political realm. Furthermore, these issues are not new or unique to the Web. Just as many companies say employees should not make phone-sex calls while at work, they also say workers should not view pornographic web sites. Similarly, a parent might say that children should not have easy access to sexually explicit material, whether in the form of a magazine, video, or web site. It is a policy decision, for employers and parents, whether and to what extent request blocking should be enabled. Classifying material into offensive or inoffensive categories is a political and ideological issue and far beyond the scope of this book. Even if the classification is not controversial, it is unlikely that a particular technique or implementation is perfect. Some legitimate sites may be incorrectly blocked. Similarly, sites that should be blocked may still be allowed through.
Several request-blocking products and services are available. Some of these are "plug-ins" for web cache products; others are full proxy implementations that can be used alone or in serial with an existing web cache. The companies offering these products also provide a list of sites (or URLs) to be blocked. Usually these products require a subscription fee to receive list updates. However, some allow new sites to be added manually. A typical blocking list probably includes 100,000 or more entries.
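As a rough sketch of URL-based blocking, the Python function below checks a request's hostname, and each of its parent domains, against a list of blocked hosts. The function name and the blocked.example entry are hypothetical; real products typically match on more than the hostname and use far larger lists.

```python
from urllib.parse import urlsplit

# Hypothetical blocking list; a real list would be much larger.
BLOCKED_HOSTS = {"blocked.example"}


def is_blocked(url, blocked_hosts=BLOCKED_HOSTS):
    """True if the URL's host, or any parent domain of it, is on the list."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check "www.blocked.example", then "blocked.example", then "example".
    return any(".".join(labels[i:]) in blocked_hosts for i in range(len(labels)))


print(is_blocked("http://www.blocked.example/page.html"))
print(is_blocked("http://www.ircache.net/"))
```

Matching on whole domain labels avoids blocking an unrelated host whose name merely ends with the same characters, which is one small way to reduce the number of legitimate sites blocked by mistake.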
The World Wide Web Consortium (http://www.w3c.org) has developed a content labeling scheme known as the Platform for Internet Content Selection (PICS). PICS is simply a standard way to label web pages rather than rate them. In other words, PICS specifies the structure of a label, not what to put inside it. However, PICS is often associated with content filtering, because that was one of the primary reasons for its development. A PICS-aware web cache can filter out requests based on one or more rating schemes.
Additional content appearing in this section has been removed.
Copyright
Copyright laws give authors or creators certain rights regarding the copying and distribution of their original works. These laws are intended to encourage people to share their creative works without fear that they will not receive due credit or remuneration. Copyrights are recognized internationally through various treaties (e.g., the Berne Convention and the Universal Copyright Convention). This helps our discussion somewhat, because the Internet tends to ignore political boundaries.
Digital computers and the Internet challenge our traditional thinking about copyrights. Before computers, we only had to worry about making copies of physical objects such as books, paintings, and records. The U.S. copyright statute defines a copy as follows:
Copies are material objects…in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.
When you memorize a poem, thereby making a copy in your brain, you have not violated a copyright law. Tests for physicality are difficult to apply to computer systems where information exists only as electrostatic or magnetic charges representing ones and zeroes.
Copying is a fundamental characteristic of the Internet. An Internet without copying is like a pizza without cheese—what would be the point? People like the Internet because it lets them share information with each other. Email, newsgroups, web pages, chat rooms: all require copying information from one place to another. The Internet also challenges traditional copyrights in another interesting way. Revenue is often the primary reason we establish and enforce copyrights. I don't want you to copy this book and give it to someone else because I get a bigger royalty check if your friend buys his own copy. On the Internet, however, paying for information is the exception rather than the rule. Some sites require subscriptions, but most do not, and a lot of web content is available for free.
Additional content appearing in this section has been removed.
Offensive Content
As I mentioned earlier, pornography and other potentially offensive material make up a noticeable proportion of web content. Most national (and some local) governments have laws that address the selling or transportation of pornography. Could you, as a cache operator, be liable because your cache stores or distributes such material? Only one thing is certain: there are no simple answers.
In 1996, the U.S. government passed the Communications Decency Act (CDA), which attempted to criminalize the sending of pornography and other obscene material over "telecommunications facilities." Furthermore, it sought to make liable anyone who:
knowingly permits any telecommunications facility under [his] control to be used for any activity prohibited [above] with the intent that it be used for such activity…
This might not be as bad as it initially sounds, especially given the use of the words "knowingly" and "intent." Even better, the law also seems to provide an exemption for some providers:
No person shall be held [liable] solely for providing access or connection to or from a facility, system, or network not under that person's control, including transmission, downloading, intermediate storage, access software, or other related capabilities that are incidental to providing such access or connection that does not include the creation of the content of the communication.
This is good news for those of us who operate caches. The presence of the phrase "intermediate storage" is particularly comforting.
However, most of the provisions of the CDA were struck down as unconstitutional by the U.S. Supreme Court in 1997. The CDA was strongly opposed by groups such as the Electronic Frontier Foundation and the American Civil Liberties Union because it violates some fundamental rights (such as freedom of speech) granted by the Bill of Rights.
Admittedly, the discussion here has been very U.S.-centric. Laws of other countries are not discussed here, except to note that the Internet is generally not subject to geopolitical boundaries. The application of local and national laws to a network that connects millions of computers throughout the world is likely to be problematic.
Additional content appearing in this section has been removed.
Dynamic Web Pages
Many people worry that caches do not properly deal with what they call "dynamic pages." Such pages are considered dynamic because the content might be different for every request. Time-sensitive information, such as stock prices and weather reports, logically falls into the category of dynamic pages. Pages that have been customized for the user, to include his name or targeted advertisements, are dynamic as well.
Historically, dynamic pages have been problematic because some web caching software had relatively aggressive caching and refresh policies. Dynamic pages were cached and returned as cache hits when perhaps they should not have been. Part of the blame lies with the early descriptions and implementations of HTTP. The HTTP/1.0 RFC [Berners-Lee, Fielding and Frystyk, 1996] was not published until May of 1996, while active development on web caching had been ongoing since early 1994. Without a stable protocol description, implementors are certainly prone to make some mistakes. Even when HTTP/1.0 became official, it still lacked a good description of what can and cannot be cached. Section 1.3 of RFC 1945 states:
Some HTTP/1.0 applications use heuristics to describe what is or is not a "cachable" response, but these rules are not standardized.
Some of the blame lies with the origin servers, however. Even though HTTP/1.1 has more caching features than HTTP/1.0, the older protocol does have enough functionality to prevent a compliant proxy from returning hits on dynamic pages. Unfortunately, confusion arises when an origin server's response leaves out some headers. For example, consider a reply such as this:
HTTP/1.0 200 OK
Server: MasterBlaster/1.6.9
Date: Mon, 15 Jan 2001 23:01:43 GMT
Content-Type: text/html
How should a cache interpret this? There is no Last-modified date, so the cache has no idea how old the resource is. Assuming the server's clock is correct, we know how old the response is, but not the resource. There is no Expires header either, so perhaps the cache can apply local heuristics. Should a cache be allowed to store this reply? Can it return this as a cache hit? In the absence of rigid standards, some applications might consider this cachable.
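To see how little a cache has to go on here, consider this small Python sketch, which reports what freshness information a reply offers. The function name and the three labels are invented for illustration; how a cache acts on the result is a local policy decision, not something HTTP/1.0 specifies.

```python
def freshness_basis(response_headers):
    """Classify what a cache could base a freshness decision on."""
    names = {name.lower() for name in response_headers}
    if "expires" in names:
        return "explicit"   # the origin server stated an expiration time
    if "last-modified" in names:
        return "heuristic"  # the cache can estimate from the resource's age
    return "none"           # nothing to go on except local policy


reply = {
    "Server": "MasterBlaster/1.6.9",
    "Date": "Mon, 15 Jan 2001 23:01:43 GMT",
    "Content-Type": "text/html",
}
print(freshness_basis(reply))
```

For the reply shown above the answer is "none". A cautious cache might decline to store such a reply or validate it on every request, while an aggressive one might apply a default lifetime, which is exactly how dynamic pages end up being served as hits.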
Additional content appearing in this section has been removed.
Content Integrity
Can you trust the information you receive from a cache? How do you know it has not been modified? How do you know it is what the origin server intends for you to see?
This is an extremely difficult problem, with no known solutions at this time. TCP does not currently provide any form of end-to-end security, which means this problem is not specific to HTTP or the Web. The Transport Layer Security protocol (TLS, formerly Secure Sockets Layer) does provide end-to-end security on top of the network transport protocols. TLS protocols [Dierks and Allen, 1999] are designed to prevent eavesdropping, tampering, and message forgery. However, the security provided by TLS is in effect only for the duration of the data transfer. It does not guarantee, especially for cache hits, that the object you receive has not been modified since the origin server generated it. Unfortunately, we do not have a general-purpose digital signature scheme for web objects. Even if such a thing did exist, to be of any real value it would require out-of-band communication for the key exchange. In other words, it would be pointless to retrieve signing keys from the cache.
Recent security features being added to DNS [Eastlake, 1999] might be able to support a scheme for authenticating web objects. For example, let's say you request the URL http://www.monkeybrains.net/index.html. The response is an HTML page that includes, in comments, a digital signature. To validate the signature, you need the public key of the author or owner. Such keys can be entered into a DNS zone. Continuing with our example, we query the DNS for an http.www.monkeybrains.net KEY record. The returned key (if any) and the signature are enough to prove that the HTML page is authentic.
To date, I am not aware of any caches that have been broken into and had cache content modified. However, on numerous occasions, origin server security has been compromised, and the perpetrators have replaced the normal home page content with something else. Usually these pranks are short-lived and not a real problem. If the bogus pages make it into web caches, though, some users could receive the wrong content even after the origin server has been restored.
Additional content appearing in this section has been removed.
Cache Busting and Server Busting
Cache busting is a technique that content providers use to prevent their pages from being served as hits from caches. Often this means making every response uncachable. This issue is difficult to assess for a number of reasons. First and foremost, owners and publishers have legal rights to control the distribution of their information. Whether or not we agree with their decision to defeat caching, the choice is theirs to make. Usually, their reasons are unknown to us, but the reasons might include the copyright issues discussed previously or the desire to increase the number and accuracy of their hit counts. Second, some content providers, by the very nature of their business, might serve only uncachable content. We should not be surprised to find that sites that exist only to count advertisement impressions do not allow their responses to be cached. Issues relating to advertising are explored further in the next section.
How would someone be able to claim that an origin server is cache busting? Cache users and administrators sometimes expect certain types of objects to be cachable by default. When users visit a page for the first time and then access it again a short while later, they expect the page to load very quickly because it should be in the cache. When the page loads slowly, they wonder why. If a user is curious and savvy enough, she might find a way to examine the reply headers firsthand. With access to the cache log files, administrators can easily analyze them and generate reports including hit ratios for individual origin servers. Those servers that give a lower-than-average number of hits, or no hits at all, might be suspects for cache busting.
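A per-server report of this kind might be sketched in Python as follows. The input format, a sequence of (URL, was_hit) pairs, and the function name are assumptions for illustration; a real report would first parse whatever access log format the cache writes.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def hit_ratio_by_server(log_entries):
    """Map each origin server hostname to its cache hit ratio."""
    counts = defaultdict(lambda: [0, 0])  # host -> [hits, requests]
    for url, was_hit in log_entries:
        host = urlsplit(url).hostname or "unknown"
        counts[host][1] += 1
        if was_hit:
            counts[host][0] += 1
    return {host: hits / total for host, (hits, total) in counts.items()}


# Hypothetical log entries: (requested URL, served as a cache hit?)
entries = [
    ("http://www.ircache.net/", True),
    ("http://www.ircache.net/logo.gif", True),
    ("http://www.ircache.net/news.html", False),
    ("http://ads.example/banner.gif", False),
    ("http://ads.example/banner.gif", False),
]
report = hit_ratio_by_server(entries)
suspects = sorted(host for host, ratio in report.items() if ratio == 0.0)
print(report, suspects)
```

A server with a ratio of zero is only a suspect. As noted above, some sites legitimately serve nothing but uncachable content.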
On the other side of this issue are the Internet service providers who pay high, or perhaps metered, tariffs for their bandwidth. They turn to caching as a way to save money. In a sense, cache-busting web servers represent additional costs for the ISP. When Internet charges are usage-based, rather than flat-rate, people often feel they have purchased information when they download it, and they should not have to pay to download it again.
Additional content appearing in this section has been removed.
Advertising