Everything2

Making your web site more cache friendly

created by AT

(idea) by AT, Mon Sep 23 2002 at 19:20:25

Caching in a Nutshell

Since the beginning of the web, browsers have performed tricks to get as much performance out of the web as possible. One of the most effective tricks browsers have is caching web content. It works like this: the first time you browse to a given site, your browser downloads all the HTML, GIF images, JPEG images, etc. and displays them. But it also stores them for later, so the next time you visit that site, it can save the download time by loading them from your disk rather than downloading them again from the internet.

In addition, some organizations run caching web proxies. These programs act as a middle man to the web. Each browser asks the proxy for all of its web pages instead of downloading them directly from the internet. Just like the browsers, the web proxy stores all the files downloaded from the web, and when a request comes in for something it has already seen, it sends it back to the browser without reloading it from the internet. By reducing the amount of duplicate information downloaded, the organization saves network bandwidth, and the web pages generally load faster for the users (it also provides the organization with a convenient method to monitor what its members are browsing, but that's a topic for another writeup).
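
The store-and-reuse idea behind both browser caches and caching proxies can be sketched in a few lines of Python. This is only a toy model: the names SimpleCache and fake_origin are made up for illustration, the "download" is a stand-in function rather than a real network request, and the cache never expires anything, which is exactly the weakness the next section discusses.

```python
from typing import Callable, Dict


class SimpleCache:
    """Toy cache: remembers every body it has fetched, keyed by URL."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.origin_requests = 0  # how many times we had to go to the origin

    def get(self, url: str) -> bytes:
        # Only go to the origin the first time a URL is seen.
        if url not in self._store:
            self.origin_requests += 1
            self._store[url] = self._fetch(url)
        return self._store[url]


def fake_origin(url: str) -> bytes:
    # Hypothetical stand-in for a real download over the internet.
    return ("content of " + url).encode("ascii")
```

Asking a SimpleCache(fake_origin) for the same URL twice returns the same bytes both times, but origin_requests only goes up once.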

The problem of dynamic content

For the above scheme to work correctly, unfortunately, the original web pages that have been copied into caches must never change. If they do, then browsers and proxies will end up serving content that is out of date.

Web documents generally fall into one of four categories:

  1. those that will be different each time they are loaded (example: a page generated dynamically by a web application);
  2. those that change very often on a regular basis (example: the main page of a news site);
  3. those that change occasionally on an irregular basis (example: a personal homepage or an instruction manual);
  4. and finally those that never change (example: a company logo image, an archived article).

Each individual piece of the page can be in its own category: for example, a search page that always changes could contain a logo image that never changes. Some of these categories are great candidates for caching, like the static image. Some are terrible, like the dynamically generated web page. Unfortunately, it's not always easy for the browser/proxy to know which is which.

Cache control

If the web server doesn't say anything about caching a particular page or image, the browser/proxy generally makes an educated guess based on a few rules. First, if the page uses HTTP authentication or SSL, it will not be cached, for security reasons. Also, if the server provides the date and time the page was last modified with the Last-Modified header (which most web servers automatically do), the browser/proxy will use a cached version if the page hasn't been modified for a while. Additionally, browsers are less likely to cache the results of a form submission or a URL that has parameters.
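
To make the "hasn't been modified for a while" guess concrete, here is a minimal Python sketch of one common heuristic: treat the page as fresh for a fraction of the time that had passed since it was last modified. The one-tenth fraction is an assumption (the HTTP/1.1 specification mentions 10% as a typical setting); real browsers and proxies each use their own rules, and the function name is made up for illustration.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Assumed fraction: 1/10 of the document's age when it was fetched.
HEURISTIC_DIVISOR = 10


def heuristic_freshness(date_header: str, last_modified_header: str) -> timedelta:
    """Guess how long a response may be reused when the server gave no
    explicit caching instructions, using its Date and Last-Modified headers."""
    fetched = parsedate_to_datetime(date_header)
    modified = parsedate_to_datetime(last_modified_header)
    age_at_fetch = fetched - modified
    if age_at_fetch < timedelta(0):
        # Last-Modified lies in the future: don't trust it, don't reuse.
        return timedelta(0)
    return age_at_fetch / HEURISTIC_DIVISOR
```

Under this assumption, a page that was last modified ten days before it was downloaded would be reused for one day before the cache checks back with the server.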

The HTTP specification allows the web server itself to tell the browser/proxy what it should really do. It can say that this page will only change at a given time, and so the cached version can be considered valid until then; by setting this time to a large enough interval, it can effectively say this page never changes. Or it can say this page shouldn't be cached at all.

The web server does this with some special, optional headers. The most important of these is Cache-Control. It can have the following values:

  • max-age=x - This means the browser or proxy can reuse the object in its cache for x seconds; after that, it must reload it to ensure it is fresh.
  • public - This means the page can be cached, even if it uses SSL or HTTP authentication; usually those pages are not cached for security reasons.
  • no-cache - This means the cached copy must not be reused without first checking with the server that it is still valid; in practice the page is reloaded or revalidated on every visit. (To forbid storing the page at all, HTTP/1.1 defines a separate value, no-store.)
  • must-revalidate - This means follow the rules strictly. The HTTP specification gives browsers and proxies some leeway to reuse an expired page in certain situations; this tells them that once the page has expired, they must check with the server before using it again.

There are two other headers that provide similar functionality: Expires and Pragma. Expires is similar to setting the max-age with Cache-Control, except you specify the date and time until which the page is valid, not the interval. Pragma: no-cache is approximately equivalent to Cache-Control: no-cache; it was a convention used before HTTP/1.1 defined a proper way to do it, and it may be useful for clients that still only understand HTTP/1.0 or lower.
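
How a browser or proxy might combine these headers can be sketched in Python as follows. This is a simplified illustration, not a complete implementation of the HTTP rules: the function names are made up, the header dictionary is assumed to use exactly the capitalization shown, the parser ignores commas inside quoted values, and Pragma: no-cache is treated like Cache-Control: no-cache as described above. When both are present, max-age wins over Expires.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split 'public, max-age=3600' into {'public': None, 'max-age': '3600'}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def freshness_lifetime(headers: Dict[str, str]) -> Optional[timedelta]:
    """Return how long a response may be reused without asking the server.

    timedelta(0) means "check with the server every time";
    None means the headers give no explicit instruction.
    """
    cc = parse_cache_control(headers.get("Cache-Control", ""))
    if "no-cache" in cc or headers.get("Pragma", "").lower() == "no-cache":
        return timedelta(0)
    if cc.get("max-age") is not None:
        return timedelta(seconds=int(cc["max-age"]))
    if "Expires" in headers and "Date" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        date = parsedate_to_datetime(headers["Date"])
        return max(expires - date, timedelta(0))
    return None
```

A None result is the case where the browser/proxy falls back on its own educated guess.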

Why You Should Control the Cache

By investing a little time and thought, you can reduce your bandwidth, improve the perceived performance, and ensure your users see up-to-date content. The time, bandwidth and server load saved by not transferring redundant copies of static content may even improve performance of the dynamic elements elsewhere on your site. The question is not "Why should I do this?" but "Why haven't I already?"

A common worry of content providers is, "if I encourage caching, I won't have accurate statistics on how my website is accessed". This is mostly a non-issue, because you can still rely on the statistics from your dynamic content. You can also set up a small, non-cacheable image on each of your pages specifically for this purpose.

Strategies for webmasters and programmers

First, don't use HTML META tags with http-equiv to generate these headers. Caching proxies usually look only at the HTTP headers, not the HTML content, so META tags simply won't work with most proxies.

  1. Find out how to add these headers to your content:
    • Apache: Compile with --enable-module=headers and/or --enable-module=expires; add configuration directives to either access.conf or .htaccess.
    • MS IIS: select the web site in Administration Tools and bring up the properties. The two relevant options are Enable Content Expiration and Custom HTTP headers.
    • CGI scripts can print out the headers directly.
    • PHP can use the header() function to generate the headers.
    • ColdFusion can use the CFHEADER tag.
    • ASP can set the Response.Expires and Response.CacheControl properties.
    • Java Servlets can use the HttpServletResponse.addHeader and HttpServletResponse.addDateHeader methods.
  2. Find out a way to view the page headers that are coming back:
    • This allows you to ensure your changes really work.
    • My personal favorite is http://webtools.mozilla.org/web-sniffer/.
  3. Determine which content is truly static and apply headers to reduce reloading:
    • Generally, images are good candidates, especially if they are used on multiple pages of the site.
    • Make sure you refer to your static content consistently. Don't use different URLs to serve the same content; for example, don't embed a user id in the URL of content that is the same for every user.
    • Adopt a policy of giving updated content a new name, so that your mostly static content never needs to expire. For example, if your company updates its logo every year or so, leave the old logo alone and put the new logo on the server with a new name. Then update all the pages to point to the new logo. Or better yet, have the logo URL redirect to a real, versioned logo URL.
    • Example: the Google logo (http://www.google.com/images/logo.gif) expires on January 17, 2038.
  4. Determine which content is truly dynamic and apply headers to indicate as such:
    • Cached dynamic pages are the source of many debugging nightmares when developing web applications. This whole category of problems can be avoided by marking dynamic content so that browsers and proxies never serve an out-of-date copy.
  5. Determine which SSL or HTTP authenticated content is not sensitive and can be cached:
    • This can make a world of difference on secure sites; remember, browsers and proxies are not supposed to cache anything here, so the server is hit with a full reload each time it is accessed.
    • Cache-Control: public is your friend.
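
As a rough illustration of the steps above, here is a small Python sketch using the WSGI interface. Everything specific in it is an assumption made for the example: the /static/ URL prefix, the one-year lifetime, the placeholder image bytes, and the helper names cache_headers, app and response_headers. Static content gets a long max-age plus an equivalent Expires date and is marked public; everything else is treated as dynamic and marked no-cache.

```python
import time
from email.utils import formatdate

ONE_YEAR = 365 * 24 * 60 * 60  # assumed lifetime for static content, in seconds


def cache_headers(path):
    """Pick caching headers from the URL path (hypothetical site layout)."""
    if path.startswith("/static/"):
        # Static content: cacheable by anyone for a long time.
        return [
            ("Cache-Control", "public, max-age=%d" % ONE_YEAR),
            ("Expires", formatdate(time.time() + ONE_YEAR, usegmt=True)),
        ]
    # Dynamic content: always check with the server first.
    return [("Cache-Control", "no-cache"), ("Pragma", "no-cache")]


def app(environ, start_response):
    """Minimal WSGI application that serves one fake image and one dynamic page."""
    path = environ.get("PATH_INFO", "/")
    if path.startswith("/static/"):
        body = b"GIF89a"  # placeholder bytes, not a real image
        content_type = "image/gif"
    else:
        body = ("Generated at %s\n" % time.ctime()).encode("utf-8")
        content_type = "text/plain"
    headers = [("Content-Type", content_type), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers + cache_headers(path))
    return [body]


def response_headers(path):
    """Call the app directly and return its headers, to check the changes work."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    b"".join(app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response))
    return captured
```

Calling response_headers("/static/logo-v2.gif") and response_headers("/search") shows the two policies side by side without needing a network. To try it in a real browser, the app can be served with make_server from Python's wsgiref.simple_server module.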
