meantime: non-consensual http user tracking using caches

From WordNet (r) 1.6 [wn]:
  mean
       2: characterized by malice; "a hateful thing to do"; "in a mean
          mood"; "told spiteful stories about the fat lady" [syn: hateful,
           spiteful]
       3: having or showing a meanspirited lack of honor or morality
       4: (slang) excellent; "famous for a mean backhand"

  time
       5: the continuum of experience in which events pass from the
          future through the present to the past

executive summary

HTTP cache-control headers such as If-Modified-Since allow servers to track individual users in a manner similar to cookies, but with fewer constraints. This is a problem for user privacy against which browsers currently provide little protection.

introduction

Some people would like to be anonymous as they use the web, and other people would like to prevent anonymous access for various reasons. Consider, for example, an internet marketing company that wants to chain together visits to various web sites by a user so as to build a fuller profile of their interests and usage patterns. Conversely, a web user might not wish to leak such information to a site because they are looking at controversial information, desire a good negotiating position, or see privacy as a moral right.

An arms race in techniques for providing and stripping away anonymity has developed over the years. This black paper discusses what is believed to be a new technique for tracking clients and possible responses.

problem statement

Alice is browsing the web; Bob runs a number of otherwise-unrelated web servers. Alice makes several requests to Bob's servers over time. Bob would like to tie together as many as possible of the requests made by Alice to learn more about Alice's usage patterns and identity: we call this identifying the request chain. Alice would like to access Bob's servers but not give away this information.

There are many perfectly good reasons why, in a particular situation, Bob might want to know Alice's identity, or at least a unique pseudonym. If Bob explains why tracking is required, then Alice can consent to it and allow it in various ways. There are several less savory possibilities when Alice does not consent to the tracking, or does not realize that a single chain can be found across apparently unrelated servers controlled by Bob.

The scenario poses an interesting information-theory and game-theory challenge in anonymity. It is also immediately practical: there is a good deal of development being done in aid of both Alice and Bob.

existing approaches

cookies

The standard approach for associating user requests across several responses is the HTTP `Cookie' state-management extension. The Set-Cookie response header allows a server to ask the client to store arbitrary short opaque data, which should be returned in the Cookie request header for future requests of that server matching particular criteria. Cookies are commonly used to store per-user form defaults, to manage web application sessions, and to associate requests between executions of the user agent.

The user agent always has the option to just ignore the Set-Cookie response header, but most implementations default to obeying it to preserve functionality. Cookies can optionally specify an expiry time after which they should no longer be used, that they should persist on disk between client sessions, or that they should only be passed over transmission-level-secure connections.
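
As a rough illustration of the round trip described above, the following sketch uses Python's standard http.cookies module. The cookie name uid, the one-hour lifetime and the helper names make_set_cookie and read_cookie are assumptions made for this example only.

```python
from http.cookies import SimpleCookie


def make_set_cookie(user_id):
    """Build the value of a Set-Cookie response header for one user.

    The name "uid" and the attribute choices are illustrative assumptions.
    """
    cookie = SimpleCookie()
    cookie["uid"] = user_id
    cookie["uid"]["path"] = "/"        # return it for every path on the site
    cookie["uid"]["max-age"] = 3600    # expire after one hour
    cookie["uid"]["secure"] = True     # only send over secure connections
    return cookie["uid"].OutputString()


def read_cookie(cookie_header):
    """Recover the user ID from a Cookie request header, or None."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return cookie["uid"].value if "uid" in cookie else None
```

A user agent that chooses to ignore Set-Cookie simply never sends the Cookie header back, in which case read_cookie finds nothing to report.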

The privacy implications of cookies have been extensively discussed, and several problems have been found and rectified in the past. One example of privacy compromise through cookies is the use of cookies attached to banner images downloaded from a central banner server: the same cookie is used within images linked from several servers, and so the user can be tracked as they move around.

other approaches

An obvious means to associate requests is by source IP address. Over the short term this will generally work quite well, as a client is likely to use a single IP address during a browsing session. Even then it is complicated by proxies acting for multiple clients, network address translation, or multiuser machines. Over a longer term, the information is confounded by dynamically-assigned IPs, mobile computers moving between networks, dialup pools and the like. Indeed, cookies were proposed in large part to allow legitimate stateful applications to cope with the impossibility of uniquely identifying users by IP address.

Within a single site, state may be maintained by generating dynamic URLs that include session identification either within the hostname (http://d9128309812.crackmonkey.org/) or path (http://crackmonkey.org/d213213213/faq.html). However, this does not allow tracking between sites and causes a significant loss of functionality because URLs cannot be shared between users or bookmarked.

Single links can be identified by the HTTP Referer header. There are some limitations here, however: this only identifies the immediately preceding resource, and the link is lost if the user re-enters a URL by hand or retrieves it from a bookmarks file.

countermeasures

Users caring to preserve their privacy have taken various countermeasures against these techniques.

To reassure end-users about cookie privacy issues, user agents such as Netscape Navigator, Microsoft Internet Explorer and Lynx allow the user some control. The most basic control is to enable or disable cookies altogether; some user agents allow this to be specified for particular domains. There may be more fine-grained controls, such as only accepting cookies from the same server as the top-level page currently viewed and not from servers for subsidiary requests such as images or frames.

The broadest protection is afforded by the use of a proxy local to the browser's machine, such as Internet Junkbuster. This software rewrites the request to strip out identifying browser and cookie information, in addition to attempting to remove advertising banners.

Various proxying solutions are available to prevent identification by IP address, such as anonymizer.com and CROWDS.

A similar but more powerful attack is possible through the cache-management headers proposed in draft-mogul-http-delta-02.

caching in http

To make access faster and reduce network usage, browsers generally keep a copy of resources such as pages and images that they download. When a client has a cached copy of a page, it can decide either to use the cached copy as is, or to send a request to the server to check that it is up-to-date.

When the client sends a request for the copy it has in cache, it sends a conditional request describing the cached copy and asking the server to only transfer the body of the resource if it is newer than the cached copy.

The most common means of checking this currently in use is the Last-Modified date header. The server supplies a date in the metadata of the response, and the client returns the same date when sending a conditional request.
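
The exchange can be sketched from the server's side as follows. This is a minimal illustration rather than the behaviour of any particular server: the function name respond and the representation of a response as a (status, headers) pair are assumptions, and the comparison is done at one-second resolution.

```python
from email.utils import formatdate, parsedate_to_datetime


def respond(resource_mtime, if_modified_since=None):
    """Return (status, headers) for a GET of a resource.

    resource_mtime is the resource's modification time in whole seconds
    since the Unix epoch; if_modified_since is the date echoed by the
    client, or None for an unconditional request.
    """
    headers = {"Last-Modified": formatdate(resource_mtime, usegmt=True)}
    if if_modified_since is not None:
        try:
            cached = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            cached = None  # unparseable date: treat as unconditional
        if cached is not None and resource_mtime <= cached:
            return 304, headers  # Not Modified: the body is not resent
    return 200, headers
```

Note that the server has no way to check that the date it receives is one it would legitimately have issued; it only sees the value come back.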

Other techniques, such as checking the length of the resource body, its MD5 hash, and a unique ETag cookie have also been used.

the meantime exploit

The fundament of the meantime exploit is that the server wishes to `tag' the client with some information that will later be reported back, allowing the server to identify a chain. Cookies are a good approach to this, but their privacy implications are well known and so Bob requires a more surreptitious approach.

The HTTP cache-control headers are perfect for this: the data is provided by the server, stored but not verified by the client, and then provided verbatim back to the server on the next matching request.

Two headers in particular are useful: Last-Modified and ETag. Both are designed to help the client and server negotiate whether to use a cached copy or fetch the resource again.

The general approach of meantime is that rather than using the headers for their intended purpose, Bob's servers will instead send down a unique tag for the client.

Last-Modified is constrained to be a date, and therefore is somewhat inflexible. Nevertheless, the server can reasonably choose any second since the Unix epoch, which allows it to tag on the order of one billion distinct clients.
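
To make the encoding concrete, here is a small sketch that maps an integer client ID to an HTTP date and back, one second per ID. The function names are hypothetical and this is not the code of the original demonstration. On 2000-03-28 roughly 954 million seconds had elapsed since the epoch, which is where the figure of about one billion comes from.

```python
from email.utils import formatdate, parsedate_to_datetime


def tag_to_last_modified(client_id):
    """Encode an integer client ID as an HTTP date, one second per ID."""
    return formatdate(client_id, usegmt=True)


def last_modified_to_tag(header_value):
    """Recover the client ID from an echoed If-Modified-Since value."""
    return int(parsedate_to_datetime(header_value).timestamp())


# 954201600 is midnight UTC on 2000-03-28, so any ID below it looks
# like a plausible modification date in the past.
example_header = tag_to_last_modified(954201600)
```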

ETag allows an arbitrary short string to be stored and passed. It is not so commonly implemented in user agents at the moment, and so not such a good choice.
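
For comparison, a sketch of the same idea with ETag. In HTTP/1.1 the client echoes the entity tag in the If-None-Match request header. The helper names and the choice of 64 random bits are assumptions for illustration, and the parsing is deliberately simplistic (it looks only at the first tag in the list).

```python
import secrets


def new_etag():
    """Return a fresh quoted entity tag carrying 64 random bits."""
    return '"' + secrets.token_hex(8) + '"'


def tag_from_if_none_match(header_value):
    """Extract the opaque value of the first entity tag in If-None-Match."""
    first = header_value.split(",")[0].strip()
    if first.startswith("W/"):  # drop the weak-validator prefix
        first = first[2:]
    return first.strip('"')
```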

In both cases the tag will be lost if the client discards the resource from its cache, or if it does not request the exact same resource in the future, or if the request is unconditional. (For example, Netscape sends an unconditional request when the user presses Shift+Reload.) Bob has less control over this than he has with cookies, which can be instructed to persist for an arbitrarily long period.

The date is only sent back for the exact same URL, including any query parameters. By contrast, cookies can be returned for all resources in a site or section of a site. This makes Bob's job a little harder.

Bob should therefore make sure that all pages link to a small common resource: perhaps a one-pixel image. This image is generated by a script that issues a unique timestamp to each new client and records it, and that also records whatever timestamp a returning client presents.
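
A script of this kind might look roughly like the sketch below. It is an in-memory illustration only, not the original proof-of-concept: the class name PixelTagger, the visits table and the (status, headers, tag) return value are all assumptions, and a real script would also have to serve the image body and persist its records.

```python
from email.utils import formatdate, parsedate_to_datetime


class PixelTagger:
    """Hypothetical tracker for one shared image URL."""

    def __init__(self, first_tag=1):
        self.next_tag = first_tag
        self.visits = {}  # tag -> number of requests seen with that tag

    def handle(self, if_modified_since=None):
        """Return (status, headers, tag) for one request for the image."""
        tag = None
        if if_modified_since is not None:
            try:
                tag = int(parsedate_to_datetime(if_modified_since).timestamp())
            except (TypeError, ValueError):
                tag = None  # unparseable date: treat the client as new
        if tag in self.visits:
            # Known client: record the visit and answer 304 so that the
            # tagged copy stays in the client's cache.
            self.visits[tag] += 1
            return 304, {}, tag
        # New client, or a tag we never issued: hand out a fresh one.
        tag = self.next_tag
        self.next_tag += 1
        self.visits[tag] = 1
        return 200, {"Last-Modified": formatdate(tag, usegmt=True)}, tag
```

Each page view that triggers a conditional request for the image adds one entry to the chain for that tag; an unconditional request or a flushed cache starts a new chain.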

intermediate proxies

The presence of proxy caches between the client and the server will complicate the situation for Bob, because if the proxy holds a copy of the resource it might satisfy the request locally or change the cache control criteria. In the extreme case, if the proxy does all the caching and the client none, then Bob will identify all requests through that proxy as a single chain.

Bob need not despair. Proxy usage is still quite low, and there are some indications that people concerned about anonymity will not route their requests through a proxy that might log them.

In fact, a meantime exploit is entirely possible if Bob controls an intermediate proxy. This seems not to be so much of a threat in practice, however, because proxies are most commonly controlled by the administrators of a local network who already have considerable power to trace users.

If intermediate proxies or clients implement expiry heuristics then this can interfere with tracking, but not irredeemably so.

demonstration

We have some proof-of-concept code written in Python, which places a tag in your browser's cache, and allows you to associate a short string with it on our server. It should persist as long as the record remains in your browser's cache.

For various reasons the demonstration is no longer running on this web site.

source code

results

This code is a demonstration of the principle, rather than a full implementation of tracking. Nevertheless:

It works quite reliably against Netscape.

Lynx apparently never sends conditional requests, and so is safe.

Junkbuster does not prevent tracking.

anonymizer.com seems to keep a cache on their servers and rewrites the page as it passes through, so it seems to be safe: all anonymizer.com users appear as one.

implications

Anonymizing software should probably strip out all cache headers. Unfortunately this will slow down access and waste network bandwidth, but it seems necessary that the client should not return any information to the server if it is to preserve anonymity.

Possibly Alice should ask her client to never refresh cached requests unless explicitly requested: this will maintain performance for the common case of unchanged pages. When the page must be refreshed, she should be careful that no information about the previously cached copy is emitted.

If all of Alice's requests were directed through an anonymizing proxy crowd it would be harder to associate the tagged requests with her other activities, but not infeasible.

Clients could try to manipulate the modification times to give Bob less room to move: for example, they could round the time down to the minute, and could clip times to be no more than a year from the current date. But this still leaves several bits of the value under Bob's control: even separating users into equivalence classes based on where they first accessed this site might be interesting, for example.
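
A quick sketch shows how much room such a client-side filter would leave. The function name sanitize_mtime and the exact order of operations (clip first, then round down) are assumptions for illustration. A single year of whole minutes still contains 525,600 distinct values, which is roughly 19 bits.

```python
import math

YEAR_SECONDS = 365 * 24 * 60 * 60


def sanitize_mtime(mtime, now):
    """Clip mtime to within one year of now, then round down to the minute.

    Both arguments are whole seconds since the Unix epoch.
    """
    clipped = max(now - YEAR_SECONDS, min(mtime, now + YEAR_SECONDS))
    return clipped - (clipped % 60)


# Distinct minute values in one year, and the equivalent number of bits
# that remain available to the server after sanitizing.
minutes_per_year = YEAR_SECONDS // 60
remaining_bits = math.log2(minutes_per_year)
```

So rounding and clipping shrink the tag space from about a billion values to about half a million, which narrows but does not close the channel.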

Designers of future protocols should consider similar tagging security issues. For example, although ETags allow better cache consistency than Last-Modified headers, they make tracking even easier by allowing the server to store arbitrary data on the client.

references

There was some discussion of problems with Last-Modified on the HTTP Working Group mailing list.

coverage

Martin Pool is an invited speaker for the Privacy by Design conference organized by Zero Knowledge Systems in November 2000.

meantime featured in a Slashdot story. Although Slashdot happened to post the story on April 1, the paper was released much earlier and is serious.

meantime is also mentioned in Bruce Schneier's October 2000 CRYPTO-GRAM.

revision history

2001-01-09

Shut down demonstration.

2000-03-28

Cleared up the explanation.

2000-03-29

Further revise the text after feedback from OzLabs hackers.

Add Cache-Control headers to the demo to try to be more in the style of HTTP/1.1. Add a form through which users can record a string associated with their cache. Also keep track of how many times they have visited the page.

Test against anonymizer.com and junkbuster.