Tagging cache keys for O(1) batch invalidation

Recently I've been spending some quality time trying to decrease page load times and decrease the number of database accesses on a site I'm working on. As you would probably suspect, that means dealing with caching. One common thing that I need to do, however, is invalidate a large group of cache keys when some action takes place. I've devised a pattern for doing this, and while I'm sure it's not novel, I haven't seen any recent write-ups of this technique. The base idea is that we're going to add another thin cache layer, and use the value from that first layer in the key to the second layer.

First, let me give a concrete example of the problem that I'm trying to solve. I'm going to use Django/Python from here on in, but you could substitute anything else, as this pattern should work across other frameworks and even other languages.

import datetime
from django.db import models

class Favorite(models.Model):
    user = models.ForeignKey(User)
    item = models.ForeignKey(Item)
    date_added = models.DateTimeField(default=datetime.datetime.now)

    def __unicode__(self):
        return u'%s has favorited %s' % (self.user, self.item)

Given this model, now let's say that we have a function that gets the Favorite instances for a given user, which might look like this:

def get_favorites(user, start=None, end=None):
    faves = Favorite.objects.filter(user=user)
    return list(faves[start:end])

There's not much here yet--we're simply filtering to only include the Favorite instances for the given user, slicing it based on the given start and end numbers, and forcing evaluation before returning a list. Now let's start thinking about how we will cache this. We'll start by just implementing a naive cache strategy, which in this case simply means that the cache is never invalidated:

from django.core.cache import cache

def get_favorites(user, start=None, end=None):
    key = 'get_favorites-%s-%s-%s' % (user.id, start, end)
    faves = cache.get(key)
    if faves is not None:
        return faves
    faves = Favorite.objects.filter(user=user)[start:end]
    cache.set(key, list(faves), 86400 * 7)
    return faves

Now we come to the hard part: how do we invalidate those cache keys? It's especially tricky because we don't know exactly what keys have been created. What combinations of start/end have been given? We could invalidate all combinations of start/end up to some number, but that's horribly inefficient and wasteful. So what do we do? My solution is to introduce another layer. Let me explain with code:

import uuid
from django.core.cache import cache

def favorite_list_hash(user):
    key = 'favorite-list-hash-%s' % (user.id,)
    cached_key_hash = cache.get(key)
    if cached_key_hash:
        key_hash = cached_key_hash
    else:
        key_hash = str(uuid.uuid4())
        cache.set(key, key_hash, 86400 * 7)
    return (key_hash, not cached_key_hash)

Essentially what this gives us is a temporary unique identifier for each user, that's either stored in cache or generated and stuffed into the cache. How does this help? We can use this identifier in the keys to the get_favorites function:

from django.core.cache import cache

def get_favorites(user, start=None, end=None):
    key_hash, created = favorite_list_hash(user)
    key = 'get_favorites-%s-%s-%s-%s' % (user.id, start, end, key_hash)
    if not created:
        faves = cache.get(key)
        if faves is not None:
            return faves
    faves = Favorite.objects.filter(user=user)[start:end]
    cache.set(key, list(faves), 86400 * 7)
    return faves

As you can see, the first thing we do is grab that hash for the user, then we use it as the last part of the key for the function. The whole if not created thing is just an optimization that helps to avoid cache fetches when we know they will fail. Here's the great thing now: invalidating all of the different cached versions of get_favorite for a given user is a single function call:

from django.core.cache import cache

def clear_favorite_cache(user):
    cache.delete('favorite-list-hash-%s' % (user.id,))

By deleting that single key, the next time get_favorites is called, it will call favorite_list_hash which will result in a cache miss, which will mean it will generate a new unique identifier and stuff it in cache, meaning that all of the keys for get_favorites are instantly different. I think that this is a powerful pattern that allows for coarser-grained caching without really sacrificing much of anything.

There is one aspect of this technique that some people will not like: it leaves old cache keys around taking up memory. I don't consider this a problem because memory is cheap these days and Memcached is generally smart about evicting the least recently used data.

I'm interested though, since I don't see people posting much about nontrivial cache key generation and invalidation. How are you doing this type of thing? Are most people just doing naive caching and calling that good enough?

It's Caches All the Way Down

A few years back a non-technical person asked me to explain what, exactly, the operating system did. I first started out by speaking in broad generalities, saying that it takes care of the basic needs of the computer and that it helps one thing talk to another. But that answer wasn't good enough. This person wanted me to go into more detail. What happens when I'm in the calculator application and I type 1+1 and hit enter? What things need to happen?

That really tripped me up. It's hard to even know where to start when someone who's curious like that asks a relatively innocent question like that. So I decided to just start from the basics. "A processor has these things called registers", I said. "And they store a tiny little amount of information, which they can do things with, like add and subtract." That seemed to basically satisfy them, but not me.

I made sure to say that, "You can't just keep everything in registers, because there are only a few of those. So you have to keep some data stored in this place with more space for data but it's slower." To which they responded, "Oh, so is that why when I add more RAM my computer gets faster?" "No," I said, "this is all still on the processor." "OK, so what's RAM then and why is it supposed to help?" they asked. "Because we can't store everything on the processor itself, because there's not enough space for data, so we have to store it in this place with much more space for data, but which is slower."

At this point I realized how redundant this conversation was going to get. It's the same thing for the hard drive, and (if you're old-school) tape drive or (if you're new-school) the internet. At the same point that person seemed to get bored--that moment of curiosity had passed. Computers just do their thing and we really didn't have to do anything special for all of that data movement to take place.

That's what struck me. When you come down to it, computers are just a waterfall of different caches, with systems that determine what piece of data goes where and when. For the most part, in user space, we don't care about much of that either. When writing a web application or even a desktop application, we don't much care whether a bit is in the register, the L1 or L2 cache, RAM, or if it's being swapped to disk. We just care that whatever system is managing that data, it's doing the best it can to make our application fast.

But then another thing struck me. Most web developers DO have to worry about the cache. We do it every day. Whether you're using memcached or velocity or some other caching system, everyone's manually moving data from the database to a cache layer, and back again. Sometimes we do clever things to make sure this cached data is correct, but sometimes we do some braindead things. We certainly manage the cache keys ourselves and we come up with systems again and again to do the same thing.

Does this strike anyone else as inconsistent? For practically every cache layer down the chain, a system (whose algorithms are usually the result of years and years of work and testing) automatically determines how to best utilize that cache, yet we do not yet have a good system for doing that with databases. Why do you think that is? Is it truly one of the two hard problems of computer science? Is it impossible to do this automatically? I honestly don't have the answers, but I'm interested if you do.

Memcached and OpenWebLoad

There has been a somewhat minor update to the site. Now almost all of the site is cached. That is, there is a memcached backend running on the server, and this website can take advantage of that.

To test out the site's load capabilities, I used an open source program called OpenWebLoad. It's quite impressive in its simplicity and utility. With it, I was able to test a load of over 50 simultaneous users to the website. Surprisingly, the site handled it with flying colors.

On a side note, I went to Wil Wheaton's site after a long stint of not visiting it, and it's still as good as it's ever been. There's an almost tangible difference between a writer's website and a blogger's website. I think that it boils down to the fact that using only words, Wil can make you feel the emotion that he wants you to feel.

Search

Badges

  • django badge
  • apache badge
  • GeoURL
  • XFN Friendly
  • Valid HTML 4.01 Transitional