Drop-dead simple Django caching

Caching is easy to screw up. Usually it's a manual process which is error-prone and tedious. It's actually quite easy to cache, but knowing when to invalidate which caches becomes a lot harder. There is a subset of caching the caching problem that, with Django, can be done quite easily. The underlying idea is that every Django model has a primary key, which makes for an excellent key to a cache. Using this basic idea, we can cover a fairly large use case for caching, automatically, in a much more deterministic way. Let's begin.

First, we need to decide upon a setting for how long each individual item should be saved in the cache. I'm going to call that SIMPLE_CACHE_SECONDS and grab it like so:

from django.conf import settings

SIMPLE_CACHE_SECONDS = getattr(settings, 'SIMPLE_CACHE_SECONDS', 2592000)

The next thing we need to do is be able to generate a cache key from an instance of a model. Thanks to Django's _meta information, we can get the app label and model name, plus the primary key, and we're all set.

def key_from_instance(instance):
    opts = instance._meta
    return '%s.%s:%s' % (opts.app_label, opts.module_name, instance.pk)

So now let's start setting the cache! My preferred way to do it is via a signal, but you could do it in a less generic way by overriding save on a model. My signal looks like this:

from django.core.cache import cache
from django.db.models.signals import post_save

def post_save_cache(sender, instance, **kwargs):
    cache.set(key_from_instance(instance), instance, SIMPLE_CACHE_SECONDS)
post_save.connect(post_save_cache)

Now that we're putting items in the cache, we should probably delete them from the cache when the model instance is deleted:

from django.db.models.signals import pre_delete

def pre_delete_uncache(sender, instance, **kwargs):
    cache.delete(key_from_instance(instance))
pre_delete.connect(pre_delete_uncache)

This is all good and well, but right now we don't really have a way to get at that information. Cache is pretty useless if we never use it! Our interface to the database is through the model's QuerySet, so let's make sure that our QuerySet is making good use of our newly-populated cache. To do so, we'll subclass QuerySet:

from django.db.models.query import QuerySet

class SimpleCacheQuerySet(QuerySet):
    def filter(self, *args, **kwargs):
        pk = None
        for val in ('pk', 'pk__exact', 'id', 'id__exact'):
            if val in kwargs:
                pk = kwargs[val]
                break
        if pk is not None:
            opts = self.model._meta
            key = '%s.%s:%s' % (opts.app_label, opts.module_name, pk)
            obj = cache.get(key)
            if obj is not None:
                self._result_cache = [obj]
        return super(SimpleCacheQuerySet, self).filter(*args, **kwargs)

The only method that we really need to overwrite is filter, since get and get_or_create both just rely on filter anyway. The first for loop in the filter method just checks to see if there is a query by id or pk, and if so, then we construct a key and try to fetch it from the cache. If we found the item in the cache, then we place it into Django's internal result cache. At that point we're as good as done. Then we just let Django do the rest!

This SimpleCacheQuerySet won't be used all on its own though, we need to actually force a model to use it. How do we do that? We create a manager:

from django.db import models

class SimpleCacheManager(models.Manager):
    def get_query_set(self):
        return SimpleCacheQuerySet(self.model)

Now that we have this transparent caching library set up, we can go around to all of our models and import it and attach it as needed. Here's how that might look:

from django.db import models
from django_simplecache import SimpleCacheManager

class BlogPost(models.Model):
    title = models.TextField()
    body = models.TextField()

    objects = SimpleCacheManager()

That's it! Just by attaching this manager to our model we're getting all the benefits of per-object caching right away. Of course, this isn't comprehensive. It does hit the vast majority of use cases, though. If you were to use this for a real site, however, then you wouldn't be able to use update method. It's a little bit trickier since there's no post_update signal, but it's nowhere near impossible. Let's just say that, for now, it's being left unimplemented as an exercise for the reader. in_bulk would be actually quite fun to implement, too, because you could get all of the results possible from cache, and all the rest could be gotten from the database, then merge those two dictionaries before returning.

I think this would be a really good reusable Django application. Essentially, we've grown a library from the ground up that really isn't all that much code. I think it took me 20 minutes to write the actual code, but with some serious polish and love, this library could evolve into something that I think many reusable apps would use to great benefit. What do you think? What should a good, simple, Django caching library have?

It's Caches All the Way Down

A few years back a non-technical person asked me to explain what, exactly, the operating system did. I first started out by speaking in broad generalities, saying that it takes care of the basic needs of the computer and that it helps one thing talk to another. But that answer wasn't good enough. This person wanted me to go into more detail. What happens when I'm in the calculator application and I type 1+1 and hit enter? What things need to happen?

That really tripped me up. It's hard to even know where to start when someone who's curious like that asks a relatively innocent question like that. So I decided to just start from the basics. "A processor has these things called registers", I said. "And they store a tiny little amount of information, which they can do things with, like add and subtract." That seemed to basically satisfy them, but not me.

I made sure to say that, "You can't just keep everything in registers, because there are only a few of those. So you have to keep some data stored in this place with more space for data but it's slower." To which they responded, "Oh, so is that why when I add more RAM my computer gets faster?" "No," I said, "this is all still on the processor." "OK, so what's RAM then and why is it supposed to help?" they asked. "Because we can't store everything on the processor itself, because there's not enough space for data, so we have to store it in this place with much more space for data, but which is slower."

At this point I realized how redundant this conversation was going to get. It's the same thing for the hard drive, and (if you're old-school) tape drive or (if you're new-school) the internet. At the same point that person seemed to get bored--that moment of curiosity had passed. Computers just do their thing and we really didn't have to do anything special for all of that data movement to take place.

That's what struck me. When you come down to it, computers are just a waterfall of different caches, with systems that determine what piece of data goes where and when. For the most part, in user space, we don't care about much of that either. When writing a web application or even a desktop application, we don't much care whether a bit is in the register, the L1 or L2 cache, RAM, or if it's being swapped to disk. We just care that whatever system is managing that data, it's doing the best it can to make our application fast.

But then another thing struck me. Most web developers DO have to worry about the cache. We do it every day. Whether you're using memcached or velocity or some other caching system, everyone's manually moving data from the database to a cache layer, and back again. Sometimes we do clever things to make sure this cached data is correct, but sometimes we do some braindead things. We certainly manage the cache keys ourselves and we come up with systems again and again to do the same thing.

Does this strike anyone else as inconsistent? For practically every cache layer down the chain, a system (whose algorithms are usually the result of years and years of work and testing) automatically determines how to best utilize that cache, yet we do not yet have a good system for doing that with databases. Why do you think that is? Is it truly one of the two hard problems of computer science? Is it impossible to do this automatically? I honestly don't have the answers, but I'm interested if you do.

Django Tip: A Denormalization Alternative

In creating an any website with textual content, you have the choice of either writing plaintext or writing in a markup language of some kind. The immediately obvious choice for markup language is HTML (or XHTML), but HTML is not as human-readable as something like Textile, Markdown, or Restructured Text. The advantage of choosing one of those human-readable alternatives is that content encoded using one of them can be translated very easily into HTML.

When one of my friends started designing his blog using Django, it got me thinking about how best to deal with that translated HTML. It seems like a waste to keep re-translating it every time a visitor views the page, but it also seems like it's redundant to keep the translated HTML stored in the database.

Here's my solution to the problem: cache it. For a month. Here's an example, using Restructured Text:

from django.db import models
from django.contrib.markup.templatetags.markup import restructuredtext
from django.core.cache import cache
from django.utils.safestring import mark_safe

class MyContent(models.Model):
    content = models.TextField()

    def _get_content_html(self):
        key = 'mycontent_html_%s' % str(self.pk)
        html = cache.get(key)
        if not html:
            html = restructuredtext(self.content)
            cache.set(key, html, 60*60*24*30)
        return mark_safe(html)
    content_html = property(_get_content_html)

    def save(self):
        if self.id:
            cache.delete('mycontent_html_%s' % str(self.pk))
        super(MyContent, self).save()

What I'm doing here is writing a method which either gets the translated HTML from the cache, or translates it and stores it in the cache for a month. Then, it returns it as safe HTML to display in a template. The last thing that we do is override the save method on the model, so that whenever the model is re-saved, the cache is deleted.

There we go! We now have the HTML-rendered data that we want, and no duplicated data in the database. Keep in mind that this way of doing things becomes more and more useful the more RAM that your webserver has.

Search

Badges

  • django badge
  • apache badge
  • GeoURL
  • XFN Friendly
  • Valid HTML 4.01 Transitional