This post is my final post in the blog-post-per-day challenge. There have been days when I really didn't want to blog (resulting in posts like this), and there have been days when I was excited about what I was writing and spent a lot of time on a post (resulting in posts like this one). It was much more difficult and time-consuming than I was expecting. Another thing I wasn't expecting, and which kept motivating me to write more, was actual traffic to this site. Previously this site saw maybe 100 hits/day, mostly from Google searches on certain Django topics. When I started doing the blog-post-per-day thing, however, this is what my traffic turned into:
But as I started to look at where that traffic was coming from, I realized it wasn't organic at all. (insert sad puppy face here) In fact, most of the traffic was coming from just a few sites. Here's what my top 10 referring sites were:
Once I found out that nearly all my traffic was coming from reddit, I started trying to cater to that audience. To reddit's credit, however, the more I tried to target content towards what I thought redditors would like, the worse the articles did over there. Either I was doing a bad job of writing articles for that audience, or they wised up to my act (I'm thinking the latter is more likely).
I was also very surprised by which articles turned out to be the most popular. Here's the list of articles that I thought were my best:
- Easy Multi-Database Support for Django
- Writing a Markov-Chain IRC Bot with Twisted and Python
- Lambda Calculus
- Drop-dead simple Django caching
- Using CouchDB with Django
- Why CouchDB Rocks
- "Web Hooks"
- Reverse HTTP
Here's the list of my top 8 most popular articles over the past few days, in order, by traffic:
- Gems of Python
- Why CouchDB Sucks
- It's caches all the way down
- The internet is in immediate danger of collapse
- Why use VARCHAR when you can use TEXT?
- Using CouchDB with Django
- Why CouchDB Rocks
- Secrets of the Django ORM
Interestingly enough, only two posts appear on both lists. Really, it surprised me which articles got picked up and which didn't. I suppose it has something to do with the sensational titles and the sometimes-controversial posts. The VARCHAR/TEXT post in particular got much more push-back than I was expecting. In hindsight, I was wrong to mention anything other than PostgreSQL and SQLite, as those are the only databases I've actually done the TEXT-only experimentation with.
After doing 14 screencasts and now 30 blog posts over the past few months, I'm pretty well spent in terms of creating new, original content. That's not to say that I'm going to stop writing here or anything like that, but I certainly won't be posting quite as much. When I do, it will be because I legitimately have something to say, not because of an obligation.
Thanks for visiting my site while I participated in this challenge. I hope you stick around.
Last week I wrote an article called Why CouchDB Sucks, which many people correctly said should have been called "What CouchDB Sucks at Doing". Nearly everyone pointed out that it was not designed to do the things that I was mentioning in the article. This time around, I'd like to focus on some of the features of CouchDB that I think absolutely rock.
CouchDB is schema-free
One of the most annoying parts of dealing with a traditional SQL database is that you invariably need to change your schemata. This can be done usually with some ALTER TABLE statements, but other times it requires scripts and careful use of transactions, etc. In CouchDB, the solution is to just start using your new schema. No migration needed. If it's a significant change, then you might need to change your views slightly, but nothing as annoying as what would be needed with SQL.
The other advantage of having no schema is that some types of data just aren't well suited to having a strict schema enforced upon them. My CouchDB-based lifestreaming application is a perfect example of the inherent flexibility of CouchDB's schemaless design: all kinds of disparate information can be stored alongside each other, then sorted and aggregated. You aren't required to use its schema-free nature this way, either. You could, for example, manually enforce a schema for certain databases, if needed.
CouchDB is RESTful HTTP
When is the last time you tried to install MySQL or PostgreSQL drivers for your web development platform of choice? If you're using apt-get it's not so bad, but for just about every other platform, it's a total pain to get these drivers up and running. With CouchDB, there's no need. It speaks HTTP. Want to create a new database? Send an HTTP PUT request. Want to retrieve a document from the database? Send an HTTP GET. Want to delete a database? Send an HTTP DELETE. As you can see, the API is quite straightforward and if a client library doesn't already exist for your language of choice (hint: it does), then it will take you only a few minutes to write one.
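To make those verbs concrete, here is a minimal sketch using only Python's standard library. It builds the requests rather than sending them, so it runs without a server. The base URL (a local CouchDB on its default port, 5984), the database name, and the helper function names are all assumptions made for illustration.

```python
import urllib.request

# Assumed base URL: a CouchDB server running locally on its default port.
BASE_URL = "http://localhost:5984"


def create_database(name):
    """Build the request that creates a database: PUT /<name>."""
    return urllib.request.Request("%s/%s" % (BASE_URL, name), method="PUT")


def get_document(db, doc_id):
    """Build the request that retrieves one document: GET /<db>/<doc_id>."""
    return urllib.request.Request(
        "%s/%s/%s" % (BASE_URL, db, doc_id), method="GET"
    )


def delete_database(name):
    """Build the request that deletes a database: DELETE /<name>."""
    return urllib.request.Request("%s/%s" % (BASE_URL, name), method="DELETE")
```

Passing any of these to urllib.request.urlopen would send it to a running server, which is about all a bare-bones client library needs to do.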
But the best part about this is that we already have so many amazing and well-tested tools to deal with HTTP. For example, let's say you want to store one database on one server and another database on another server. It's as simple as setting up nginx or perlbal or varnish as a reverse proxy and having each URL go to a different machine. The same thing goes for transparent caching, etc. Oh, and also, web browsers know how to speak HTTP, too. You could easily write whole web apps served only from CouchDB.
Map/Reduce
"Map/Reduce will kill every traditional data warehousing vendor in the market. Those who adapt to it as a design/deployment pattern will survive, the rest won't."
Sounds like someone from Google must have said this, or some Hadoop evangelist, or maybe someone who works on CouchDB. In fact, this comes from Brian Aker, a MySQL hacker who was Director of Architecture at MySQL AB and is now developing the open source fork of MySQL named Drizzle (also a very exciting project in its own right). He's right, too. Google was on to something in a big way when they unveiled their whitepaper on Map/Reduce. It's not the be-all end-all for processing and generating large data sets, but it certainly is a proven technology for that task.
Brian talks about massively multi-core machines, which seem inevitable these days, and how we will need to start writing logic that is massively parallelizable to take advantage of all those CPUs. Map/Reduce is one way to force ourselves to write logic that can be parallelized. It is a good choice for any new database system to adopt for this reason, and that's why it's great to see that CouchDB has adopted it. It's just one more reason why CouchDB rocks.
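As a rough illustration of why the pattern parallelizes well, here is a tiny map/reduce in plain Python. It is not CouchDB's view API (CouchDB views are normally written in JavaScript), and the sample documents and the map_doc/reduce_values names are invented for the example. The point is that map_doc only ever looks at one document, and reduce_values only combines the values for one key, so either step could in principle be spread across many cores.

```python
from collections import defaultdict


def map_doc(doc):
    """Emit (key, value) pairs for one document, independently of all others."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_values(values):
    """Combine all the values emitted for a single key."""
    return sum(values)


def map_reduce(docs):
    """Run the map step over every document, group by key, then reduce."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


# Made-up documents, purely for illustration.
docs = [
    {"title": "Why CouchDB Rocks", "tags": ["couchdb", "databases"]},
    {"title": "Using CouchDB with Django", "tags": ["couchdb", "django"]},
]
tag_counts = map_reduce(docs)
```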
So much more
I could talk about how it can handle 2,500 concurrent requests in 10 MB of resident memory. I could talk about its pluggable view server backends, so that instead of writing views in JavaScript you can write them in Python or any other language (given the correct bindings). I could talk about CouchDBX, which makes installing it on the Mac, quite literally, one click. I could even talk about how it's written in Erlang, with an eye towards scalability. Or maybe about how its database store is append-only.
I could talk about any of those things, and more. It just comes down to this: CouchDB rocks. But don't take my word for it--try it out for yourself!
Caching is easy to screw up. Usually it's a manual process which is error-prone and tedious. It's actually quite easy to cache, but knowing when to invalidate which caches is a lot harder. There is a subset of the caching problem that, with Django, can be handled quite easily. The underlying idea is that every Django model has a primary key, which makes for an excellent key to a cache. Using this basic idea, we can cover a fairly large use case for caching, automatically, in a much more deterministic way. Let's begin.
First, we need to decide upon a setting for how long each individual item should be saved in the cache. I'm going to call that SIMPLE_CACHE_SECONDS and grab it like so:
    from django.conf import settings

    # How long each cached object lives, in seconds (default: 30 days).
    SIMPLE_CACHE_SECONDS = getattr(settings, 'SIMPLE_CACHE_SECONDS', 2592000)
The next thing we need to do is be able to generate a cache key from an instance of a model. Thanks to Django's _meta information, we can get the app label and model name, plus the primary key, and we're all set.
    def key_from_instance(instance):
        # Combine app label, model name, and primary key, e.g. 'blog.blogpost:42'.
        opts = instance._meta
        return '%s.%s:%s' % (opts.app_label, opts.module_name, instance.pk)
So now let's start setting the cache! My preferred way to do it is via a signal, but you could do it in a less generic way by overriding save on a model. My signal looks like this:
    from django.core.cache import cache
    from django.db.models.signals import post_save

    def post_save_cache(sender, instance, **kwargs):
        # Store (or refresh) the cached copy every time an instance is saved.
        cache.set(key_from_instance(instance), instance, SIMPLE_CACHE_SECONDS)

    post_save.connect(post_save_cache)
Now that we're putting items in the cache, we should probably delete them from the cache when the model instance is deleted:
    from django.db.models.signals import pre_delete

    def pre_delete_uncache(sender, instance, **kwargs):
        # Drop the cached copy just before the instance is deleted.
        cache.delete(key_from_instance(instance))

    pre_delete.connect(pre_delete_uncache)
This is all good and well, but right now we don't really have a way to get at that information. Cache is pretty useless if we never use it! Our interface to the database is through the model's QuerySet, so let's make sure that our QuerySet is making good use of our newly-populated cache. To do so, we'll subclass QuerySet:
    from django.db.models.query import QuerySet

    class SimpleCacheQuerySet(QuerySet):
        def filter(self, *args, **kwargs):
            # Look for a lookup by primary key.
            pk = None
            for val in ('pk', 'pk__exact', 'id', 'id__exact'):
                if val in kwargs:
                    pk = kwargs[val]
                    break
            # filter() returns a new QuerySet, so a cached object has to be
            # placed on that clone rather than on self.
            clone = super(SimpleCacheQuerySet, self).filter(*args, **kwargs)
            if pk is not None:
                opts = self.model._meta
                key = '%s.%s:%s' % (opts.app_label, opts.module_name, pk)
                obj = cache.get(key)
                if obj is not None:
                    clone._result_cache = [obj]
            return clone
The only method that we really need to override is filter, since get and get_or_create both just rely on filter anyway. The for loop in the filter method checks whether the query is by id or pk, and if so, we construct a key and try to fetch the object from the cache. If we find the item in the cache, we place it into Django's internal result cache on the QuerySet, so the database never needs to be hit. At that point we're as good as done. Then we just let Django do the rest!
This SimpleCacheQuerySet won't be used all on its own, though. We need to actually force a model to use it. How do we do that? We create a manager:
    from django.db import models

    class SimpleCacheManager(models.Manager):
        def get_query_set(self):
            # Hand out our cache-aware QuerySet instead of the default one.
            return SimpleCacheQuerySet(self.model)
Now that we have this transparent caching library set up, we can go around to all of our models and import it and attach it as needed. Here's how that might look:
    from django.db import models
    from django_simplecache import SimpleCacheManager

    class BlogPost(models.Model):
        title = models.TextField()
        body = models.TextField()

        objects = SimpleCacheManager()
That's it! Just by attaching this manager to our model we're getting all the benefits of per-object caching right away. Of course, this isn't comprehensive. It does hit the vast majority of use cases, though. If you were to use this for a real site, however, you wouldn't be able to use the update method, since it changes rows without calling save and so never refreshes the cache. Supporting it is a little bit trickier since there's no post_update signal, but it's nowhere near impossible. Let's just say that, for now, it's being left unimplemented as an exercise for the reader. in_bulk would actually be quite fun to implement, too, because you could get all of the results possible from the cache, get the rest from the database, and then merge those two dictionaries before returning.
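The in_bulk idea could be sketched roughly like this. To keep it runnable without a Django project, the cache and the database are passed in as plain callables; cached_in_bulk, cache_get_many, and fetch_from_db are hypothetical names for this example, not part of Django. In a real project the cache lookup would probably be Django's cache.get_many and the database fetch the stock in_bulk.

```python
def cached_in_bulk(pks, make_key, cache_get_many, fetch_from_db):
    """Return {pk: obj}, reading from the cache first and the database second.

    make_key(pk)         -> cache key for a primary key
    cache_get_many(keys) -> {key: obj} for the keys found in the cache
    fetch_from_db(pks)   -> {pk: obj} for the rows found in the database
    """
    key_to_pk = {make_key(pk): pk for pk in pks}
    cached = cache_get_many(list(key_to_pk))
    results = {key_to_pk[key]: obj for key, obj in cached.items()}
    # Only the primary keys that missed the cache go to the database.
    missing = [pk for pk in pks if pk not in results]
    if missing:
        results.update(fetch_from_db(missing))
    return results


# Stand-ins for a real cache and a real table, for demonstration only.
fake_cache = {"blog.blogpost:1": "post one (from cache)"}
fake_db = {1: "post one", 2: "post two"}


def demo_get_many(keys):
    return {key: fake_cache[key] for key in keys if key in fake_cache}


def demo_fetch(pks):
    return {pk: fake_db[pk] for pk in pks if pk in fake_db}


posts = cached_in_bulk(
    [1, 2], lambda pk: "blog.blogpost:%s" % pk, demo_get_many, demo_fetch
)
```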
I think this would be a really good reusable Django application. Essentially, we've grown a library from the ground up that really isn't all that much code. I think it took me 20 minutes to write the actual code, but with some serious polish and love, this library could evolve into something that I think many reusable apps would use to great benefit. What do you think? What should a good, simple, Django caching library have?
Yesterday I wrote about Web Hooks and how powerful it could be if one web service sends HTTP requests to another web service. Today I want to take that concept one step further. What if you tell that service that you would like it to send a POST request back to you, whenever an event happens? This slight modification makes for a very powerful tool.
Let's take the example of popular real-time web applications like Facebook's instant messenger or FriendFeed's "Real-time" view. Both of these services make use of a technique called long polling, where the client sends an HTTP request and the server does not respond until it has some event to deliver. The client can only keep the request open for so long, so it periodically times out and re-sends the request. (It also re-sends the request whenever it does receive some data.)
The problem with this technique is that it's really trying to turn a client into a server. It's really fighting against the way that HTTP wants to work. So why fight it? Imagine that all of our browsers have simple, lightweight, HTTP servers installed. The client could request to upgrade to reverse HTTP, and then the server could initiate a connection with the client. Now, as events come in to the web service, the service could directly send those updates to the client.
Going back to the example of Facebook IM, here's how that would work: When I open a Facebook page, my client sends a request to Facebook's IM server. Facebook's IM server sends a response with the HTTP/1.1 Upgrade header reading "PTTH/0.9" (funny, huh?). Then, the client knows to accept an HTTP connection from Facebook's IM server. Facebook's IM server then opens that connection with the client, and sends HTTP POSTs every time it receives a new instant message that the client should receive. The client's web browser would have some JavaScript hooks to parse the body of those requests, so that it could update the content of the instant message window on the page.
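Here is a toy simulation of that exchange, using a connected pair of local sockets in place of a browser and an IM server. Several things in it are assumptions rather than part of any real implementation: the 101 status line, the paths, and the message text are made up, and for simplicity the sketch reuses the already-open socket after the upgrade instead of having the service open a new connection.

```python
import socket

UPGRADE_TOKEN = "PTTH/0.9"  # the protocol name mentioned in the post


def read_message(sock):
    """Read one HTTP message (head plus Content-Length body) from sock.

    Good enough for this demo, where only one message is in flight at a time.
    """
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("socket closed before a full message arrived")
        data += chunk
    head, _, body = data.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers.get("content-length", "0"))
    while len(body) < length:
        body += sock.recv(4096)
    return lines[0], headers, body


def run_demo():
    browser, service = socket.socketpair()
    try:
        # 1. The browser makes an ordinary request to the service.
        browser.sendall(b"GET /im HTTP/1.1\r\nHost: example.invalid\r\n\r\n")
        read_message(service)

        # 2. The service asks to switch to reverse HTTP (status line assumed).
        upgrade = (
            "HTTP/1.1 101 Switching Protocols\r\n"
            "Upgrade: %s\r\nConnection: Upgrade\r\n\r\n" % UPGRADE_TOKEN
        )
        service.sendall(upgrade.encode("ascii"))
        _, headers, _ = read_message(browser)
        assert headers["upgrade"] == UPGRADE_TOKEN

        # 3. Roles are now reversed: the service POSTs, the browser responds.
        text = b"hello from a friend"
        service.sendall(
            b"POST /messages HTTP/1.1\r\nContent-Length: %d\r\n\r\n" % len(text)
            + text
        )
        request_line, _, received = read_message(browser)
        browser.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
        reply, _, _ = read_message(service)
        return request_line, received, reply
    finally:
        browser.close()
        service.close()
```

Calling run_demo() returns the request line the "browser" received, the message body, and the status line the "service" got back, which is the whole round trip in miniature.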
Isn't this brilliant? It meshes directly with the HTTP protocol, and it turns a system that seems like a hack right now into an elegant solution. I really wish I could take credit for thinking this up, but I can't. My coworker Donovan Preston blew my mind with this a few weeks back. If you're looking for a more visual example of how this might work, or a reference implementation of the protocol in action, check out this wiki page.
A few months back GitHub rolled out its implementation of something that they call "Service Hooks". The idea behind these hooks is that when you commit some new piece of code to GitHub, they want to be able to alert other services that you have committed that code. For example, one of the service hooks is the ability to send a tweet to Twitter, and another of those hooks updates the Lighthouse ticket tracker.
I thought this was a really good idea when they rolled it out, so I did a bit of searching and found out that there is a larger body of work surrounding this idea, and that body of work is called Web Hooks. The central idea behind web hooks is that a user supplies a service that they use with a URL. Then, when that user performs an action on that service, the service agrees to send an HTTP POST request to the user's specified URL, with some information about the action that the user took on the service.
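From the service's side, that contract might be sketched as follows. The HookRegistry class, the JSON payload format, and the example URL are all assumptions made for illustration; the idea itself only says that "some information about the action" gets POSTed. The sketch builds the requests without sending them, so it runs without any network access.

```python
import json
import urllib.request


class HookRegistry:
    """Remembers which URLs each user wants notified (hypothetical helper)."""

    def __init__(self):
        self._urls = {}

    def register(self, user, url):
        """Record a URL supplied by the user."""
        self._urls.setdefault(user, []).append(url)

    def build_notifications(self, user, event):
        """Build one POST request per registered URL describing the event."""
        body = json.dumps(event).encode("utf-8")
        return [
            urllib.request.Request(
                url,
                data=body,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            for url in self._urls.get(user, [])
        ]


registry = HookRegistry()
registry.register("alice", "http://example.invalid/hooks/commits")
pending = registry.build_notifications(
    "alice", {"action": "commit", "message": "Fix typo"}
)
```

A real service would then hand each request to something like urllib.request.urlopen, ideally from a background queue so that a slow receiver doesn't hold up the user's action.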
SlideShare has an excellent presentation deck about this idea, which likens it to Unix pipes. That analogy makes a lot of sense if you think about it. With the standard model that most websites follow today, a client can only send requests. This means repeated polling until the client receives the information that it is interested in. With web hooks, however, the service is responsible for passing that action along to the next service. This simple yet powerful mechanism can allow for very advanced systems which interoperate very simply through chaining.
Let's explore a concrete example of what this might look like. A few months back I signed up for a pro account on Flickr, so that I could upload some of the pictures that I had stored on my computer. I uploaded some pictures with descriptions, and then I went and posted links to some of those pictures on Twitter. I also added that new Flickr account to FriendFeed so that others could see my pictures as well.
This was all a manual process. If both Flickr and Twitter supported web hooks, I could have simply set up their respective URLs and uploaded my pictures. The process might have happened like this: First, the pictures are uploaded. Then Flickr sends a POST request to Twitter, with the description of the picture and a link to the picture. Twitter sends a POST request to FriendFeed, adding the new item to my FriendFeed lifestream.
You could even write custom scripts to handle the web hooks. For example let's say that I want any tweet with the name 'Kevin' to be sent to my brother's email address. I could add a URL to Twitter linking to a script on my computer which scans the contents of the tweet. If the tweet has the name 'Kevin' in it, it would send an email. If not, it might do nothing.
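The core of such a script could be as small as the sketch below. The handle_tweet function and its send_email callback are hypothetical; how the tweet text arrives (say, as the body of the POST) and how the email actually gets sent are left out on purpose.

```python
def handle_tweet(tweet_text, send_email, name="Kevin"):
    """Forward the tweet by email if it mentions the name; otherwise do nothing.

    send_email is any callable that takes the message text. Returns True if
    the tweet was forwarded.
    """
    if name in tweet_text:
        send_email(tweet_text)
        return True
    return False


# Demonstration with a stand-in for a real mailer.
outbox = []
handle_tweet("Lunch with Kevin today", outbox.append)
handle_tweet("Just pushed a new blog post", outbox.append)
```

After those two calls only the first tweet ends up in outbox; the second is silently ignored.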
I think that this concept is very powerful not only in terms of rendering trivial the interoperability between disparate services, but also in terms of simply saving on bandwidth and computing power. Technologies which constantly poll resources hoping for updated content seem silly in comparison to the powerful simplicity that web hooks provide.
There are definitely some drawbacks to a system like this. Firstly, the name: I actually can't think of a worse name for this concept. Web hooks?! Let's come up with something better. All joking aside though, this type of system does face a serious problem when it comes to the question of reliability. If a script receives no POST, it could mean that either no event happened, or that the internet connection went down for a bit, or that the service is down, or any number of other possible things. I think the solution for this is a hybrid model of sparse polling in conjunction with web hooks.
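One way to picture that hybrid is a small bit of bookkeeping on the receiving end: remember when the service was last heard from, and only fall back to polling once it has been quiet for too long. The HookWithFallback class and the one-hour threshold below are assumptions for illustration, and timestamps are passed in explicitly so the logic is easy to follow.

```python
class HookWithFallback:
    """Decides when a web hook receiver should fall back to polling."""

    def __init__(self, poll_after_seconds):
        self.poll_after_seconds = poll_after_seconds
        self.last_heard = None  # time of the last hook or successful poll

    def on_hook(self, now):
        """Call whenever a web hook POST arrives."""
        self.last_heard = now

    def on_poll(self, now):
        """Call after a successful manual poll."""
        self.last_heard = now

    def should_poll(self, now):
        """True if nothing has been heard for poll_after_seconds or longer."""
        if self.last_heard is None:
            return True
        return now - self.last_heard >= self.poll_after_seconds


watcher = HookWithFallback(poll_after_seconds=3600)
watcher.on_hook(now=100)
```

With this in place, a quiet hour means either nothing happened or something broke, and a single cheap poll is enough to tell the two apart.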
Most of all, this technology just seems so underused. There are ridiculously few people who implement something like this, yet it seems like an undeniably useful service--especially given how simple it is to implement. Let's all try to encourage the services that we use on a daily basis to support web hooks, because by doing just that, we can make the web a lot better.

