My Thoughts on NoSQL

Over the past few years, relational databases have fallen out of favor for a number of influential people in our industry. I'd like to weigh in on that, but before doing so, I'd like to give my executive summary of the events leading up to this movement:

In the late nineties and early thousands, websites were mostly read-only--a publisher would create some content and users would consume that content. The data access patterns for these types of applications became very well-understood, and as a result many tools were created and much research and development was done to further develop these technologies.

As the web has grown more social, however, more and more it's the people themselves who have become the publishers. And with that fundamental shift away from read-heavy architectures to read/write and write-heavy architectures, a lot of the way that we think about storing and retrieving data needed to change.

Most people have done this by relying less on the features provided by traditional relational databases and engineering more database logic in their application code. Essentially, they stop using relational databases the way they were intended to be used, and they instead use them as dumb data stores.

Other people have engineered new database systems from the ground up, each with a different set of tradeoffs and differences from their relational database brethren. It's these new databases that have some in our industry excited, and it's these databases that I'm going to focus on primarily in this post.

(By the way, there's a whole lot more theory behind the movement away from SQL. Primarily of interest is the CAP theorem and the Dynamo paper. Both of these illustrate the necessary tradeoffs of between different approaches to designing databases.)

Let's get this out of the way

I love SQL. More than even that, I love my precious ORM and being able to query for whatever information I want whenever I want it. For the vast majority of sites out there (we're talking 99.9% of the sites out there, folks) it suits their needs very well, providing a good balance of ease of use and performance.

There's no reason for them to switch away from SQL, and there's no way they will. If there's one thing I don't like about this whole NoSQL movement, it's the presumption that everyone who's interested in alternative databases hates the status quo. That's simply not true.

But we're not talking about most sites out there, we're not talking about the status quo, we're talking about the few applications that need something totally different.

Tokyo Cabinet / Tokyo Tyrant

Tokyo Cabinet (and its network interface, Tokyo Tyrant) is the logical successor to Berkeley DB--a blazing fast, open-source, embeddable key-value store that does just about what you would expect from its description. It supports 3 modes of operation: hashtable mode, b-tree mode, and table mode.

(Table mode still pretty much sucks, and I'm not convinced it's a good idea for the project since it's added bloat and other systems like RDBMs are probably better for storing tabular data, so I'm going to skip it.)

Essentially, the API into Tokyo Cabinet is that of a gigantic associative array. You give it a key and a value, and then later, given a key, it will give you back the value you put in. Its largest assets are that it's fast and straightforward.

If your problem is such that you have a small to medium-sized amount of data, which needs to be updated rapidly, and can be easily modeled in terms of keys and values (almost all scenarios can be rewritten in terms of keys and values, but some problems are easier to convert than others), then Tokyo Cabinet and Tokyo Tyrant are the way to go.

CouchDB

CouchDB is similar to Tokyo Cabinet in that it essentially maps keys to data, but CouchDB's philosophy is completely different. Instead of arbitrary data, its data has structure--it's a JSON object. Instead of only being able to query by keys, you can upload functions that index your data for you and then you can call those functions. All of this is done over a very simple REST interface.

But none of this really matters. None of these really set CouchDB apart, because you could just encode JSON data and store it in Tokyo Cabinet, you can maintain your own indexes of data fairly easily, and you can build a simple REST API in a matter of days, if not hours.

What really sets CouchDB apart from the pack is it's innovative replication strategy. It was written in such a way that nodes which are disconnected for long periods of time can reconnect, sync with each other, and reconcile their differences in a way that no other database (since Lotus Notes?) could do.

It's functionality that allows for interesting and new distributed types of applications and data that I think could possibly change the way we take our applications offline. I imagine that some day every computer will come with CouchDB pre-installed and it'll be a data store that we use without even knowing that we're using it.

However, I wouldn't choose it for a super high scalability site with lots of data and sharding and replication and high availability and all those buzzwords, because I'm not convinced it's the right tool for that job, but I am convinced that its replication strategy will keep it relevant for years to come.

Redis

Wow, looking at the bullet points this database seems to do just about everything, perfectly! Yeah, it's a bit prone to hyperbole and there are some great things about it, but a lot of it is hot air. For example, it claims to support sharding but really all it does is have the client run a hash function on its key and use that to determine which server to send its value to. This is something that any database can do.

When you get down to it, Redis is a key-value store which provides a richer API than something like Tokyo Cabinet. It does more operations in memory, only periodically flushing to disk, so there's more of a risk that you could lose data on a crash. The tradeoff is that it's extremely fast, and it does some neat things like allow you to append a value to the end of a list of items already stored for a given key.

It also has atomic operations. This is honestly the only reason I find this project interesting, because the atomic operation support that it has means that it can be turned into a best-of-breed tally server. If you are building a server to keep real-time counts of various things, you would be remiss to overlook Redis as a very viable option.

Cassandra

It's good to save the best for last, and that's exactly what I've done as I find Cassandra to be easily the most interesting non-relational database out there today. Originally developed by Facebook, it was developed by some of the key engineers behind Amazon's famous Dynamo database.

Cassandra can be thought of as a huge 4-or-5-level associative array, where each dimension of the array gets a free index based on the keys in that level. The real power comes from that optional 5th level in the associative array, which can turn a simple key-value architecture into an architecture where you can now deal with sorted lists, based on an index of your own specification. That 5th level is called a SuperColumn, and it's one of the reasons that Cassandra stands out from the crowd.

Cassandra has no single points of failure, and can scale from one machine to several thousands of machines clustered in different data centers. It has no central master, so any data can be written to any of the nodes in the cluster, and can be read likewise from any other node in the cluster.

It provides knobs that can be tweaked to slide the scale between consistency and availability, depending on your particular application and problem domain. And it provides a high availability guarantee, that if one node goes down, another node will step in to replace it smoothly.

Writing about all the features of Cassandra is a whole different post, but I am convinced that its data model is rich enough to support a wide variety of applications while providing the kind of extreme scalability and high availability features that few other databases can achieve--all while maintaining a lower latency than other solutions out there.

Conclusion

There are many other non-relational databases out there: HBase and Hypertable, which are replicating Google's BigTable despite its complexity and problems with single points of failure. MongoDB is another database that has been getting some traction, but it seems to be a jack of all trades, master of none. In short, the above databases are the ones that I find interesting right now, and I would use each of them for different use cases.

What do you all think about this whole non-relational database thing? Do you agree with my thoughts or do you think I'm full of it?

Easy Multi-Database Support for Django

Background

One of the most requested features in Django is that it support connecting to multiple databases at once. This can come in several flavors, but the two most common cases are sharding, and (vertical) partitioning. If you've been waching closely, some of the core developers have been saying in various places for a few months now that this is technically possible, right now, in Django 1.0.

Of course, being technically possible is a long way from being easy. Right now there is no public API for dealing with multiple databases. So why do the developers say that it's possible to do? The answer is simple: shortly before Django 1.0 was released, much of the internals of QuerySet objects (Django's interface to the database) were refactored to use object state-level connection objects instead of a global connection object.

This seemingly-small change opens the doors for multiple databases, even if there is no API in front of it. So let's create an API. We're going to be focusing on vertical partitioning, since it's slightly easier, but the technique demonstrated here will be illustrative when implementing sharding as well. Oh, and since we're poking deep into the core of Django's internals, I'm obliged to give the standard disclaimer: this is not supported and may break in future versions of Django, so use these techniques at your own risk.

First things first

The first thing that needs to be done when implementing multiple database support is to supply Django with the information about all of the databases that you would like to connect to. Here's how that should look in settings.py:

DATABASE_ENGINE = 'sqlite3'
DATABASE_NAME = 'primary.db'
DATABASE_USER = ''
DATABASE_PASSWORD = ''
DATABASE_HOST = ''
DATABASE_PORT = ''

DATABASES = dict(
    primary = dict(
        DATABASE_ENGINE=DATABASE_ENGINE,
        DATABASE_NAME=DATABASE_NAME,
        DATABASE_USER=DATABASE_USER,
        DATABASE_PASSWORD=DATABASE_PASSWORD,
        DATABASE_HOST=DATABASE_HOST,
        DATABASE_PORT=DATABASE_PORT,
    ),
    secondary = dict(
        DATABASE_ENGINE=DATABASE_ENGINE,
        DATABASE_NAME='secondary.db',
        DATABASE_USER=DATABASE_USER,
        DATABASE_PASSWORD=DATABASE_PASSWORD,
        DATABASE_HOST=DATABASE_HOST,
        DATABASE_PORT=DATABASE_PORT,
    ),
)

We have not only created the typical database information that Django requires, but we've also created a dictionary containing information about all of the databases that we intend to connect to. In this case, we are connecting to two sqlite databases in the same directory, named primary.db and secondary.db.

Let's now create an app, named blog (I know, I know, very unoriginal). The models.py will look like this:

import datetime
from django.db import models

class Post(models.Model):
    title = models.TextField()
    body = models.TextField()
    date_submitted = models.DateTimeField(default=datetime.datetime.now)

class Link(models.Model):
    url = models.URLField()
    description = models.TextField(null=True, blank=True)
    date_submitted = models.DateTimeField(default=datetime.datetime.now)

And we hook it up to the admin and settings.py in the normal manner. For more information on how to do this, follow the official tutorial. We're going to be storing the Post objects in the primary database, and the Link objects in the secondary database. Since they don't have any foreign keys, we don't have to worry about joins. (They are possible, but not easy to describe in one post.)

Multiple databases

We should probably write some code that will inspect all of our models and create only the tables that we want in each database. For the sake of simplicity and practicality of a blog post, we're not going to do that. Instead, we will simply create all of the schema on both databases. The management command to do so might look something like this (I called it multi_syncdb):

from django.core.management.base import NoArgsCommand
from django.core.management import call_command
from django.conf import settings

class Command(NoArgsCommand):
    help = "Sync multiple databases."

    def handle_noargs(self, **options):
        for name, database in settings.DATABASES.iteritems():
            print "Running syncdb for %s" % (name,)
            for key, value in database.iteritems():
                setattr(settings, key, value)
            call_command('syncdb')

All of this has been fine, but the real workhorse of multiple database support lies in the model's Manager. Let's write a multi-db aware manager right now:

from django.db import models
from django.conf import settings
from django.db.models import sql
from django.db.transaction import savepoint_state

try:
    import thread
except ImportError:
    import dummy_thread as thread

class MultiDBManager(models.Manager):
    def __init__(self, database, *args, **kwargs):
        self.database = database
        super(MultiDBManager, self).__init__(*args, **kwargs)

    def get_query_set(self):
        qs = super(MultiDBManager, self).get_query_set()
        qs.query.connection = self.get_db_wrapper()
        return qs

    def get_db_wrapper(self):
        database = settings.DATABASES[self.database]
        backend = __import__('django.db.backends.' + database['DATABASE_ENGINE']
            + ".base", {}, {}, ['base'])
        backup = {}
        for key, value in database.iteritems():
            backup[key] = getattr(settings, key)
            setattr(settings, key, value)
        wrapper = backend.DatabaseWrapper()
        wrapper._cursor(settings)
        for key, value in backup.iteritems():
            setattr(settings, key, value)
        return wrapper

    def _insert(self, values, return_id=False, raw_values=False):
        query = sql.InsertQuery(self.model, self.get_db_wrapper())
        query.insert_values(values, raw_values)
        ret = query.execute_sql(return_id)
        query.connection._commit()
        thread_ident = thread.get_ident()
        if thread_ident in savepoint_state:
            del savepoint_state[thread_ident]
        return ret

I know that's a lot of code! Let's go through each piece one-by-one. In the __init__ function, we're just taking in the name of the database that we want to use, and passing the rest into the inherited __init__ function.

get_query_set gets the QuerySet instance that it would have gotten, but replaces the connection on the query object with one provided by the manager, before returning the QuerySet. In essence, this get_db_wrapper function is doing the bulk of the work.

get_db_wrapper first gets the dictionary of the database connection information for the given database name (captured from __init__), then dynamically imports the correct database backend from Django. It then sets the global settings to the values that they should be for that database (while backing up the original settings for restoration later). Then, it initializes that database connection, and restores the settings to their original values.

Most of the database operations are done through the QuerySet, there is still one operation which takes place elsewhere--saving. To account for that, we needed to override the _insert method on the manager. In fact, all we're doing here is providing the InsertQuery with the correct connection and executing that query. Then, we need to ensure that the query is committed and do any transaction management that's necessary.

That's it!

How do we specify that one ore more models will use another database then? Because so far all that we have done is write this MultiDBManager. We will just add one line assigning the manager to our Link model. The model now looks like this:

class Link(models.Model):
    url = models.URLField()
    description = models.TextField(null=True, blank=True)
    date_submitted = models.DateTimeField(default=datetime.datetime.now)

    _default_manager = MultiDBManager('secondary')

Conclusion

The MultiDBManager can be re-used for any number of models to be partitioned on to any number of databases. The hard part is making sure that none of the models in one database reference any models in the other database. It's possible to do it, by storing the foreign key as a regular integer and querying for all of the referenced model instances through Python instead of using the database (for obvious reasons), but then it becomes much harder.

It will be great when Django provides a public API for doing this in a more transparent way, but for now this works. Please let me know if you use any of these techniques for large scale Django deployments, and if so, what were the problems that were encountered along the way?

Why use a VARCHAR when you can use TEXT?

Admit it, you've done it many times before: you create a database schema, arbitrarily guessing at upper limits to the lengths of various columns, only to later have to perform annoying schema upgrades as you change those columns to fit real-world data.

If you're using PostgreSQL, what you're doing is pointless. There is quite literally no performance benefit to be had, and possibly a performance penalty as the database needs to check the length constraint. This fact can even be found right here, in the official documentation.

If you're using MySQL with InnoDB, it's practically the same situation. The data is laid out on disk in exactly the same way for both TEXT and VARCHAR fields, as explained here. I couldn't find any resources about MyISAM other than that TEXT is stored externally, but I just fired up a test table and did some rudimentary benchmarking and the numbers were well within the margin of error.

If you're using SQLite, everything's a TEXT whether you want it to be or not ( with the notable exception of INTEGER PRIMARY KEY) so it doesn't matter what you try to specify, it will be a TEXT.

I'm as guilty as anyone else on this--I use varchar all the time! But come on, let's stop imposing these arbitrary character limits on our columns when the only reason we're doing it is for historical reasons. Is anyone with me?

Secrets of the Django ORM

You won't see this in the Django documentation, you won't see it mentioned on other blogs, and you certainly won't hear the core developers of Django boasting about it, but Django's ORM has a secret weapon: it supports SQL group_by and having clauses, and it has for quite some time.

It's not part of the public QuerySet API, but rather a part of the private Query API. But just because it's not part of the public API doesn't mean that it's not easy to use--it just means that it might change in the future. So it's really a "use at your own risk" type of deal now. If you're up for keeping on top of things so that you know what to change when the next version of Django comes out, then read on. First, let's start with some model definition:

class TumbleItem(models.Model):
    title = models.CharField(max_length=255)
    item_type = models.CharField(max_length=50)

    def __unicode__(self):
        return '%s: "%s"' % (self.item_type, self.title)

A simple tumblog item. Very simple, as in, not really useful at all. But that's OK since this is just a demonstration. To demonstrate, let's create some data:

>>> ti1 = TumbleItem.objects.create(title='Blog Post 1', item_type='blog')
>>> ti2 = TumbleItem.objects.create(title='Blog Post 2', item_type='blog')
>>> ti3 = TumbleItem.objects.create(title='Blog Post 3', item_type='blog')
>>> ti4 = TumbleItem.objects.create(title='Article Dugg 1', item_type='digg')
>>> ti5 = TumbleItem.objects.create(title='Article Dugg 2', item_type='digg')
>>> ti6 = TumbleItem.objects.create(title='Link Saved 1', item_type='link')

OK now that we've loaded some data, let's use the group_by functionality!

>>> qs = TumbleItem.objects.all()
>>> qs.query.group_by = ['item_type']
>>> item_types = [i.item_type for i in qs]
>>> item_types
[u'blog', u'digg', u'link']

There we go, it's quick, easy, and it seems to Just Work. But let's try to grab only the item_types which have more than one item:

>>> qs = TumbleItem.objects.all()
>>> qs.query.group_by = ['item_type']
>>> qs.query.having = ['COUNT(item_type) > 1']
>>> item_types = [i.item_type for i in qs]
>>> item_types
[u'blog', u'digg']

And now we've successfully used the group_by and having functionality in the Django ORM. I'm excited for some aggregation functionality to start being exposed as a public API, as I'm sure it will be more elegant than this solution, but at the same time this is a neat hidden secret in the Django ORM. Well now you have the knowledge, so you have the power, and it's up to you to use it wisely!

Database triggers aren't evil, and they actually kind of rock

Who says they suck?

Nobody that I've seen has come out and actually said that they don't like database triggers, but at the same time, Python (Django) programmers like to program in Python. And PL/pgSQL certainly is not Python. There's a tendency to do everything in Python--especially with the use of Django's dispatcher.

But there's some serious overhead with that approach, and roundtrips, and race conditions, etc. If you're using a good database, there's an alternative: you guessed it, database triggers.

Let's set up some models

Here will be our models for the remainder of this post:

class Bookmark(models.Model):
    title = models.CharField(max_length=100)
    url = models.URLField(max_length=255)

    num_votes = models.PositiveIntegerField(null=True, blank=True, default=0)
    score = models.IntegerField(null=True, blank=True, default=0)

class Vote(models.Model):
    bookmark = models.ForeignKey(Bookmark, related_name='votes')
    value = models.IntegerField()

As you can tell, we have a straightforward Bookmark and Vote models here. But there are also two denormalized fields: num_votes, and score.

Doing it in Python

The advantage of doing this in Python is that it's simple and Django supports it out of the box. Here's how the code for that would look:

from django.db.models import signals

def update_bookmark_aggregate(sender, instance, **kwargs):
    bmark = instance.bookmark
    bmark.num_votes = bmark.votes.count()
    bmark.score = sum(bmark.votes.values_list('value', flat=True))
    bmark.save(force_update=True)
signals.post_save.connect(update_bookmark_aggregate, sender=Vote)

Very simply, every time a vote is saved, the update_bookmark_aggregate function is called which updates the bookmark with its new score and num_votes.

Doing it in Pl/pgSQL

Create a new file, named management.py under your bookmarks app directory. Its contents will be as follows:

from django.db.models import signals
from bookmarks import models

sql = """
CREATE LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION update_bookmark_aggregate_trigger()
    RETURNS "trigger" AS '
    DECLARE
        new_score INTEGER;
        new_num_votes INTEGER;
    BEGIN
        SELECT COUNT(*) INTO STRICT new_num_votes FROM bookmarks_vote
            WHERE bookmark_id = NEW.bookmark_id;
        SELECT COALESCE(SUM(value), 0) INTO STRICT new_score FROM bookmarks_vote
            WHERE bookmark_id = NEW.bookmark_id;
    UPDATE bookmarks_bookmark
        SET
            score = new_score,
            num_votes = new_num_votes
        WHERE id = NEW.bookmark_id;
    RETURN NEW;
    END;' LANGUAGE 'plpgsql' VOLATILE;

CREATE TRIGGER update_bookmark_aggregate_trigger AFTER INSERT OR UPDATE
    ON bookmarks_vote FOR EACH ROW
    EXECUTE PROCEDURE update_bookmark_aggregate_trigger();
"""

def create_trigger(app, created_models, verbosity, **kwargs):
    from django.db import connection
    cursor = connection.cursor()
    cursor.execute(sql)
signals.post_syncdb.connect(create_trigger, sender=models)

In this file we have declared two variables, new_score, and new_num_votes. We do two queries to get the aggregate data. And then we update the bookmark to reflect the new aggregated values. This script is executed once when the bookmarks models are first loaded into the database, and we're all set!

Let's see how it works

>>> from bookmarks.models import Bookmark, Vote
>>> b = Bookmark.objects.create(title="Blog", url='http://eflorenzano.com/')
>>> b.num_votes
0
>>> b.score
0
# There is no aggregate data yet
>>> Vote.objects.create(bookmark=b, value=1)
<Vote: Vote object>
>>> Vote.objects.create(bookmark=b, value=2)
<Vote: Vote object>
# We need to re-query for the bookmark, due to no identity map in Django.
>>> b = Bookmark.objects.all()[0]
>>> b.num_votes
2
>>> b.score
3

Voila! This was all done in the database behind the scenes. Very cool, very fast, and it kind of rocks.

Search

Badges

  • django badge
  • apache badge
  • GeoURL
  • XFN Friendly
  • Valid HTML 4.01 Transitional