You care about Facebook, you just might not know it yet

I recently had a conversation with a few friends that I know through the Python community. These are people who I respect a great deal, and look to for advice and insight when it comes to web development. During the course of this conversation, the subject of Facebook came up. What I wasn't expecting at all was that a large number of them said that they have no interest in Facebook, and quite a few of them proudly didn't even have accounts.

To put it bluntly, I believe that ignorance of Facebook is a major handicap today, both for developers and for entrepreneurs, and people who are not paying attention to it are burying their heads in the sand.

Facebook has gotten to a point where it's the destination where the most time is spent online. It gets an estimated 260 billion (that's a B on that number) pageviews per month. In a single week, new properties are able to garner 8.6 million new active users. Sites which implement Facebook Connect typically see a 1.3x-2x boost in registrations. Companies building on Facebook's platform are being sold for 400 million dollars and are making 100 million dollars yearly. There are over 300 million active registered users. In other words, if you talk to a person that spends any time on the internet at all, they likely have a Facebook account. These numbers, by the way, are trending up and to the right™.

Take the word Facebook out of the above paragraph, and replace it with icanhazcheeseburger. Replace it with 4chan, or somethingawful, or barbie, or anything. No matter which name you substitute, those numbers are compelling enough that ignoring them would be irresponsible.

Let me be perfectly clear: I don't particularly enjoy using Facebook. I find its UI cluttered, its privacy controls confusing, and its content fairly trivial. From the development side, Facebook's APIs are clunky at best. I'm definitely not advocating that you should log in every day and love it. I don't. The general populace, however, does. I'm simply advocating that you should at least have an account, log in every once in a while, and keep tabs on the announcements that they make for developers.

As people spend more and more time consuming and producing content inside of the Facebook ecosystem, it's going to be those who change and embrace it who succeed, and those who fail to adapt that stand to lose. Note that I said producing content "inside of the Facebook ecosystem", and not "inside Facebook", as Facebook has been making very strong pushes to extend the reach of its platform well outside of its destination site. You can augment your site to have people pushing content into Facebook from your own destination sites.

Facebook is willing to give you access to a person's information, to their entire social graph, and to allow your user to become an advocate of your site, all while making registration on your site as simple as one click. Sure, you may evaluate all of that as an option and decide that it's not important enough to implement, but surely it's an important enough option to warrant your cognizance.

Do I think that Facebook will be relevant in 10 years? Probably, but I'm not willing to bet much on it. One thing you can be absolutely sure about, though, is that if another service with this much power comes along, I'll have an account and be acutely aware of its developer resources.

My Thoughts on NoSQL

Over the past few years, relational databases have fallen out of favor for a number of influential people in our industry. I'd like to weigh in on that, but before doing so, I'd like to give my executive summary of the events leading up to this movement:

In the late nineties and early 2000s, websites were mostly read-only--a publisher would create some content and users would consume that content. The data access patterns for these types of applications became very well-understood, and as a result many tools were created and much research and development was done to further develop these technologies.

As the web has grown more social, however, more and more it's the people themselves who have become the publishers. And with that fundamental shift away from read-heavy architectures to read/write and write-heavy architectures, a lot of the way that we think about storing and retrieving data needed to change.

Most people have done this by relying less on the features provided by traditional relational databases and engineering more database logic in their application code. Essentially, they stop using relational databases the way they were intended to be used, and they instead use them as dumb data stores.

Other people have engineered new database systems from the ground up, each with a different set of tradeoffs and differences from their relational database brethren. It's these new databases that have some in our industry excited, and it's these databases that I'm going to focus on primarily in this post.

(By the way, there's a whole lot more theory behind the movement away from SQL. Primarily of interest are the CAP theorem and the Dynamo paper. Both of these illustrate the necessary tradeoffs between different approaches to designing databases.)

Let's get this out of the way

I love SQL. More than even that, I love my precious ORM and being able to query for whatever information I want whenever I want it. For the vast majority of sites out there (we're talking 99.9% of the sites out there, folks) it suits their needs very well, providing a good balance of ease of use and performance.

There's no reason for them to switch away from SQL, and there's no way they will. If there's one thing I don't like about this whole NoSQL movement, it's the presumption that everyone who's interested in alternative databases hates the status quo. That's simply not true.

But we're not talking about most sites out there, we're not talking about the status quo, we're talking about the few applications that need something totally different.

Tokyo Cabinet / Tokyo Tyrant

Tokyo Cabinet (and its network interface, Tokyo Tyrant) is the logical successor to Berkeley DB--a blazing fast, open-source, embeddable key-value store that does just about what you would expect from its description. It supports 3 modes of operation: hashtable mode, b-tree mode, and table mode.

(Table mode still pretty much sucks, and I'm not convinced it's a good idea for the project, since it adds bloat and other systems like RDBMSs are probably better for storing tabular data, so I'm going to skip it.)

Essentially, the API into Tokyo Cabinet is that of a gigantic associative array. You give it a key and a value, and then later, given a key, it will give you back the value you put in. Its largest assets are that it's fast and straightforward.
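To make the "gigantic associative array" idea concrete, here is a minimal sketch that uses Python's standard-library dbm.dumb module as a stand-in for an embedded key-value store. This is not the Tokyo Cabinet API, and the file name and keys are made up for illustration.

```python
import dbm.dumb
import os
import tempfile

# Stand-in for an embedded key-value store; NOT the Tokyo Cabinet API.
# The path and key names are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "example_store")

db = dbm.dumb.open(path, "c")      # "c": create the store if it doesn't exist
db[b"user:1:name"] = b"alice"      # put: give it a key and a value
db.close()

db = dbm.dumb.open(path, "r")      # reopen later, read-only
value = db[b"user:1:name"]         # get: given the key, get the value back
missing = db.get(b"user:2:name")   # a key that was never stored yields None
db.close()
```

Everything beyond put and get (serialization, indexing, relationships between records) is left to the application, which is exactly the tradeoff being described.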

If your problem is such that you have a small to medium-sized amount of data, which needs to be updated rapidly, and can be easily modeled in terms of keys and values (almost all scenarios can be rewritten in terms of keys and values, but some problems are easier to convert than others), then Tokyo Cabinet and Tokyo Tyrant are the way to go.

CouchDB

CouchDB is similar to Tokyo Cabinet in that it essentially maps keys to data, but CouchDB's philosophy is completely different. Instead of arbitrary data, its data has structure--it's a JSON object. Instead of only being able to query by keys, you can upload functions that index your data for you and then you can call those functions. All of this is done over a very simple REST interface.
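As a rough illustration of the "upload functions that index your data" idea, here is a sketch in plain Python. CouchDB's own view functions are written differently and run inside the database; the documents, the by_author function, and build_index below are all hypothetical stand-ins.

```python
import json

# Hypothetical documents; in CouchDB these would be stored JSON objects.
docs = {
    "post-1": json.loads('{"type": "post", "author": "alice", "title": "Hello"}'),
    "post-2": json.loads('{"type": "post", "author": "bob", "title": "World"}'),
    "post-3": json.loads('{"type": "post", "author": "alice", "title": "Again"}'),
}


def by_author(doc):
    """Map function: yield the (key, value) pairs to index for one document."""
    if doc.get("type") == "post":
        yield doc["author"], doc["title"]


def build_index(docs, map_func):
    """Run the map function over every document and sort the rows by key."""
    rows = []
    for doc_id, doc in docs.items():
        for key, value in map_func(doc):
            rows.append((key, doc_id, value))
    return sorted(rows)


index = build_index(docs, by_author)
# Query the index by key instead of by document id.
alice_titles = [value for key, _doc_id, value in index if key == "alice"]
```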

But none of this really matters. None of these really set CouchDB apart, because you could just encode JSON data and store it in Tokyo Cabinet, you can maintain your own indexes of data fairly easily, and you can build a simple REST API in a matter of days, if not hours.

What really sets CouchDB apart from the pack is its innovative replication strategy. It was written in such a way that nodes which are disconnected for long periods of time can reconnect, sync with each other, and reconcile their differences in a way that no other database (since Lotus Notes?) could do.

It's functionality that allows for interesting and new distributed types of applications and data that I think could possibly change the way we take our applications offline. I imagine that some day every computer will come with CouchDB pre-installed and it'll be a data store that we use without even knowing that we're using it.

However, I wouldn't choose it for a super high scalability site with lots of data and sharding and replication and high availability and all those buzzwords, because I'm not convinced it's the right tool for that job, but I am convinced that its replication strategy will keep it relevant for years to come.

Redis

Wow, looking at the bullet points this database seems to do just about everything, perfectly! Yeah, it's a bit prone to hyperbole and there are some great things about it, but a lot of it is hot air. For example, it claims to support sharding but really all it does is have the client run a hash function on its key and use that to determine which server to send its value to. This is something that any database can do.
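To show how little is involved in that kind of client-side sharding, here is a minimal sketch. The server addresses and the server_for function are hypothetical, and this is not code from any Redis client.

```python
import hashlib

# Hypothetical server addresses, for illustration only.
SERVERS = ["10.0.0.1:6379", "10.0.0.2:6379", "10.0.0.3:6379"]


def server_for(key, servers=SERVERS):
    """Pick a server by hashing the key.

    The same key always maps to the same server, so reads find
    whatever an earlier write stored.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


target = server_for("user:42:pageviews")
```

One caveat with this simple modulo scheme is that adding or removing a server remaps most keys, which is the problem that consistent hashing is meant to soften.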

When you get down to it, Redis is a key-value store which provides a richer API than something like Tokyo Cabinet. It does more operations in memory, only periodically flushing to disk, so there's more of a risk that you could lose data on a crash. The tradeoff is that it's extremely fast, and it does some neat things like allow you to append a value to the end of a list of items already stored for a given key.

It also has atomic operations. This is honestly the only reason I find this project interesting, because the atomic operation support that it has means that it can be turned into a best-of-breed tally server. If you are building a server to keep real-time counts of various things, you would be remiss to overlook Redis as a very viable option.
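Here is a small sketch of why atomic increments matter for a tally server. It is an in-process stand-in built on a lock, not the Redis API; the Tally class and the counter name are made up for illustration.

```python
import threading
from collections import defaultdict


class Tally:
    """In-process stand-in for a tally server; NOT the Redis API."""

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def incr(self, name, amount=1):
        # The lock turns read-modify-write into a single atomic step,
        # so concurrent callers never overwrite each other's updates.
        with self._lock:
            self._counts[name] += amount
            return self._counts[name]

    def get(self, name):
        with self._lock:
            return self._counts[name]


tally = Tally()


def worker():
    for _ in range(1000):
        tally.incr("pageviews")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = tally.get("pageviews")  # 8 workers x 1000 increments each
```

With a store that only offers get and set, every client would have to read the count, add one, and write it back, and two clients doing that at the same moment could lose an update.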

Cassandra

It's good to save the best for last, and that's exactly what I've done, as I find Cassandra to be easily the most interesting non-relational database out there today. Originally developed at Facebook, it was built by some of the key engineers behind Amazon's famous Dynamo database.

Cassandra can be thought of as a huge 4-or-5-level associative array, where each dimension of the array gets a free index based on the keys in that level. The real power comes from that optional 5th level in the associative array, which can turn a simple key-value architecture into an architecture where you can now deal with sorted lists, based on an index of your own specification. That 5th level is called a SuperColumn, and it's one of the reasons that Cassandra stands out from the crowd.
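One way to picture that nested associative array is with plain nested dictionaries. The sketch below is only a mental model, not a Cassandra client, and every name in it (the column families, row keys, and the get_slice helper) is hypothetical. Sorting here is a plain string sort, standing in for whatever ordering you configure.

```python
# Nested dicts as a stand-in for the data model; NOT a Cassandra client.
# All names below are hypothetical.
keyspace = {
    "Users": {                              # column family (4 levels deep)
        "alice": {                          # row key
            "email": "alice@example.com",   # column name -> value
        },
    },
    "Timeline": {                           # super column family (5 levels deep)
        "alice": {                          # row key
            "2009-11-01": {                 # super column name
                "post-17": "Hello",         # column name -> value
                "post-12": "World",
            },
            "2009-10-30": {
                "post-09": "Earlier",
            },
        },
    },
}


def get_slice(row, start, finish):
    """Return (name, value) pairs with start <= name <= finish, in sorted order."""
    return [(name, row[name]) for name in sorted(row) if start <= name <= finish]


# The extra level gives a sorted list of groups per row key.
november = get_slice(keyspace["Timeline"]["alice"], "2009-11-01", "2009-11-30")
everything = get_slice(keyspace["Timeline"]["alice"], "2009-10-01", "2009-12-01")
```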

Cassandra has no single points of failure, and can scale from one machine to several thousands of machines clustered in different data centers. It has no central master, so any data can be written to any of the nodes in the cluster, and can be read likewise from any other node in the cluster.

It provides knobs that can be tweaked to slide the scale between consistency and availability, depending on your particular application and problem domain. And it provides a high availability guarantee, that if one node goes down, another node will step in to replace it smoothly.
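The arithmetic behind those knobs can be sketched in a few lines. This assumes the usual Dynamo-style setup where each piece of data lives on n replicas, a write waits for w of them, and a read consults r of them; the function names and the replication factor are mine, not Cassandra's.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def read_sees_latest_write(n, r, w):
    """True when any r replicas read must overlap any w replicas written."""
    return r + w > n


N = 3  # hypothetical replication factor

# Quorum reads and writes: 2 + 2 > 3, so every read overlaps the last write.
strong = read_sees_latest_write(N, quorum(N), quorum(N))

# Single-replica reads and writes: 1 + 1 > 3 is false, so a read may be
# stale, but only one replica needs to be reachable for each operation.
fast = read_sees_latest_write(N, 1, 1)
```

Lower values of r and w favor availability and latency, and higher values favor consistency.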

Writing about all the features of Cassandra is a whole different post, but I am convinced that its data model is rich enough to support a wide variety of applications while providing the kind of extreme scalability and high availability features that few other databases can achieve--all while maintaining a lower latency than other solutions out there.

Conclusion

There are many other non-relational databases out there. HBase and Hypertable are replicating Google's BigTable, despite its complexity and problems with single points of failure. MongoDB is another database that has been getting some traction, but it seems to be a jack of all trades, master of none. In short, the above databases are the ones that I find interesting right now, and I would use each of them for different use cases.

What do you all think about this whole non-relational database thing? Do you agree with my thoughts or do you think I'm full of it?

Flojax: An unobtrusive and easy strategy for creating AJAX-style web applications

Writing AJAX-style web applications can be very tedious. If you're using XML as your transport layer, you have to parse the XML before you can work with it. It's a bit easier if you're using JSON, but once you have parsed the data, the data still needs to be turned into HTML markup that matches the current markup on the page. Finally, the newly created markup needs to be inserted into the correct place in the DOM, and any event handlers need to be attached to the appropriate newly-inserted markup.

So there's the parsing, the markup assembly, the DOM insertion, and finally the event handler attachment. Most of the time, people tend to write custom code for each element that needs asynchronous updating. There are several drawbacks with this scenario, but the most frustrating part is probably that the presentation logic is implemented twice--once in a templating language on the server which is designed specifically for outputting markup, and again on the client with inline Javascript. This leads to problems both in the agility and in the maintainability of this type of application.

With flojax, this can all be accomplished with one generalized implementation. The same server-side logic that generates the data for the first synchronous request can be used to respond to subsequent asynchronous requests, and unobtrusive attributes specify what to do for the rest.

The Basics

The first component for creating an application using the flojax strategy is to break up the content that you would like to reload asynchronously into smaller fragments. As a basic example of this, let's examine the case where there is a panel of buttons that you would like to turn into asynchronous requests instead of full page reloads.

The rendered markup for a fragment of buttons could look something like this:

<div class="buttons">
    <a href="/vote/up/item1/">Vote up</a>
    <a href="/vote/down/item1/">Vote down</a>
    <a href="/favorite/item1/">Add to your favorites</a>
</div>

In a templating language, the logic might look something like this:

<div class="buttons">
    {% if voted %}
        <a href="/vote/clear/{{ item.id }}/">Clear your vote</a>
    {% else %}
        <a href="/vote/up/{{ item.id }}/">Vote up</a>
        <a href="/vote/down/{{ item.id }}/">Vote down</a>
    {% endif %}
    {% if favorited %}
        <a href="/unfavorite/{{ item.id }}/">Remove from your favorites</a>
    {% else %}
        <a href="/favorite/{{ item.id }}/">Add to your favorites</a>
    {% endif %}
</div>

(Typically you wouldn't use anchors to do operations that can change state on the server, so you can imagine this would be accomplished using forms. However, for demonstration and clarity purposes I'm going to leave these as links.)

Now that we have written a fragment, we can start using it in our larger templates by way of an include, which might look something like this:

...
<p>If you like this item, consider favoriting or voting on it:</p>
{% include "fragments/buttons.html" %}
...

To change this from being standard links to being asynchronously updated, we just need to annotate a small amount of data onto the relevant links in the fragment.

<div class="buttons">
    {% if voted %}
        <a href="/vote/clear/{{ item.id }}/" class="flojax" rel="buttons">Clear your vote</a>
    {% else %}
        <a href="/vote/up/{{ item.id }}/" class="flojax" rel="buttons">Vote up</a>
        <a href="/vote/down/{{ item.id }}/" class="flojax" rel="buttons">Vote down</a>
    {% endif %}
    {% if favorited %}
        <a href="/unfavorite/{{ item.id }}/" class="flojax" rel="buttons">Remove from your favorites</a>
    {% else %}
        <a href="/favorite/{{ item.id }}/" class="flojax" rel="buttons">Add to your favorites</a>
    {% endif %}
</div>

That's it! At this point, all of the click events that happen on these links will be changed into POST requests, and the response from the server will be inserted into the DOM in place of this div with the class of "buttons". If you didn't catch it, all that was done was to add the "flojax" class onto each of the links, and add a rel attribute that refers to the class of the parent node in the DOM to be replaced--in this case, "buttons".

Of course, there needs to be a server side component to this strategy, so that instead of rendering the whole page, the server just renders the fragment. Most modern Javascript frameworks add a header to the request to let the server know that the request was made asynchronously from Javascript. Here's how the code on the server to handle the flojax-style request might look (in a kind of non-web-framework-specific Python code):

def vote(request, direction, item_id):
    item = get_item(item_id)

    if direction == 'clear':
        clear_vote(request.user, item)
    elif direction == 'up':
        vote_up(request.user, item)
    elif direction == 'down':
        vote_down(request.user, item)

    context = {'voted': direction != 'clear', 'item': item}

    if request.is_ajax():
        return render_to_response('fragments/buttons.html', context)

    # ... the non-ajax implementation details go here

    return render_to_response('items/item_detail.html', context)

There are several advantages to writing your request handlers in this way. First, note that we were able to totally reuse the same templating logic from before--we just render out the fragment instead of including it in a larger template. Second, we have provided a graceful degradation path where users without javascript are able to interact with the site as well, albeit with a worse user experience.
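For reference, the header that the asynchronous check relies on is conventionally X-Requested-With: XMLHttpRequest, which jQuery sends on same-origin requests. A framework-neutral sketch of the check might look like this; the is_ajax function and the plain-dict headers argument are assumptions for illustration, since each framework exposes headers in its own way.

```python
def is_ajax(headers):
    """Return True if the request headers mark it as an XMLHttpRequest.

    `headers` is assumed to be a plain dict of header name -> value.
    Header names are case-insensitive, so normalize before looking up.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-requested-with") == "XMLHttpRequest"


from_script = is_ajax({"X-Requested-With": "XMLHttpRequest"})
from_browser = is_ajax({"Accept": "text/html"})
```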

That's really all there is to writing web applications using the flojax strategy.

Implementation Details

I don't believe that the Javascript code for this method can be easily reused, because each web application tends to have a different way of showing errors and other such things to the user. In this post, I'm going to provide a reference implementation (using jQuery) that can be used as a starting point for writing your own versions. The handlers are attached by a function called flojax_init, which is called on every page load; the real work happens in the click handler, flojax_clicked.

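function flojax_clicked() {
    var link = $(this);
    // The rel attribute names the class of the nearest ancestor to replace.
    var parent = link.closest('.' + link.attr('rel'));

    function successCallback(data, textStatus) {
        // Swap the old fragment for the freshly rendered one, then
        // attach handlers to the links inside the new markup.
        parent.replaceWith(data);
        flojax_init();
    }
    function errorCallback(request, textStatus, errorThrown) {
        alert('There was an error in performing the requested operation');
    }

    $.ajax({
        'url': link.attr('href'),
        'type': 'POST',
        'data': '',
        'success': successCallback,
        'error': errorCallback
    });

    // Cancel the normal link navigation.
    return false;
}

function flojax_init() {
    // Unbind before binding so that re-initializing never attaches the
    // handler to the same link twice (which would send duplicate POSTs).
    // Uses .on()/.off() (jQuery 1.7+); .live() was removed in jQuery 1.9.
    $('a.flojax').off('click', flojax_clicked).on('click', flojax_clicked);
}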

There's really not a lot of code there. It POSTs to the given URL, replaces the specified parent element with the content of the response, and then re-initializes the flojax handler. The re-initialization could even be done in a smarter way, as well, by targeting only the newly inserted content. Also, you might imagine that an alert message probably wouldn't be such a great user experience, so you could integrate error messages into some sort of Javascript messaging or growl-style system.

Extending Flojax

Often times you'll want to do other things on the page when the asynchronous request happens. For our example, maybe there is some kind of vote counter that needs to be updated or some other messages that need to be displayed.

In these cases, I have found that using hidden input elements in the fragments can be useful for transferring that information from the server to the client. This works as long as the value in the hidden elements adheres to some predefined structure that your client knows about (it could even be something like JSON if you need to go that route).
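On the server side, producing such a hidden element could look roughly like the sketch below. The hidden_payload function, the field name, and the payload keys are all hypothetical; the one part that matters is escaping the JSON so that its quotes don't break out of the HTML attribute.

```python
import html
import json


def hidden_payload(name, data):
    """Render a hidden input whose value is JSON, escaped for an HTML attribute."""
    value = html.escape(json.dumps(data), quote=True)
    return '<input type="hidden" name="%s" value="%s">' % (
        html.escape(name, quote=True),
        value,
    )


# Hypothetical extra data to send along with the buttons fragment.
markup = hidden_payload("flojax-data", {"vote_count": 42, "message": "Thanks for voting!"})
```

The client can then read the element's value after the fragment is inserted, parse it, and update the vote counter or display the message.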

If what you want can't be done by extending the fragments in this way, then flojax isn't the right strategy for that particular feature.

Limitations

This technique cannot solve all of the world's problems. It can't even solve all of the problems involved in writing an AJAX-style web application. It can, however, handle a fair amount of simple cases where all you want to do is quickly set up a way for a user's action to replace content on a page.

Some specific examples of things that flojax can't help with are if a user action can possibly update many items on a page, or if something needs to happen without a user clicking on a link. In these situations, you are better off coding a custom solution instead of trying to shoehorn it into the flojax workflow.

Conclusion

Writing AJAX-style web applications is usually tedious, but using the techniques that I've described, a large majority of the tedious work can be reduced. By using the same template code for rendering the page initially as with subsequent asynchronous requests, you ensure that code is not duplicated. By rendering HTML fragments, the client doesn't have to go through the effort of parsing the output and converting the result into correct DOM objects. Finally, by using a few unobtrusive conventions (like the rel attribute and the flojax class), the Javascript code that a web application developer writes is able to be reused again and again.

I don't believe that any of the details that I'm describing are new. In fact, people have been doing most of these things for years. What I think may in fact be new is the generalization of the sum of these techniques in this way. It's still very much a work in progress, though. As I use flojax more and more, I hope to find not only places where it can be extended to cover more use cases, but also its limitations and places where it makes more sense to use another approach.

What do you think about this technique? Are you using any techniques like this for your web applications? If so, how do they differ from what I've described?

Tagging cache keys for O(1) batch invalidation

Recently I've been spending some quality time trying to decrease page load times and decrease the number of database accesses on a site I'm working on. As you would probably suspect, that means dealing with caching. One common thing that I need to do, however, is invalidate a large group of cache keys when some action takes place. I've devised a pattern for doing this, and while I'm sure it's not novel, I haven't seen any recent write-ups of this technique. The base idea is that we're going to add another thin cache layer, and use the value from that first layer in the key to the second layer.

First, let me give a concrete example of the problem that I'm trying to solve. I'm going to use Django/Python from here on in, but you could substitute anything else, as this pattern should work across other frameworks and even other languages.

import datetime
from django.contrib.auth.models import User
from django.db import models

# Item is assumed to be a model defined elsewhere in the application.

class Favorite(models.Model):
    user = models.ForeignKey(User)
    item = models.ForeignKey(Item)
    date_added = models.DateTimeField(default=datetime.datetime.now)

    def __unicode__(self):
        return u'%s has favorited %s' % (self.user, self.item)

Given this model, now let's say that we have a function that gets the Favorite instances for a given user, which might look like this:

def get_favorites(user, start=None, end=None):
    faves = Favorite.objects.filter(user=user)
    return list(faves[start:end])

There's not much here yet--we're simply filtering to only include the Favorite instances for the given user, slicing it based on the given start and end numbers, and forcing evaluation before returning a list. Now let's start thinking about how we will cache this. We'll start by just implementing a naive cache strategy, which in this case simply means that the cache is never invalidated:

from django.core.cache import cache

def get_favorites(user, start=None, end=None):
    key = 'get_favorites-%s-%s-%s' % (user.id, start, end)
    faves = cache.get(key)
    if faves is not None:
        return faves
    # Evaluate the queryset once so that a cache miss returns the same
    # type (a list) as a cache hit.
    faves = list(Favorite.objects.filter(user=user)[start:end])
    cache.set(key, faves, 86400 * 7)  # cache for one week
    return faves

Now we come to the hard part: how do we invalidate those cache keys? It's especially tricky because we don't know exactly what keys have been created. What combinations of start/end have been given? We could invalidate all combinations of start/end up to some number, but that's horribly inefficient and wasteful. So what do we do? My solution is to introduce another layer. Let me explain with code:

import uuid
from django.core.cache import cache

def favorite_list_hash(user):
    key = 'favorite-list-hash-%s' % (user.id,)
    cached_key_hash = cache.get(key)
    if cached_key_hash:
        key_hash = cached_key_hash
    else:
        key_hash = str(uuid.uuid4())
        cache.set(key, key_hash, 86400 * 7)
    return (key_hash, not cached_key_hash)

Essentially what this gives us is a temporary unique identifier for each user, that's either stored in cache or generated and stuffed into the cache. How does this help? We can use this identifier in the keys to the get_favorites function:

from django.core.cache import cache

def get_favorites(user, start=None, end=None):
    key_hash, created = favorite_list_hash(user)
    key = 'get_favorites-%s-%s-%s-%s' % (user.id, start, end, key_hash)
    if not created:
        faves = cache.get(key)
        if faves is not None:
            return faves
    # Evaluate the queryset once so that a cache miss returns the same
    # type (a list) as a cache hit.
    faves = list(Favorite.objects.filter(user=user)[start:end])
    cache.set(key, faves, 86400 * 7)  # cache for one week
    return faves

As you can see, the first thing we do is grab that hash for the user, then we use it as the last part of the key for the function. The whole if not created thing is just an optimization that helps to avoid cache fetches when we know they will fail. Here's the great thing now: invalidating all of the different cached versions of get_favorites for a given user is a single function call:

from django.core.cache import cache

def clear_favorite_cache(user):
    cache.delete('favorite-list-hash-%s' % (user.id,))

By deleting that single key, the next time get_favorites is called, it will call favorite_list_hash which will result in a cache miss, which will mean it will generate a new unique identifier and stuff it in cache, meaning that all of the keys for get_favorites are instantly different. I think that this is a powerful pattern that allows for coarser-grained caching without really sacrificing much of anything.
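To see the whole pattern working end to end without Django or Memcached, here is a self-contained sketch. DictCache is a tiny in-memory stand-in for a cache backend, the "database" is a made-up list, and db_hits exists only to count how often we fall through to it; none of these names come from a real library.

```python
import uuid


class DictCache:
    """Tiny in-memory stand-in for a cache backend (no expiry, no eviction)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


cache = DictCache()
db_hits = []  # records each simulated database query


def tag_for(user_id):
    """First layer: one random tag per user, regenerated after invalidation."""
    tag_key = 'favorite-list-hash-%s' % (user_id,)
    tag = cache.get(tag_key)
    if tag is None:
        tag = str(uuid.uuid4())
        cache.set(tag_key, tag)
    return tag


def get_favorites(user_id, start=None, end=None):
    """Second layer: the tag is part of every cached key for this user."""
    key = 'get_favorites-%s-%s-%s-%s' % (user_id, start, end, tag_for(user_id))
    faves = cache.get(key)
    if faves is None:
        db_hits.append((user_id, start, end))  # pretend to query the database
        faves = ['item-%d' % i for i in range(10)][start:end]
        cache.set(key, faves)
    return faves


def clear_favorite_cache(user_id):
    """Deleting the tag orphans every key that was built from it."""
    cache.delete('favorite-list-hash-%s' % (user_id,))


get_favorites(1, 0, 5)     # miss: queries the "database"
get_favorites(1, 0, 5)     # hit
get_favorites(1, 5, 10)    # miss: different slice
clear_favorite_cache(1)    # one delete invalidates both slices
get_favorites(1, 0, 5)     # miss again, under the new tag
```

The invalidation cost stays at one delete no matter how many start/end combinations were cached.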

There is one aspect of this technique that some people will not like: it leaves old cache keys around taking up memory. I don't consider this a problem because memory is cheap these days and Memcached is generally smart about evicting the least recently used data.

I'm interested though, since I don't see people posting much about nontrivial cache key generation and invalidation. How are you doing this type of thing? Are most people just doing naive caching and calling that good enough?

Notice Something Different?

Several months ago after rolling out the lifestreaming features on this site, I became unhappy with the way that the design scaled with the new feature. Around that time I did a purely-visual refactor (just templates and CSS changed--no code), but it has sat untouched on my hard drive for several months.

Today I ran across that redesign and since it essentially looked complete, I deployed it to the site. I'm still not very happy with the design, but I think that this time it's at least easier to read than the last one, which had become very cluttered and difficult to navigate and read.

Let me know what you think in the comments.
