Scalability is a problem. It always has been. The solution in the past has always been either to buy more servers to cope with your peak load (at great expense), or to… I dunno. Crash, I guess. Or go broke. Or lose all your users. Scalability is a pain because you’ve got all manner of bottlenecks; I/O system, memory, and CPU are generally your big evildoers, and building a system that can handle all of those issues at once is like trying to build a building that is both the tallest, widest, and most expensive – complicated, and… expensive.

Ok, the similes suck, I get it. I’ll stop.

Enter The Cloud™. The great thing about The Cloud™ is that it’s all a nebulous, formless void, allowing you to create light and form from nothingness – in this case, servers. Let me explain.

In The Cloud™, you don’t buy a server then upgrade it later. You rent an abstraction of a server from people that (presmuably) have bought thousands of servers, each of which is more expensive than my house, and combined them together into a giant cluster of ultimate power. You rent a little space on this giant cluster and bend it to your will.

One of the domains my company owns is Cricket.com, for which we also have an associated fantasy sports league thing. Now, you’ll be forgiven if you don’t know anything about the game, because god knows I don’t. In fact, the only thing I know about Cricket is that they’re bigger fans of it in India than they are of housing, money, and oxygen combined. People from India are fanatical about the game, and so when the Indian Premier League (of Cricket; like the MLB or NHL or MBA or whatever) started up this year and we were providing the only officially licensed online fantasy sportsbook kind of game, we got some traffic.

Lots. Lots of traffic.

We had previously had the site running on a server at Serverbeach (a Peer1 company), which was fine, but it was underpowered because I had specced it out for what we needed at the time, and not for an entire subcontinent hitting the server all at once, which is what started happening this week. Which sucked. The site is running on Ruby on Rails, using MySQL, memcached, and duct tape, and was running well until the IPL started.

The site started getting pretty slow, and so I thought hey, let’s stick it in the cloud and that’ll make everything better. Brian and I, harnessing the power of the free snacks the company stocks the kitchen with, migrated the server over to a roughly equivalent machine at Mosso, which is where things start to fall apart. Not because of Mosso, per se, but because we were getting dramatically more traffic than we were planning for. Still, the timing was kind of conspicuous, and we needed to update. Upgrading the machine from a 4GB machine to 8GB provided almost no increase in performance whatsoever, and we began to become concerned.

Our first step was to fork services off. Until then, the entire site had been running on one server; we added a second server at Mosso just to run MySQL, and that provided some benefit, but nothing on the scale we needed. Enough to hobble along, but not enough long-term. That was when we decided to evolve in a different direction – horizontally.

Vertical scaling involves growing up – you have a server, you make it bigger. More memory, more CPU, more disk, and so on. Horizontal scaling involves growing *out* – more smaller servers instead of one large server. Building a cluster on Rackspace’s cluster seemed like the next logical step, and that’s what we did.

What we’ve ended up with is a database server, three web servers, and a load balancer in front of them. This solution is more scalable, and also seems to perform better. How much better? Enjoy some graphs!

fantasy.cricket.com would die at around 30 accesses/second

fantasy.cricket.com would die at around 30 accesses/second

This graph shows the apache ‘accesses per second’ for our old (8GB) server. As you can see load would rise to about 30 requests per second, after which point the server would be so overloaded that we could no longer get statistics out of it. We would still be serving pages, but only about 30 of them per second, and they would be slow – anywhere from 5-30 seconds per page load. Unacceptable. We would only start to get more data back from the server once load dropped below 30 accesses per second. This was obviously unacceptable.

One of our current servers, showing a marked increase in capacity

One of our current servers, showing a marked increase in capacity

This graph shows the same data as above, but for one of our cluster’s web servers. The other servers show nearly identical usage, so the real numbers are three times what you see here. You can see that during peak hours, we average around 15-18 accesses per second on this server. The statistics for the other two servers are similar, giving us around 45-55 accesses/sec total, and with very little latency. Whereas before we were dying at 30 accesses/sec, we’re currently averaging that throughout the whole day.

Think about that. We’re using the same software and configuration, using the same amount of total memory, and paying about the same, but we can now handle half again the load as before, and we’re spiking to three times our previous maximum load.

The moral of the story? If you’re going to move into the cloud, especially somewhere like Mosso, you have to be willing to re-think your architecture. The old plan of ‘bigger servers’ just doesn’t work – or at least, it didn’t for us – but now, for about the same price, we can handle up to three times our previous maximum load without any trouble at all.

We won’t know what our maximum is until we hit it, but I have a sneaking suspicion that it’ll be a while before that happens, if at all. Until then. I’m going to enjoy sleeping through the night without Pingdom whispering sweet nothings into my phone at 3 AM.