Posts tagged cloud servers

Fantasy Cricket: Initial Migration (Part 1)

We’ve recently migrated cricket.com’s Fantasy Cricket application from a single server running at ServerBeach to a multi-server cluster running on Mosso’s Cloud Servers system. In an attempt to document the problems we’ve had and the solutions we’ve found, I’ve decided to write down my thoughts and memories, and the processes we went through, in the hopes that it will help others.

Atomic Migrations

The actual act of copying the data from one server to another is simple. It’s not difficult to take a point-in-time snapshot of a Rails app and move it to another server. The difficult part, when dealing with a dynamic website, is keeping the two servers in sync.

Because of DNS propagation issues, you will continue to have clients hitting the old server while some users start hitting the new server as well. This can cause concurrency issues; if a user is hitting the old server, then they may make modifications to their account (or in our case, modifications to their team) that will not show up on the new site once their DNS changes. This can be a pretty big issue, and we came up with a pretty useful method of dealing with the issue that eliminates concurrency issues with minimal impact on the users.

Testing the Server

The first step was to copy the app over and get it ‘working’. To do this, we used a dump of the MySQL database, using the mysqldump utility included with MySQL. We got the app up and working, tested it, and made sure that it was going to work properly (i.e. all Ruby gems were installed, etc.). Next we took a fresh dump of the database using the –master-data option to mysqldump, which includes the information necessary to set up a replication slave.

MySQL Replication

MySQL provides realtime replication between two servers, and it’s pretty simple to set up. Setting up MySQL replication between the two servers required briefly opening MySQL to the outside world (having it listen on port 3306), but we used iptables firewalling rules to ensure that no one else could connect. Once replication was set up, we knew that the new server would be an exact replica of the old server. The application was working, the database was identical, and so on. Any changes happening on the old server would happen instantly on the new server.

MySQL provides functionality called multi-master replication, which allows two servers to be written to, and have them push their changes back and forth between two servers. While I’ve set this up and used it in the past, it’s more complicated to set up and can be a pain to maintain. If replication fails in one direction, you can get inconsistent results between the two. For this reason, we wanted to ensure all of our traffic would switch over all at once – an atomic migration. Either all traffic hits the new site, or all traffic hits the old site. This is where Apache comes in.

Apache’s mod_proxy

We use Apache to run our websites, including our rails applications (using Passenger’s mod_rails). The solution we came up with was to make use of Apache’s built-in modules to ease our transition. We set up two config files, the original (which we had already) and a new configuration that, instead of serving pages, proxied all traffic to the secondary server. In essence, all traffic to the old site was transparently proxied to the new server, and the responses were proxied back. This is the config snippet we used:

This had two benefits: first of all, it was instantaneous and atomic – all users would see traffic from the same server, either the old one or the new one. We wouldn’t have any frustrating situations where one user would hit the old server and another user would hit the new one. Secondly, it’s quick and easy. If there were problems on the new server that we hadn’t forseen, it would be easy for us to switch everything back to the old server (by swapping out the config files again).

The Changeover

In order to ensure we didn’t run into any collisions with replication, our solution was to stop Apache entirely for a second, then bring it back online with the new config. This ensured that if there was any latency between the master (old) and slave (new) servers, it would catch up in the second that we were down. After we made the change, we checked, and the site was up and running. Users were all logged out, but beyond that the experience was largely smooth. Once we knew the server was working, we changed the DNS to point to the new server, and the changeover had finished. We were live on our new (virtual) server.

In Part 2, I plan to discuss the initial scaling issues we found, as we tried to learn the bounds of Mosso’s virtual servers, and of our own application and server design.

Scaling clouds: more, not bigger

Scalability is a problem. It always has been. The solution in the past has always been either to buy more servers to cope with your peak load (at great expense), or to… I dunno. Crash, I guess. Or go broke. Or lose all your users. Scalability is a pain because you’ve got all manner of bottlenecks; I/O system, memory, and CPU are generally your big evildoers, and building a system that can handle all of those issues at once is like trying to build a building that is both the tallest, widest, and most expensive – complicated, and… expensive.

Ok, the similes suck, I get it. I’ll stop.

Enter The Cloud™. The great thing about The Cloud™ is that it’s all a nebulous, formless void, allowing you to create light and form from nothingness – in this case, servers. Let me explain.

In The Cloud™, you don’t buy a server then upgrade it later. You rent an abstraction of a server from people that (presmuably) have bought thousands of servers, each of which is more expensive than my house, and combined them together into a giant cluster of ultimate power. You rent a little space on this giant cluster and bend it to your will.

One of the domains my company owns is Cricket.com, for which we also have an associated fantasy sports league thing. Now, you’ll be forgiven if you don’t know anything about the game, because god knows I don’t. In fact, the only thing I know about Cricket is that they’re bigger fans of it in India than they are of housing, money, and oxygen combined. People from India are fanatical about the game, and so when the Indian Premier League (of Cricket; like the MLB or NHL or MBA or whatever) started up this year and we were providing the only officially licensed online fantasy sportsbook kind of game, we got some traffic.

Lots. Lots of traffic.

We had previously had the site running on a server at Serverbeach (a Peer1 company), which was fine, but it was underpowered because I had specced it out for what we needed at the time, and not for an entire subcontinent hitting the server all at once, which is what started happening this week. Which sucked. The site is running on Ruby on Rails, using MySQL, memcached, and duct tape, and was running well until the IPL started.

The site started getting pretty slow, and so I thought hey, let’s stick it in the cloud and that’ll make everything better. Brian and I, harnessing the power of the free snacks the company stocks the kitchen with, migrated the server over to a roughly equivalent machine at Mosso, which is where things start to fall apart. Not because of Mosso, per se, but because we were getting dramatically more traffic than we were planning for. Still, the timing was kind of conspicuous, and we needed to update. Upgrading the machine from a 4GB machine to 8GB provided almost no increase in performance whatsoever, and we began to become concerned.

Our first step was to fork services off. Until then, the entire site had been running on one server; we added a second server at Mosso just to run MySQL, and that provided some benefit, but nothing on the scale we needed. Enough to hobble along, but not enough long-term. That was when we decided to evolve in a different direction – horizontally.

Vertical scaling involves growing up – you have a server, you make it bigger. More memory, more CPU, more disk, and so on. Horizontal scaling involves growing *out* – more smaller servers instead of one large server. Building a cluster on Rackspace’s cluster seemed like the next logical step, and that’s what we did.

What we’ve ended up with is a database server, three web servers, and a load balancer in front of them. This solution is more scalable, and also seems to perform better. How much better? Enjoy some graphs!

fantasy.cricket.com would die at around 30 accesses/second

fantasy.cricket.com would die at around 30 accesses/second

This graph shows the apache ‘accesses per second’ for our old (8GB) server. As you can see load would rise to about 30 requests per second, after which point the server would be so overloaded that we could no longer get statistics out of it. We would still be serving pages, but only about 30 of them per second, and they would be slow – anywhere from 5-30 seconds per page load. Unacceptable. We would only start to get more data back from the server once load dropped below 30 accesses per second. This was obviously unacceptable.

One of our current servers, showing a marked increase in capacity

One of our current servers, showing a marked increase in capacity

This graph shows the same data as above, but for one of our cluster’s web servers. The other servers show nearly identical usage, so the real numbers are three times what you see here. You can see that during peak hours, we average around 15-18 accesses per second on this server. The statistics for the other two servers are similar, giving us around 45-55 accesses/sec total, and with very little latency. Whereas before we were dying at 30 accesses/sec, we’re currently averaging that throughout the whole day.

Think about that. We’re using the same software and configuration, using the same amount of total memory, and paying about the same, but we can now handle half again the load as before, and we’re spiking to three times our previous maximum load.

The moral of the story? If you’re going to move into the cloud, especially somewhere like Mosso, you have to be willing to re-think your architecture. The old plan of ‘bigger servers’ just doesn’t work – or at least, it didn’t for us – but now, for about the same price, we can handle up to three times our previous maximum load without any trouble at all.

We won’t know what our maximum is until we hit it, but I have a sneaking suspicion that it’ll be a while before that happens, if at all. Until then. I’m going to enjoy sleeping through the night without Pingdom whispering sweet nothings into my phone at 3 AM.