Experiences deploying a large-scale infrastructure in Amazon EC2
Here is an older post that we’ve come across but find useful…
Post by Grig Gheorghiu
At OpenX we recently completed a large-scale deployment of one of our server farms to Amazon EC2. Here are some lessons learned from that experience.
Expect failures; what’s more, embrace them
Things are bound to fail when you’re dealing with large-scale deployments in any infrastructure setup, but especially when you’re deploying virtual servers ‘in the cloud’, outside of your sphere of influence. You must then be prepared for things to fail. This is a Good Thing, because it forces you to think about failure scenarios upfront, and to design your system infrastructure in a way that minimizes single points of failure.
As an aside, I’ve been very impressed with the reliability of EC2. Like many other people, I didn’t know what to expect, but I’ve been pleasantly surprised. Very rarely does an EC2 instance fail. In fact I haven’t yet seen a total failure, only some instances that were marked as ‘deteriorated’. When this happens, you usually get a heads-up via email, and you have a few days to migrate your instance, or launch a similar one and terminate the defective one.
Expecting things to fail at any time leads to and relies heavily on the next lesson learned, which is…
Fully automate your infrastructure deployments
There’s simply no way around this. When you need to deal with tens or even hundreds of virtual instances, and when you need to scale up and down on demand (after all, this is THE main promise of cloud computing!), you need to fully automate your infrastructure deployment (servers, load balancers, storage, etc.).
The way we achieved this at OpenX was to write our own custom code on top of the EC2 API in order to launch and terminate instances and to create and destroy EBS volumes. We rolled our own AMI, which contains enough bootstrap code to make it ‘call home’ to a set of servers running slack. When we deploy a machine, we specify a list of slack ‘roles’ that the machine belongs to (for example ‘web-server’ or ‘master-db-server’ or ‘slave-db-server’). When the machine boots up, it runs the script that belongs to each of its slack roles. In this script we install everything the machine needs to do its job: prerequisite packages and the actual application with all its necessary configuration files.
I will blog separately about how exactly slack works for us, but let me just say that it is an extremely simple tool. It may seem overly simple, but that’s exactly its strength, since it forces you to be creative with your postinstall scripts. I know that other people use puppet, or fabric, or cfengine. Use whatever works for you, but use SOME tool that helps with automated deployments.
The beauty of fully automating your deployments is that it truly allows you to scale infinitely (for some value of ‘infinity’, of course). It almost goes without saying that your application infrastructure needs to be designed in a way that allows this type of scaling. But having the building blocks necessary for automatically deploying any type of server that you need is invaluable.
Another thing we do which helps with automating various pieces of our infrastructure is that we keep information about our deployed instances in a database. This allows us to write tools that inspect the database and generate various configuration files (such as the all-important role configuration file used by slack), and other text files such as DNS zone files. This database becomes the one true source of information about our infrastructure. The DRY principle applies to system infrastructure, not only to software development.
Speaking of DNS, specifically in the context of Amazon EC2, it’s worth rolling out your own internal DNS servers, with zones that aren’t even registered publicly, but for which your internal DNS servers are authoritative. Then all communication within the EC2 cloud can happen via internal DNS names, as opposed to IP addresses. Trust me, your tired brain will thank you. This would be very hard to achieve though if you were to manually edit BIND zone files. Our approach is to automatically generate those files from the master database I mentioned. Works like a charm. Thanks to Jeff Roberts for coming up with this idea and implementing it.
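To make the database-driven approach concrete, here is a minimal sketch of generating a BIND-style zone file from a table of deployed instances. It is not the code we run at OpenX: the `instances` table, its columns, the zone name `internal.example`, the host names and the IP addresses are all hypothetical, and an in-memory SQLite database stands in for the real master database.

```python
import sqlite3


def build_zone(conn, zone, serial, ttl=300):
    """Render a BIND-style zone file from the (hypothetical) instances table."""
    lines = [
        f"$ORIGIN {zone}.",
        f"$TTL {ttl}",
        # SOA fields: serial, refresh, retry, expire, negative-caching TTL.
        f"@ IN SOA ns1.{zone}. hostmaster.{zone}. ( {serial} 3600 600 86400 {ttl} )",
        f"@ IN NS ns1.{zone}.",
    ]
    rows = conn.execute(
        "SELECT hostname, private_ip FROM instances ORDER BY hostname"
    )
    for hostname, private_ip in rows:
        # One A record per deployed instance.
        lines.append(f"{hostname} IN A {private_ip}")
    return "\n".join(lines) + "\n"


# Stand-in for the master database; schema and rows are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances (hostname TEXT PRIMARY KEY, role TEXT, private_ip TEXT)"
)
conn.executemany(
    "INSERT INTO instances VALUES (?, ?, ?)",
    [
        ("web01", "web-server", "10.0.1.11"),
        ("db01", "master-db-server", "10.0.2.21"),
        ("ns1", "dns-server", "10.0.0.2"),
    ],
)

zone_text = build_zone(conn, "internal.example", serial=2010073101)
print(zone_text)
```

Because the zone file is regenerated from the database every time, adding or terminating an instance only requires updating one row and bumping the serial number. Nobody edits the zone file by hand.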
While we’re on the subject of fully automated deployments, I’d like to throw an idea out there that I first heard from Mike Todd, my boss at OpenX, who is an ex-Googler. One of his goals is for us never to have to ssh into any production server. We deploy the server using slack, the application gets installed automatically, monitoring agents get set up automatically, so there should really be no need to manually do stuff on the server itself. If you want to make a change, you make it in a slack role on the master slack server, and it gets pushed to production. If the server misbehaves or gets out of line with the other servers, you simply terminate that server instance and launch another one. Since you have everything automated, it’s one command line for terminating the instance, and another one for deploying a brand new replacement. It’s really beautiful.
Design your infrastructure so that it scales horizontally
There are generally two ways to scale an infrastructure: vertically, by deploying your application on more powerful servers, and horizontally, by increasing the number of servers that support your application. For ‘infinite’ scaling in a cloud computing environment, you need to design your system infrastructure so that it scales horizontally. Otherwise you’re bound to hit limits of individual servers that you will find very hard to get past. Horizontal scaling also eliminates single points of failure.
Here are a few ideas for deploying a Web site with a database back-end so that it uses multiple tiers, with each tier being able to scale horizontally:
1) Deploy multiple Web servers behind one or more load balancers. This is pretty standard these days, and this tier is the easiest to scale. However, you also want to maximize the work done by each Web server, so you need to find the sweet spot of that particular type of server in terms of httpd processes it can handle. Too few processes and you’re wasting CPU/RAM on the server, too many and you’re overloading the server. You also need to be cognizant of the fact that each EC2 instance costs you money. It can become so easy to launch a new instance that you don’t necessarily think of getting the most out of the existing instances. Don’t go wild unless absolutely necessary if you don’t want to have a sticker shock when you get the bill from Amazon at the end of the month.
2) Deploy multiple load balancers. Amazon doesn’t yet offer load balancers, so what we’ve been doing is using HAProxy-based load balancers. Let’s say you have an HAProxy instance that handles traffic for www.yourdomain.com. If your Web site becomes wildly successful, it is conceivable that a single HAProxy instance will not be able to handle all the incoming network traffic. One easy solution for this, which is also useful for eliminating single points of failure, is to use round-robin DNS, pointing www.yourdomain.com to several IP addresses, with each IP address handled by a separate HAProxy instance. All HAProxy instances can be identical in terms of back-end configuration, so your Web server farm will get 1/N of the overall traffic from each of your N load balancers. It worked really well for us, and the traffic was spread out very uniformly among the HAProxies. You do need to make sure the TTL on the DNS record for www.yourdomain.com is low, so that clients quickly stop resolving to a load balancer that you have removed from the record.
3) Deploy several database servers. If you’re using MySQL, you can set up a master DB server for writes, and multiple slave DB servers for reads. The slave DBs can sit behind an HAProxy load balancer. In this scenario, you’re limited by the capacity of the single master DB server. One thing you can do is to use sharding techniques, meaning you can partition the database into multiple instances that each handle writes for a subset of your application domain. Another thing you can do is to write to local databases deployed on the Web servers, either in memory or on disk, and then periodically write to the master DB server (of course, this assumes that you don’t need that data right away; this technique is useful when you have to generate statistics or reports periodically for example).
4) Another way of dealing with databases is to not use them, or at least to avoid the overhead of making a database call each time you need something from the database. A common technique for this is to use memcache. Your application needs to be aware of memcache, but this is easy to implement in all of the popular programming languages. Once implemented, you can have your Web servers first check a value in memcache, and only if it’s not there have them hit the database. The more memory you give to the memcached process, the better off you are.
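The check-memcache-first pattern from item 4 can be sketched as follows. This is only an illustration: `DictCache` is an in-process stand-in for a memcached client (real clients expose similar get/set calls, usually with an expiry argument), and `fetch_user` and `slow_db_lookup` are hypothetical names.

```python
class DictCache:
    """In-process stand-in for a memcached client (get/set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None on a miss, as memcached clients commonly do.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def fetch_user(cache, db_lookup, user_id):
    """Check the cache first; hit the database only on a miss."""
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is None:
        value = db_lookup(user_id)  # the expensive call we want to avoid
        cache.set(key, value)
    return value


db_calls = []  # records every time the "database" is actually queried


def slow_db_lookup(user_id):
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


cache = DictCache()
first = fetch_user(cache, slow_db_lookup, 42)   # miss: goes to the database
second = fetch_user(cache, slow_db_lookup, 42)  # hit: served from the cache
print(first == second, db_calls)
```

Note that in this sketch a database result of `None` is never cached, so lookups for missing rows always reach the database. Whether that matters depends on your application.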
Establish clear measurable goals
The most common reason for scaling an Internet infrastructure is to handle increased Web traffic. However, you need to keep in mind the quality of the user experience, which means that you need to keep the response time of the pages you serve under a certain limit which will hopefully meet and surpass the user’s expectations. I found it extremely useful to have a very simple script that measures the response time of certain pages and that graphs it inside a dashboard-type page (thanks to Mike Todd for the idea and the implementation). As we deployed more and more servers in order to keep up with the demands of increased traffic, we always kept an eye on our goal: keep response time/latency under N milliseconds (N will vary depending on your application). When we saw spikes in the latency chart, we knew we needed to act at some level of our infrastructure. And this brings me to the next point…
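A response-time check of the kind described above could look roughly like this. It is a sketch, not Mike’s script: the function names, the 500 ms limit and the URL are assumptions, the graphing part is left out, and `fake_fetch` replaces the real HTTP request so the example runs without network access.

```python
import time
import urllib.request


def fetch_url(url, timeout=10):
    """Download the page body; used when measuring real pages."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def measure_ms(url, fetch=fetch_url):
    """Return how long fetch(url) takes, in milliseconds."""
    start = time.perf_counter()
    fetch(url)
    return (time.perf_counter() - start) * 1000.0


def check_pages(urls, limit_ms, fetch=fetch_url):
    """Return (url, elapsed_ms, within_limit) for each page."""
    results = []
    for url in urls:
        elapsed = measure_ms(url, fetch)
        results.append((url, elapsed, elapsed <= limit_ms))
    return results


def fake_fetch(url):
    # Stand-in for a real request: pretend the page takes about 10 ms.
    time.sleep(0.01)


report = check_pages(["http://www.yourdomain.com/"], limit_ms=500, fetch=fake_fetch)
for url, elapsed, ok in report:
    print(f"{url} {elapsed:.1f} ms {'OK' if ok else 'SLOW'}")
```

Run from cron with the default `fetch_url`, the results can be appended to a file or fed to whatever graphing tool backs your dashboard.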
Be prepared to quickly identify and eliminate bottlenecks