The Official Scalr blog software to auto-scale the world's websites

2 Aug 2011

Scalr.net IP pool changes

One of the Scalr.net servers has had its IP address changed. If you have a tight security policy and have closed ports that are open by default, please update your security groups or other whitelists to let the following IP address communicate via ports 8013 and 8014:

Add: 184.173.242.34
Remove: 174.123.172.202

As you know, each instance in your Scalr account runs a little agent that communicates over ports 8013 and 8014 with the Scalr.net servers (at the listed IP addresses) to share load data. By not closing those ports, you let Scalr continue to operate.
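The change above can be sketched in a few lines of Python. This is only an illustration: the helper names and the set-based whitelist model are assumptions made for the example, not part of any Scalr or AWS API, and you would translate the resulting rules into whatever your firewall or security group tooling expects.

```python
# Hypothetical helpers: names and the set-based whitelist model are
# illustrative assumptions, not a Scalr or AWS API.
SCALR_AGENT_PORTS = (8013, 8014)  # ports named in this announcement


def update_whitelist(allowed_ips, add, remove):
    """Return a new set of allowed IPs with `add` included and `remove` dropped."""
    return (set(allowed_ips) | set(add)) - set(remove)


def rules_for(allowed_ips, ports=SCALR_AGENT_PORTS):
    """Expand the whitelist into sorted (ip, port) pairs, one per inbound rule."""
    return sorted((ip, port) for ip in allowed_ips for port in ports)


current = {"174.123.172.202"}  # example starting whitelist; yours will differ
updated = update_whitelist(
    current, add={"184.173.242.34"}, remove={"174.123.172.202"}
)
for ip, port in rules_for(updated):
    print(f"allow {ip}/32 on port {port}")
```

Building a new set rather than editing in place makes it easy to compare the old and new whitelists before applying anything.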

More details in this wiki entry.

Filed under: Announcements 3 Comments
7 Jul 2011

Windows 2008: Scalr edition!

I think we're all familiar with Windows' quasi-infinite number of editions, from the Home edition to Professional edition, Starter edition, and Ultimate edition. It seems like Microsoft left out a very important edition though. That's where we come in.

We're proud to announce Windows 2008 - Scalable edition! Windows is currently one of the best platforms for transcoding (best codecs, notably), but also for .NET applications and more. And after months of hard work, we've made our guest agent (the scalarizr) work on Windows.

A Windows 2008 'base' role is now available in all EC2 regions. You can install software on it and use it as a component of your farms.

Next up will be Apache and MySQL roles for Windows, so you can use the OS for every component in your stack. :-)

Cheers,
Sebastian

Filed under: Announcements No Comments
23 Jun 2011

New IP in address pool

We have a new IP in our address pool: 174.37.32.18

If you use a whitelist or security groups to restrict access to your instances, you'll need to add this address to it. We'll start using it in 2 weeks to deliver messages and collect statistics from your instances.

Here's a full list of IP addresses Scalr.net uses.

Filed under: Announcements 5 Comments
1 Jun 2011

Beta testers wanted for secret project

Hey everyone,

We've just finished internal testing of a secret project, and we're super excited about it (at least I am!). We'd like to get feedback on it, so if you're interested in participating in a beta, please email us at beta@scalr.com and we'll get you hooked up.

Cheers,
Sebastian

Filed under: Feedback No Comments
11 May 2011

Using Scalr to avoid future Amazon problems: Surviving Region outages

Amazon's recent EBS outage has once again shown how important it is to architect your application to tolerate failure. This post is the second in a series focused on giving you tips to have Scalr handle failure tolerance for you.

How to survive a Region outage

We previously saw how Scalr could help you survive an Availability Zone going down, which boiled down to sticking to defaults so Scalr can take care of it for you. We also defined the concepts of Failure Tolerance and of Failure Recovery, the former maintaining availability through failure, the latter maintaining your ability to recover from it.

Most often, downtime is caused by human errors

Surviving a Region outage is a different beast altogether, as you face larger problems:

  • Network speeds: data might take too long to travel from your web server to your database (latency), which slows page loads, or your databases might not replicate data fast enough (throughput), which leaves recent data missing on the replica and therefore inconsistent
  • Increased cost of maintaining two running copies of your website, especially at the low end (as your traffic grows and you have enough for each copy, this issue disappears)
  • EBS limitations: volumes are limited to Availability Zones (AZs) and snapshots to Regions, so you can't use the same inexpensive magic to create volumes in another Region

But first, a little theory...

If a Region is going to be unavailable, you'll want to make sure you still have access to the three tenets of your infrastructure: your application, your configuration, and your data. Machine Images (images) store both configuration and application, and sometimes data, but you need to be careful and aware that it might have been a while since your last snapshot, and your images might have changed a lot since. If you are using images to store application, configuration, and data, make sure you snapshot your instances after every application update and configuration change, and snapshot them regularly if they contain data (such as database data and user contributed data, like blog post images).

Of course, there are drawbacks to this method, the first of which is the tedium of creating images for all the different Regions you want to operate in, and the second of which is the difficulty of keeping them all in sync and identical. Because of this, it is better practice (at scale) to separate the tiers, like you would separate load balancers, caching servers, web servers, and database servers: keep your application in a code repository, your configuration in Chef recipes or Puppet modules, and your data on some persistent store like S3. You can then use these to recreate your infrastructure at a moment's notice.

You'll have to decide how to manage app, config, and data

Still following? Good! Here comes the juicy part.

Region Failure Recovery

The easiest way to survive Region outages is to keep a database slave in a separate Region (used to continuously replicate data from your main database), and make sure your images are available in both Regions. Let's say you operate out of us-east, and keep your load balancer, web servers, etc. there. You read this article, and diligently set up a slave in us-west. Now us-east goes down, uh-oh. Well, not so uh-oh since you have your application, configuration, and data available in us-west. Edit the us-west farm that contains the slave, and add all the components that your application requires (or change the max-instances value to something >0 for each role if you did so already). Now update your DNS zone in Scalr to point to the new load balancer, and Scalr will update your A records so traffic goes to your new infrastructure.

You have successfully recovered from failure, and your site is up and running!

 

Success! You recovered from a Region failure!

 

Disadvantages of this method? It's fairly manual, and it can be expensive to keep a spare instance running just for backup.

Doing it on the Cheap

Could you get away with a micro instance as the slave? Depending on the write rate, applying binary logs to actual data can be pretty disk and CPU intensive, and the micro instance might start falling behind (increasing value of Seconds_Behind_Master). If that happens and disaster strikes, you'll be missing some data that you can only recover manually, and only when the offline Region is available again. It's up to you to decide where you stand in the cost vs data-completeness tradeoff.
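If you do try a small slave, it's worth watching the lag. Here is a minimal sketch of such a check, assuming you have already fetched one row of MySQL's SHOW SLAVE STATUS output as a dict with whatever client library you use. The function name and the 300-second threshold are arbitrary choices for the example, not Scalr recommendations.

```python
# Assumption: `status` is one row of SHOW SLAVE STATUS, fetched as a dict.
# The threshold is an arbitrary example value.
MAX_LAG_SECONDS = 300


def slave_is_keeping_up(status, max_lag=MAX_LAG_SECONDS):
    """Return True if the slave is replicating and its lag is within `max_lag`."""
    lag = status.get("Seconds_Behind_Master")
    if lag is None:  # MySQL reports NULL when replication is not running
        return False
    return int(lag) <= max_lag


# Example rows; real ones contain many more columns.
print(slave_is_keeping_up({"Seconds_Behind_Master": 12}))
print(slave_is_keeping_up({"Seconds_Behind_Master": 4000}))
print(slave_is_keeping_up({"Seconds_Behind_Master": None}))
```

Treating a missing lag value as a failure is deliberate: a slave that has stopped replicating is in worse shape than one that is merely behind.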

Could you get away with a less manual recovery? Unfortunately, automatic Region outage detection is very complicated, and automatic assessment of amplitude and duration is even harder. This results in a significant chance of false positives which, combined with master-slave replication not being easily reversible, makes it safer to keep a manual process.

Region Fault Tolerance

What about Failure Tolerance? If you want to go all-out and not care about Regions going up and down, full-blown master-master replication is a good option. Create two (or more) farms that include MySQL instances (choose big fat ones, like the 32GB ones), and configure replication between them: this is known as master-master replication. Then let each replicate separately to its slaves in the usual Scalr manner. Remember to use MySQL's key offsets to avoid running into primary key collisions. If you have two masters, you can set the first to only create even primary keys and the second to only create odd keys.
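In MySQL, the key offsets are controlled by the auto_increment_increment and auto_increment_offset server variables. The Python below is a simplified model of the ids each master would generate with those settings, not a configuration script. The helper name is made up for the example.

```python
from itertools import count, islice


def key_sequence(offset, increment):
    """Model the ids a master hands out: offset, offset + increment, ..."""
    return count(start=offset, step=increment)


masters = 2  # auto_increment_increment is set to the number of masters

# First master: auto_increment_offset = 2, so it only creates even keys.
first = list(islice(key_sequence(offset=2, increment=masters), 5))
# Second master: auto_increment_offset = 1, so it only creates odd keys.
second = list(islice(key_sequence(offset=1, increment=masters), 5))

print(first)   # [2, 4, 6, 8, 10]
print(second)  # [1, 3, 5, 7, 9]
print(set(first) & set(second))  # set(), the two sequences never collide
```

With three masters you would use an increment of 3 and offsets 1, 2, and 3, and the same reasoning holds.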

Worst case scenario

If a Region becomes unavailable and you forgot to prepare for the eventuality, you can always create a new farm and load data from the last backup made for you. When the Region comes back up you can reconcile the differences.

How Scalr helps

  • Scalr updates your DNS zone to make manual switchover painless
  • Scalr automatically backs up your mysql data
  • Scalr auto-scales capacity to accommodate redirected traffic
  • plus everything from the previous post

What Scalr is working on to make this easier

Monitoring and alerting. Starting with the next release, we'll allow you to set up monitors and alerts so you can get notified when Scalr adds or removes capacity for you, but also when bad things happen, such as all instances in a Region becoming inaccessible (a sign of a Region outage). If we get around to it, we'll also add some aggregate intelligence, so you can compare your infrastructure to the aggregate (is it me or is it every Scalr user?).

Different datastores. We're adding MongoDB, restoring Memcache, and continuing to work on Cassandra to give you more options for storing and querying your data, and being able to access it despite outages.

Regular snapshotting of instances. We advise against this, as replacing instances automatically can result in lost data, but creating snapshots without replacement (backups essentially) can be useful. Looking into it.

Easier set up of replication. We're looking into making it easier to set up slave replication on servers that are not part of the same farm, for that lone slave server in another Region.

Master-Master. This has been asked many times now, but it has a tendency to be brittle and we fear the costs of supporting it.

Farm cloning. We're adding the ability to clone a farm so you can deploy copies of it in different Regions. These clones will be complete with data and configuration, so Dev/Test is a natural fit too.

Next in this series: How to survive a Cloud outage, and How to survive degraded functionality

Filed under: Tips 2 Comments
10 May 2011

Deprecating Scalr’s ami-scripts agent

About a year ago, we made a significant architectural change to Scalr, and rewrote about 15k lines of code in the process. This rewrite was required to support multiple Clouds like EC2 and Rackspace Cloud, but also to take infrastructure management to new levels of convenience and automation. Part of this rewrite involved creating a new guest agent to replace the kludge of bash scripts we hacked together when we started the project.

We've now hit a spot where we can't make the old agent, scalr-ami-scripts, support the cool new features (like Monitoring and Alerting!) coming out, or even some of the ones we released in 2.1 or 2.2. So to get them, you'll have to upgrade to the more secure and faster scalarizr.

We wrote up some documentation to guide you through the process here: http://wiki.scalr.net/Tutorials/AMI_Scripts_to_Scalarizr_transition

This old agent will continue to be supported for the next 3 months, after which, if you haven't upgraded, we'll send pirate-ninja-cyborg-jesus after you:

100% badass

Filed under: Announcements No Comments
29 Apr 2011

Roadtrip across America

Wow! It's been a busy month of April!

I wrote a while ago that we wanted to go on a road trip to meet our users and customers. Well guess what? We did it!

We landed in New York City early April, in the evening, took the City's very efficient public transportation system to the fashion district in Manhattan, and spent the night in a nice hotel after blowing our entire recent round of financing on a meal. For three of us, it was our first time in New York so we did a little sight-seeing, including Central Park and the beautifully architected Apple Store which was designed by Bohlin Cywinski Jackson, and the Guggenheim museum which has a jaw-dropping collection of modern art ranging from Kandinsky to Braque. After New York we visited Boston, with its Harvard and MIT campuses (if you're a fan of Frank Gehry's work, check out MIT's Stata Center), then headed towards Washington DC.

We met clients in each of these cities, and learned quite a lot from our discussions with them, but even more from observing them use Scalr. I can now say that user testing rocks! We'll have a separate post on this later; meanwhile, here are a few pictures we took along the way:

After Washington DC, we set our GPS destination to Miami, with a waypoint in Atlanta to meet a client there. It was a rather long drive, and we only stopped for food and gas. It wasn't as much fun as we hoped it would be, so we kept it interesting by brainstorming new ideas to make Scalr the best tool for scaling web applications (and when we got bored of that we talked about movies, technology, and sometimes cultural differences between the countries we've lived in). We didn't stay long in Atlanta - perhaps all of 2 hours - then continued to drive down to Florida, where we didn't stay long either.

After meeting with a long-time customer there, we proceeded to the longest drive I've ever done: Miami to San Francisco, going through Dallas, Santa Fe, and Las Vegas. We took shifts driving and sleeping, but even with that it was very tiring. We didn't have any meetings with clients along the way, so we wanted to get to the other coast as soon as possible. An evening was spent in Dallas, where we unanimously voted to go to a steakhouse (the Silver Fox, it was awesome), before driving through the night to Santa Fe (arrived in the morning). I have to add here that Santa Fe is a beautiful city, and we spent half a day there visiting the art galleries on Canyon road.

We then drove to the Grand Canyon, where we took these beautiful pictures from Antelope Canyon, then drove to Las Vegas (stayed at the Bellagio!), and finally Los Angeles and San Francisco...

... For a total of 5300 miles driven in 10 days. Let me say that after that, when we closed our eyes, we could still see things coming at us!

That's the story of our roadtrip! Now if you want to know what we learned from the trip, check out Roadtrip: Collecting Feedback.

Filed under: Roadtrip No Comments
29 Apr 2011

Come meet the Scalr team in Mountain View

Hey everyone,

The Scalr team will be present at tomorrow's Cloud Computing workshop, "Talk Cloudy to Me!", organized by the Silicon Valley Cloud Computing Group. It's free to attend, so come along to meet us!

Agenda: http://talkcloudy2011.sched.org/
Registration: http://www.meetup.com/cloudcomputing/events/16701362/

The event will also be recorded and streamed, so you can watch it remotely.

See you there, online or not!

Cheers,
The Scalr team

Filed under: Events No Comments
27 Apr 2011

New MySQL roles available - with Percona!

Those of you familiar with mysqlperformanceblog.com probably know about Percona already, but if you don't, you might want to subscribe to the feed. Peter Zaitsev blogs there, and he (along with Morgan Tocker) writes frequently on how to squeeze efficiency out of your MySQL servers, minimize downtime, and more.

There are a couple of reasons to prefer Percona Server over MySQL. I'd like to note that it's a drop-in replacement, fully compatible with MySQL, and that it is entirely free of charge and under an open source license.

  • Percona Server uses the XtraDB engine, an enhancement to InnoDB, which runs queries faster and with more execution-time consistency, and includes some tools to reduce the guesswork when troubleshooting errors. It also reduces downtime on servers with slow disks and large memory, such as 4XL EC2 servers on EBS volumes.
  • There's also the XtraBackup tool, which is hot backup software that performs non-blocking backups. With it, backups complete quickly and reliably, don't interrupt transaction processing, and save disk space + network bandwidth.

Sound good? We've got good news for you then: Percona Server is now available for Scalr!

Percona Server now available on Scalr

To get access to it, simply go to the Role Builder under Roles in the top menu, choose the Cloud and OS you want to run in, select MySQL from the list of software to install, then choose Percona Server from the drop down and create. Easy!

Adding Percona Server for CentOS

You'll end up with a brand new Role that you can use in your Server Farms.

Brand new 64bit Percona server

We're all fans of Percona here at Scalr, and suggest you take the time to try it out. Commercial support and consulting are available from them too.

Filed under: Feature No Comments
25 Apr 2011

Using Scalr to avoid future Amazon problems: Surviving AZ outages

Amazon's recent EBS outage has once again shown how important it is to architect your application to tolerate failure. This post is the first in a series focused on giving you tips to have Scalr handle failure tolerance for you.

Failure recovery & Failure tolerance

There are two concepts I'd like to define: failure recovery, and failure tolerance. Failure recovery is when a disaster can take your application down, but the app can nevertheless recover from it. Failure tolerance is when a disaster disrupts service, but does not take the application down.

Failure tolerance is more expensive than failure recovery, since you must have redundant servers running. If your application is important enough to you, you might prefer failure tolerance over failure recovery.

That said, how can you get Scalr to handle failure for you?

How to survive an AZ outage

Provided you follow a few best practices, it's very easy to have Scalr make your site tolerate failure or recover from it.

  • First, make sure your user-uploaded content (uploaded images for a blog post, pdfs attached to a wiki) is stored on persistent storage like S3 or Cloud Files. We all know that storage on instances is ephemeral, so this is not only a best practice but the only working practice. As an alternative, you can rsync these files between servers, or use software like Gluster.
  • Second, you should leave Scalr's placement at its default, "AWS-chosen", or choose "Distribute equally". With these choices, Scalr will be able to launch instances in another AZ for you should one or more fail. If you set your load balancer and application / web servers to this, you'll continue serving pages through failure.
  • Same applies to your database: select "AWS-chosen" or "Distribute equally". With these, you'll have mysql slave servers in AZs other than the one your master is in, so in the event the AZ that contains your master goes down, we'll be able to promote one of the running slaves to become the new master.

Say you have a load balancer, web server, and database server (with EBS volume) in availability zone A. If A goes down, we'll spin up three similar instances in zone B, take the latest backup snapshot that we made for you of the EBS volume, and mount it on the new database server so you have your data again. Once A comes back online, you can then recover the data written between the last backup and the outage.
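To picture what "Distribute equally" and relaunching in another zone mean in practice, here is a rough model in Python. It is a sketch under assumptions: simple round-robin assignment stands in for whatever Scalr actually does internally, and the function names, instance names, and zone names are invented for the example.

```python
from collections import Counter
from itertools import cycle


def distribute_equally(instance_names, zones):
    """Assign instances to zones round-robin so counts differ by at most one."""
    return dict(zip(instance_names, cycle(zones)))


def replace_failed_zone(placement, failed_zone, zones):
    """Re-place instances from `failed_zone` onto the surviving zones."""
    survivors = [z for z in zones if z != failed_zone]
    if not survivors:
        raise ValueError("no surviving zones to place instances in")
    spare = cycle(survivors)
    return {
        name: (next(spare) if zone == failed_zone else zone)
        for name, zone in placement.items()
    }


zones = ["us-east-1a", "us-east-1b"]  # example zone names
placement = distribute_equally(["web-1", "web-2", "web-3", "web-4"], zones)
after_outage = replace_failed_zone(placement, "us-east-1a", zones)

print(Counter(placement.values()))     # two instances in each zone
print(Counter(after_outage.values()))  # all four in us-east-1b
```

Because half the instances were already outside the failed zone, the site keeps serving pages while replacements come up.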

How Scalr helps

  • Scalr automatically creates volumes from recent snapshots, and mounts them on your database
  • Scalr automatically promotes slave databases to masters
  • Scalr automatically updates the database endpoints so your application doesn't read/write data to a dead IP
  • Scalr automatically launches instances in other AZs  to scale with the increased traffic on remaining instances
  • Scalr automatically updates the load balancer to stop forwarding traffic to the dead web servers
  • Scalr automatically distributes your instances across AZs

What Scalr is working on to make this easier

To make this even easier, we changed the defaults for MySQL to take snapshots automatically every 24 hours, rotate them 10 times (which means we discard the 11th), and run a backup every 12 hours. The new Scalarizr agent also lets you run MySQL instances across multiple AZs.
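The rotation rule is easy to state in code. The sketch below models it with plain lists. The function name and snapshot labels are made up for illustration, and it says nothing about how Scalr stores or deletes real EBS snapshots.

```python
KEEP = 10  # rotation depth stated above


def rotate(snapshots, new_snapshot, keep=KEEP):
    """Append `new_snapshot` and return (kept, discarded), oldest first."""
    combined = list(snapshots) + [new_snapshot]
    excess = max(len(combined) - keep, 0)
    return combined[excess:], combined[:excess]


snapshots = []
for day in range(1, 12):  # eleven daily snapshots
    snapshots, discarded = rotate(snapshots, f"snap-day-{day}")

print(len(snapshots))  # 10
print(discarded)       # ['snap-day-1'], the oldest one once an 11th exists
```

With one snapshot every 24 hours and a depth of 10, this gives you roughly ten days of restore points.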

We also renamed "Choose Randomly" to the more descriptive "AWS-chosen", and "Place in different zones" to "Distribute equally".

Finally, we changed the default placement for images to be "Distribute equally".

Next in this series: How to survive a Region outage, How to survive a Cloud outage, and How to survive degraded functionality

Filed under: Tips 4 Comments