Russian site habr's Load Impact!
Yesterday, the Russian site habrahabr.ru published an article warning people about the habr effect (see slashdot effect) and suggesting that it would be prudent to use Load Impact to load test your site before getting swamped by traffic when some popular blog or news site (like habrahabr.ru) publishes an article about you.
This, of course, caused us to get habr'd!
We found that our system was suddenly struggling to keep up, and even though we could see that we had a big traffic spike, we didn't at first understand why the machines were having such a hard time. This is what our concurrent (simultaneous) visitor graph for the past week looks like:

Now let's see, is it possible to determine when the habrahabr article was published? Tricky.
As can be seen, our average traffic this past week has been about 30 concurrent visitors, with a max of around 45 users on the site at the same time. When Habrahabr published the article we suddenly got close to 200 concurrent visitors.
Now, our system is designed to handle more visitors than that. We have had some 300 or so concurrent visitors in the past, when articles have been published about us, without it causing much of a problem for our servers. Yesterday, everything slowed down to a crawl, which was very strange.
It all turned out to be due to a malfunction in our test queueing system. As each load test can require quite a lot of system resources to run, we have a queueing system that makes sure we don't try to run too many load tests at the same time. Normally, we allow only about a dozen load tests to run concurrently. But as it turned out, the queueing system was malfunctioning and let visitors run as many load tests as they pleased. Under normal traffic conditions, we didn't notice the problem, but when 200 habrahabr.ru visitors all started load tests at the same time, our system suddenly got quite busy.
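To make the idea concrete, here is a minimal sketch of what such a limiter does. This is not our actual queueing code; the class name `LoadTestQueue`, its methods, and the exact limit of 12 are assumptions made up for illustration.

```python
from collections import deque


class LoadTestQueue:
    """Hypothetical limiter: run at most max_running tests, queue the rest."""

    def __init__(self, max_running=12):  # "about a dozen" is the assumed limit
        self.max_running = max_running
        self.running = set()
        self.waiting = deque()

    def submit(self, test_id):
        """Start the test if a slot is free, otherwise put it in the queue."""
        if len(self.running) < self.max_running:
            self.running.add(test_id)
            return "running"
        self.waiting.append(test_id)
        return "queued"

    def finish(self, test_id):
        """Free a slot and promote the oldest waiting test, if there is one."""
        self.running.discard(test_id)
        if self.waiting and len(self.running) < self.max_running:
            next_id = self.waiting.popleft()
            self.running.add(next_id)
            return next_id
        return None


# 200 visitors all start a test at roughly the same time.
queue = LoadTestQueue(max_running=12)
for i in range(200):
    queue.submit(f"test-{i}")

print(len(queue.running), len(queue.waiting))
```

With a working limiter, only 12 of the 200 tests run and the rest wait their turn. The malfunction effectively removed the check in `submit`, so every test started immediately.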
At one point there were 180 load tests running at the same time; we were load testing close to 200 sites at once! (must be some kind of new record)
Luckily, practically all of these were small (free) tests and our load generator nodes were actually almost idling despite the excessive number of tests running. The loadimpact.com website, however, had problems. The database in particular struggled to keep up with all the writes caused by test results flowing in from so many concurrent load tests.
This situation lasted from about 2 pm to 4 pm Moscow time (no, we're not Russian, but most of the visitors from habrahabr.ru are), at which point we found and fixed the problem with the queueing system, bringing the number of running tests back down to normal levels. So to any of you out there who tried to use Load Impact or run a load test yesterday between 2 and 4 (noon and 2 pm UTC, or early morning in the US), please excuse us and please try again!
Another update
Some people seem to have misunderstood the numbers and what actually happened. I'll try to describe it in other words.
What we initially thought was that we just had a website visitor spike of about 200 concurrent visitors (don't confuse this with visitors per hour, HTTP requests, or visitors per day - see this article for an explanation). We couldn't understand why our system was so slow when it has been designed for up to 300 or maybe even 400 concurrent visitors (10-20x our normal traffic).
As it turned out, it wasn't the number of visitors on our website that caused our system to slow down. It was the number of load tests we were running.
People can run free load tests from our start page, and each free load test we execute means that we start up to 50 concurrent simulated users that access an external website that is to be load tested. Those 50 simulated users might load thousands of objects/resources from the external site, and the load test continuously updates our master database with information about how fast different objects on the external site are delivered to the simulated users. We can get hundreds of such load time results per second from a single load test, all of which go into the database.
Normally, we allow about a dozen concurrent load tests, but in this case a software bug made it possible to start an unlimited number of load tests. As most of the visitors were web developers interested in load testing, most of them started a free load test for their site. This meant that at one point we had about 200 load tests running at the same time, probably generating tens of thousands of database updates per second. This was more than our database server could comfortably handle.
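As a rough back-of-the-envelope check of those numbers, the small sketch below multiplies the number of running tests by an assumed per-test result rate. The figure of 100 results per second per test is an assumption (the low end of "hundreds"), not a measurement, and the function name is made up for illustration.

```python
def db_writes_per_second(concurrent_tests, results_per_test_per_second):
    """Estimate database writes per second from all running load tests."""
    return concurrent_tests * results_per_test_per_second


# Assumption: each test reports about 100 load time results per second.
ASSUMED_RESULTS_PER_TEST = 100

normal = db_writes_per_second(12, ASSUMED_RESULTS_PER_TEST)     # about a dozen tests
incident = db_writes_per_second(200, ASSUMED_RESULTS_PER_TEST)  # about 200 tests

print(normal, incident, round(incident / normal, 1))
```

Under that assumption the database goes from roughly a thousand writes per second to roughly twenty thousand, well beyond the 10x headroom we normally plan for.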
Like everyone else, we have to judge what performance levels our systems should be able to handle, and build things accordingly. We try to make sure our system can handle at least 10 times the normal average traffic, which usually makes us able to handle a Habr or Slashdot effect, but in this case a silly little bug killed one of our most basic performance-protecting features, which kind of put a spanner in the works, so to speak.