
Habrahabr перегрузка! (Habrahabr overload!)

Russian site habr's Load Impact!

Yesterday, the Russian site habrahabr.ru published an article warning people about the Habr effect (see Slashdot effect) and suggesting that it is prudent to use Load Impact to load test your site before you get swamped by traffic because some popular blog or news site (like habrahabr.ru) has published an article about you.

This, of course, caused us to get habr'd!

We found that our system was suddenly struggling to keep up, and even though we could see that we had a big traffic spike, we didn't at first understand why the machines were having such a hard time. This is what our concurrent (simultaneous) visitor graph for the past week looks like:

Now let's see, is it possible to determine when the habrahabr article was published? Tricky.

As can be seen, our average traffic this past week has been about 30 concurrent visitors, with a max of around 45 users on the site at the same time. When Habrahabr published the article we suddenly got close to 200 concurrent visitors.

Now, our system is designed to handle more visitors than that. We have had some 300 or so concurrent visitors in the past, when articles were published about us, and that did not cause any big problems for our servers. Yesterday, everything slowed to a crawl, which was very strange.

It all turned out to be due to a malfunction in our test queueing system. As each load test can require quite a lot of system resources to run, we have a queueing system that makes sure we don't try to run too many load tests at the same time. Normally, we will allow only about a dozen load tests to run concurrently. But as it turned out, the queueing system was malfunctioning and let visitors run as many load tests as they pleased. Under normal traffic conditions we didn't notice the problem, but when 200 habrahabr.ru visitors all started load tests at the same time, our system suddenly got quite busy.
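To make the idea concrete, here is a minimal sketch of such an admission queue. It is an illustration only, not Load Impact's actual code: the class name `LoadTestQueue`, its methods, and the test IDs are hypothetical, and the cap of 12 simply stands in for "about a dozen".

```python
from collections import deque


class LoadTestQueue:
    """Admits at most `max_concurrent` tests; the rest wait in FIFO order."""

    def __init__(self, max_concurrent=12):
        # Validate the limit up front so a bad value fails loudly
        # instead of silently disabling the cap.
        if not isinstance(max_concurrent, int) or max_concurrent < 1:
            raise ValueError("max_concurrent must be a positive integer")
        self.max_concurrent = max_concurrent
        self.running = set()
        self.waiting = deque()

    def submit(self, test_id):
        """Start the test if a slot is free, otherwise queue it."""
        if len(self.running) < self.max_concurrent:
            self.running.add(test_id)
            return "running"
        self.waiting.append(test_id)
        return "queued"

    def finish(self, test_id):
        """Release a slot and promote the oldest waiting test, if any."""
        self.running.discard(test_id)
        if self.waiting and len(self.running) < self.max_concurrent:
            next_id = self.waiting.popleft()
            self.running.add(next_id)
            return next_id
        return None


# Simulate a burst of 200 visitors each starting a test at once.
queue = LoadTestQueue(max_concurrent=12)
states = [queue.submit(f"test-{i}") for i in range(200)]
print(states.count("running"), "running,", states.count("queued"), "queued")
```

With a working cap of 12, a burst of 200 submissions leaves 12 tests running and 188 waiting. Without the cap, all 200 would run at once.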

At one point there were 180 load tests running at the same time. We were load testing close to 200 sites at once! (Must be some kind of new record.)

Luckily, practically all of these were small (free) tests, and our load generator nodes were actually almost idling despite the excessive number of tests running. The loadimpact.com website, however, had problems. The database in particular struggled to keep up with all the writes caused by test results flowing in from so many concurrent load tests.

This situation went on from about 2 pm to 4 pm Moscow time (no, we're not Russian, but most of the visitors from habrahabr.ru are). Then we found and fixed the problem with the queueing system, which brought the number of running tests back down to normal levels. So to any of you out there who tried to use Load Impact or run a load test yesterday between 2 and 4 (noon and 2 pm UTC, or early morning in the US), please excuse us and please try again!

Another update

Some people seem to have misunderstood the numbers and what actually happened. I'll try to describe it in other words.

What we initially thought was that we just had a website visitor spike of about 200 concurrent visitors (don't confuse this with visitors per hour, HTTP requests, or visitors per day - see this article for an explanation). We couldn't understand why our system was so slow when it was designed for up to 300 or maybe even 400 concurrent visitors (10-20x our normal traffic).
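For readers who want to relate concurrent visitors to visitors per hour, here is a rough sketch based on Little's law (concurrency = arrival rate x average time on site). The 3-minute average visit length is purely an assumption for illustration, not a figure from our logs.

```python
def visitors_per_hour(concurrent_visitors, avg_visit_seconds):
    """Little's law rearranged: arrival rate = concurrency / time on site."""
    return concurrent_visitors * 3600 / avg_visit_seconds


# Assumption: an average visit lasts 3 minutes (hypothetical value).
AVG_VISIT_SECONDS = 180

for concurrent in (30, 200, 300):
    rate = visitors_per_hour(concurrent, AVG_VISIT_SECONDS)
    print(f"{concurrent} concurrent visitors ~ {rate:.0f} visitors per hour")
```

Under that assumption, 200 concurrent visitors correspond to about 4,000 visitors per hour. A shorter or longer average visit changes the hourly figure proportionally.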

As it turned out, it wasn't the number of visitors on our website that caused our system to slow down. It was the number of load tests we were running.

People can run free load tests from our start page, and each free load test we execute means that we start up to 50 concurrent simulated users that access an external website that is to be load tested. Those 50 simulated users might load thousands of objects/resources from the external site, and the load test continuously updates our master database with information about how fast different objects on the external site are delivered to the simulated users. We can get hundreds of such load time results per second from a single load test, all of which go into the database.

Normally, we allow about a dozen concurrent load tests, but in this case a software bug made it possible to start an unlimited number of load tests. As most of the visitors were web developers interested in load testing, most of them started a free load test for their site. This meant that at one point we had about 200 load tests running at the same time, probably generating tens of thousands of database updates per second. This was more than our database server could comfortably handle.
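The "tens of thousands" figure is a back-of-the-envelope estimate, which can be sketched as follows. The per-test rate of 100 results per second is an assumption taken as the low end of "hundreds", not a measurement, and the variable names are made up for this example.

```python
CONCURRENT_TESTS_NORMAL = 12        # "about a dozen" (from the text above)
CONCURRENT_TESTS_INCIDENT = 200     # "about 200" (from the text above)
SIMULATED_USERS_PER_FREE_TEST = 50  # upper bound per free test
RESULTS_PER_SECOND_PER_TEST = 100   # assumption: low end of "hundreds"


def db_writes_per_second(tests, per_test=RESULTS_PER_SECOND_PER_TEST):
    """Estimated database updates per second for a number of running tests."""
    return tests * per_test


normal = db_writes_per_second(CONCURRENT_TESTS_NORMAL)
incident = db_writes_per_second(CONCURRENT_TESTS_INCIDENT)
simulated_users = CONCURRENT_TESTS_INCIDENT * SIMULATED_USERS_PER_FREE_TEST

print(f"normal:   ~{normal} writes/s")
print(f"incident: ~{incident} writes/s ({incident / normal:.1f}x normal)")
print(f"simulated users during the incident: up to {simulated_users}")
```

Even at the low end of the assumed per-test rate, 200 tests come to roughly 20,000 writes per second, against about 1,200 when the cap is working.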

Like everyone else, we have to judge what performance levels our systems should be able to handle, and build things accordingly. We try to make sure our system can handle at least 10 times the normal average traffic, which is usually enough to ride out a Habr or Slashdot effect. In this case, though, a silly little bug killed one of our most basic performance-protecting features, which kind of put a spanner in the works, so to speak.


37 Responses to Habrahabr перегрузка!

  1. 14 sunnybear 2009-12-04 10:13

    I'm sorry for this - I thought you were more stable :)

  2. 15 SSoft 2009-12-04 11:12

    Habr is not just a news site. Habr is a site for web developers. So it was not a surprise that you had so many load tests :)

  3. 16 Fatality 2009-12-04 11:12

    Sorry guys and thanks for this great site, lol

  4. 17 Ragnar 2009-12-04 11:28

    Normally, we would be :-)

    It's probably not a good strategy to go public with this, seeing as a load testing service should know to load test their own application! But we try to be open about things and besides, it was kind of funny.

    It is important to distinguish between the web server frontend and the load generator backend though. While the frontend was heavily loaded, the load generator backend was idling despite the large number of tests running.

    The only reason the frontend got in trouble was because of a tiny bug (an integer conversion that failed) in the frontend code. The system is pretty well-performing otherwise, and right now while I'm writing this we're getting habr'd again (even more than last time) with no ill effects so far.

    So, I guess we should thank you for load testing us!

    Regards,

    /Ragnar

  5. 18 Denis 2009-12-04 11:41

    Habrahabr rocks! Well done, Russians!

  6. 19 Mmka 2009-12-04 11:45

    [quote]This situation went on between about 2 pm (Russian time) and 4 pm[/quote] We have more than 5 "Russian times" in our country.

  7. 20 Tyrel 2009-12-04 11:54

    Mmka, actually, we have 11 of them.

  8. 21 LEXA 2009-12-04 12:02

    The loadimpact.com had been tested by habr!:-)

    PS Now - A new article about loadimpact has been published on habr - http://habrahabr.ru/tag/loadimpact/

  9. 22 Sergie 2009-12-04 12:10

    Sorry for that, guys, but it's a common thing. There are just a few websites that can handle the Habr effect well.

  10. 23 Pupso 2009-12-04 12:31

    Sorry, guys)))
    On the other hand, you got free stress testing of your service)

  11. 24 pxx 2009-12-04 12:31

    This is Habr, baby. ;)

  12. 25 bazilio_lg 2009-12-04 12:41

    Habrahabr is crap!

  13. 26 mind 2009-12-04 12:46

    You're crap!

  14. 27 Fintez 2009-12-04 12:51

    So everything is crap, it turns out :)

  15. 28 Anton Napolsky 2009-12-04 12:58

    Is life crap? :)

  16. 29 plandem 2009-12-04 13:17

    I will try to test my future sites via Habr too :)

  17. 30 Stepan 2009-12-04 13:57

    Greetings from Russia, come visit us. ;)

  18. 31 Kuroki Kaze 2009-12-04 14:18

    "LoadImpact: approved by Habrahabr.ru" :)

    Anyway, thanks for a useful service :) In the end it seems that Habr helped you get rid of quite a tricky bug :)

  19. 32 BuG_4F 2009-12-04 15:33

    Taking this opportunity, I'd like to say hi to Vasya!

  20. 33 Galleas 2009-12-04 15:47

    Yo-ho-ho! Habr rocks!

  21. 34 Чайник 2009-12-04 16:43

    Greetings from Habr :)

  22. 35 Installero 2009-12-04 16:43

    Quoting the last words of the pilot of the Tu-154 on Flight 352, which crashed while approaching Irkutsk: "Oh, that's it, we're fucked."

  23. 36 lmi 2009-12-04 19:34

    Greetings to everyone from Lipetsk! http://www.lmi48.ru/

  24. 37 lmi 2009-12-04 19:36

    We're not giving up, despite the crisis!

  25. 38 Xaxaxi 2009-12-04 20:03

    Hello world =)

  26. 39 Orest 2009-12-05 12:10

    Damned Muscovites

    I'd kill them all

  27. 40 Marco 2009-12-07 07:47

    Oh, now the Ukrainian flame war is going to kick off))

  28. 41 Сусанин 2009-12-07 08:59

    Orest, all the best to you too :)

  29. 42 bardak 2009-12-07 15:50

    1. Live a century, learn a century, and you'll still die a fool
    2. Even an old hand slips up sometimes
    3. etc.

  30. 43 motorio 2009-12-08 20:30

    Well done overall, though. LoadImpact held up.

