Blog

Increased quotas

Increased data transfer quotas for Load Impact premium accounts

The data transfer quotas for Load Impact Premium users (Basic, Professional, Advanced) have been increased to allow more testing using the premium accounts.

Quotas exist for two reasons: to prevent abuse and to limit resource usage. Since we have seen few abuse issues with the service, and we have plenty of spare capacity to run tests, we have decided to increase the quotas so that our premium users can get as much testing done as possible.

Previously, these were the data transfer limits:

Load Impact BASIC:

- 40 GB data transfer per 30 days
- 20 GB per target IP per 30 days
- 5 GB per target IP per 24 hours

Load Impact PROFESSIONAL:

- 100 GB data transfer per 30 days
- 50 GB per target IP per 30 days
- 15 GB per target IP per 24 hours

Load Impact ADVANCED:

- 500 GB data transfer per 30 days
- 200 GB per target IP per 30 days
- 50 GB per target IP per 24 hours

Now, the data transfer limits are:

Load Impact BASIC:

- 50 GB data transfer per 30 days
- 50 GB per target IP per 30 days
- 10 GB per target IP per 24 hours

Load Impact PROFESSIONAL:

- 300 GB data transfer per 30 days
- 300 GB per target IP per 30 days
- 50 GB per target IP per 24 hours

Load Impact ADVANCED:

- 1000 GB data transfer per 30 days
- 1000 GB per target IP per 30 days
- 200 GB per target IP per 24 hours
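
Since three limits apply at once, the amount you can still transfer to a given target is set by whichever limit is closest to being hit. Here is a minimal sketch of that arithmetic, using the new limits listed above. The names (`Quota`, `NEW_QUOTAS`, `remaining_gb`) are made up for illustration, and this is not a description of how Load Impact enforces quotas internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    total_30d_gb: int    # all targets combined, per 30 days
    per_ip_30d_gb: int   # one target IP, per 30 days
    per_ip_24h_gb: int   # one target IP, per 24 hours


# New limits, copied from the lists above.
NEW_QUOTAS = {
    "BASIC": Quota(50, 50, 10),
    "PROFESSIONAL": Quota(300, 300, 50),
    "ADVANCED": Quota(1000, 1000, 200),
}


def remaining_gb(plan, used_total_30d, used_ip_30d, used_ip_24h):
    """Return how many GB can still be sent to one target IP right now.

    The tightest of the three limits decides (hypothetical helper).
    """
    q = NEW_QUOTAS[plan]
    headroom = min(
        q.total_30d_gb - used_total_30d,
        q.per_ip_30d_gb - used_ip_30d,
        q.per_ip_24h_gb - used_ip_24h,
    )
    return max(0, headroom)
```

For a fresh BASIC account, `remaining_gb("BASIC", 0, 0, 0)` is 10, because the 24-hour per-IP limit is the tightest one.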

 

Read more about the usage quotas for the different account types

Wordpress load testing part 3 - Multi language woes

Understanding the effects of memory starvation.

This is the third part in a series of posts about Wordpress and performance. In part 1, we took a look at Wordpress in general. In part 2 and part 2.5 we reviewed a couple of popular caching plugins that can boost performance. In this part, we'll start looking at how various plugins can have a negative effect on performance and if anything can be done about it.

In the comments for one of the previous posts in this series, Yaakov Albietz asked us if we used our own service www.loadimpact.com for the tests. I realize that I haven't made that very obvious, but yes, absolutely, we're exclusively using our own service. The cool thing is that so can you! If you're curious about how your own web site handles load, take it for a spin using our service. It's free.

We started out by looking for plugins that could have a negative effect on Wordpress performance, asking ourselves: what are the typical properties of a poorly performing plugin? It's not as obvious as one might think. We installed, tested and tinkered with plenty of suspects without finding anything really interesting to report on. But as it happens, a friend of a friend had just installed the Wordpress Multi Language plugin and noted some performance issues. Worth taking a look at.

The plugin in question is Wordpress Multi Language (WPML). It's got a high rating among the Wordpress community, which makes it even more interesting to have a look at. Said and done, we installed WPML and took it for a spin.

The installation is really straightforward. As long as your file permissions are set up correctly and the Wordpress database user has permission to create tables, it's a 5-6 click process. Install, activate, select a default language and at least one additional language, and you're done. We were eager to test, so as soon as we had the software in place, we did our first test run on our 10 post Wordpress test blog. Here's the graph:

Average load times 10 to 50 users

Oops! The baseline tests we did for this Wordpress installation gave a 1220 ms response time when using 50 concurrent users. We're looking at something completely different here. At 40 concurrent users we're getting 2120 ms and at 50 users we're all the way up to 5600 ms, or 5.6 seconds. That needs to be examined a bit more.

Our first suspicion was that WPML would put additional load on the MySQL server. Our analysis was actually quite simple. For each page that needs to be rendered, Wordpress now has to check if any of the posts or pages that appear on that page have a translated version for the selected language. WPML handles that magic by hooking into the main Wordpress loop. The hook rewrites the MySQL query about to be sent to the database so that instead of a simple "select foo from bar" statement (oversimplified), it's a more complex JOIN that would typically require more work from the database engine. That makes it a prime suspect for performance degradation unless it's carefully written and matched with sensible indexes.
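
To make the difference concrete, here is a small sketch of the two kinds of query. It uses SQLite through Python so it can be run anywhere. The tables `posts` and `translations` and their columns are invented for the example and are not the real Wordpress or WPML schema, and the JOIN is only an illustration of the general shape, not the query WPML actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
-- Hypothetical table recording which language each post is written in.
CREATE TABLE translations (post_id INTEGER REFERENCES posts(id), language TEXT);
-- An index on the columns used by the JOIN keeps the lookup cheap.
CREATE INDEX idx_translations ON translations (language, post_id);
INSERT INTO posts VALUES (1, 'Hello'), (2, 'Hej'), (3, 'World');
INSERT INTO translations VALUES (1, 'en'), (2, 'sv'), (3, 'en');
""")

# The simple "select foo from bar" case: one table, no language filtering.
plain = conn.execute("SELECT id, title FROM posts ORDER BY id").fetchall()

# The rewritten case: every post must be matched against the language table.
joined = conn.execute(
    """
    SELECT p.id, p.title
    FROM posts AS p
    JOIN translations AS t ON t.post_id = p.id
    WHERE t.language = ?
    ORDER BY p.id
    """,
    ("en",),
).fetchall()

print(plain)
print(joined)
```

The second query does strictly more work per page view, which is why we expected the database to be the bottleneck.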

So we reran the test. While it was running, we sat down and had a look at the server to see if we could easily spot the problem. In this case, looking at the server means logging in via ssh and running the top command (if it had been a Microsoft Windows box, we'd probably have used the Sysinternals Process Explorer utility) to see what's there. Typically, we'd want to know if the server is out of CPU power, RAM or some combination of the two. We were expecting to see the mysqld process consume lots of CPU and confirm our theory above. By just keeping an unscientific eye on top and writing down the rough numbers while the test was running, we saw a very clear trend, but it was not related to heavy mysqld CPU usage:

20 users: 65-75% idle CPU 640 MB free RAM
30 users: 50-55% idle CPU 430 MB free RAM
40 users: 45-50% idle CPU 210 MB free RAM
50 users: 0%   idle CPU  32 MB free RAM

As more and more users were added, we saw CPU usage go up and free memory go down, as one would expect. The interesting thing is that at 50 users, memory was extremely scarce and the CPU had no idle time at all. Memory consumption increases in a linear fashion, but CPU usage suddenly peaks. That sudden peak in CPU usage was due to swapping. When the server comes to the point where RAM is running low, it's going to do a lot more swapping to disk, and that takes time and eats CPU. With this background information in place, we just had to see what happened when going beyond 50 users:

That's very consistent with what we could have expected. Around 50 concurrent users, the server is out of memory and there's a lot of swapping going on. Increasing the load above 50 users will make the situation even worse. Looking at top during the later stages of this test confirms the picture. The kswapd process is using 66% of the server CPU resources and there's a steady queue of apache2 processes waiting to get their share. And let's also notice that mysqld is nowhere to be seen (yes, this image is only showing the first 8 processes, you just have to take my word for it).

 

 

The results from this series of tests are not WPML specific but universal. As we put more and more stress on the web server, both memory and CPU consumption will rise. At some point we will reach the limit of what the server can handle and something has got to give. When it does, any linear behavior we may have observed will most likely change into something completely different.
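
The free RAM figures we jotted down from top fall almost on a straight line, so a rough fit shows where the memory would run out. Here is a minimal sketch using only the four readings above (a plain least-squares line, nothing more scientific than our eyeballing of top):

```python
# Rough readings taken from top during the test (see the figures above).
users = [20, 30, 40, 50]
free_ram_mb = [640, 430, 210, 32]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(users, free_ram_mb)
users_at_zero_ram = -intercept / slope  # where the fitted line crosses 0 MB

print(f"{-slope:.1f} MB per added user, out of RAM near {users_at_zero_ram:.0f} users")
```

The fit gives roughly 20 MB of RAM per added user and puts the point where free RAM hits zero at about 51 users, which matches where the response times fell off the cliff.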

There isn't anything wrong with WPML, quite the opposite. It's a great tool for anyone who wants a multi language website managed by one of the easiest content management systems out there. But it adds functionality to Wordpress and in order to do so, it uses more server resources. It seems WPML is heavier on memory than on CPU, so we ran out of memory first. It's also interesting to see that WPML is actually quite friendly to the database. At no point during our tests did we see MySQL consume noticeable amounts of CPU.

 

Conclusion 1: If you're interested in using WPML on your site, make sure you have enough server RAM. Experience of memory requirements from "plain" Wordpress will not apply. From the top screenshot above, we conclude that one apache2 instance running Wordpress + WPML will consume roughly 17 MB of RAM. We haven't examined how that differs with the number of posts, number of comments etc, so let's use 20 MB as an estimate. If your server is set up to handle 50 such processes at the same time, you're looking at 1000 MB just for Apache. So bring out your calculators and work out how much memory your server will need by multiplying the peak number of users you expect by 20 MB.
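
If you'd rather not bring out the calculator, the same back-of-the-envelope estimate looks like this. The 20 MB per process figure is our rough estimate from above, not a measured constant, and the function names are just for illustration.

```python
def apache_memory_mb(peak_processes, mb_per_process=20):
    """RAM needed for Apache alone at the given number of processes."""
    return peak_processes * mb_per_process


def max_processes(ram_for_apache_mb, mb_per_process=20):
    """How many apache2 processes fit in the RAM you can spare for Apache."""
    return ram_for_apache_mb // mb_per_process


print(apache_memory_mb(50))   # the 50-process example from the text
print(max_processes(1024))    # processes that fit in 1024 MB
```

Remember to leave room for MySQL and the operating system on top of whatever this gives you.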

Conclusion 2: This blog post turned out a little different than we first expected. Instead of blaming poor database design, we ended up realizing that we were watching a classic case of memory starvation. As it turned out, we also showed how we could use our load testing service as a reliable source of traffic to create an environment where we could watch the problem as it happens. Good stuff, and something that will appear as a separate blog post shortly.

 

Feedback

We want to know what you think. Are there any other specific plugins that you want to see tested? Should we focus on tests with more users, more posts in the blog, more comments? Please comment on this post and tell us what you think.

 

Scheduled tests

Keep your server awake while you are sleeping!

Today, we launched a new Load Impact premium feature - scheduled tests. You can now configure a test that is run at some specific time in the future, or you can configure it to run once every day, or once every week.

This functionality is useful to people who want to run a load test during low-traffic hours (often in the middle of the night, or early mornings) but who don't want to sit up and press the "start" button at the desired time.

It is also good if you want to run the same load test repeatedly, maybe once a week, to get a history of the performance of your site.
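
To illustrate the three options (run once, every day, or every week), here is a small sketch of how the next run time can be worked out. This is only an illustration of the idea. The names `REPEAT_INTERVALS` and `next_run` are made up, and it does not describe how our scheduler is actually implemented.

```python
from datetime import datetime, timedelta

# Hypothetical repeat settings matching the three options described above.
REPEAT_INTERVALS = {
    "once": None,
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_run(first_run, repeat, now):
    """Return the next start time at or after `now`, or None if there is none."""
    if now <= first_run:
        return first_run
    interval = REPEAT_INTERVALS[repeat]
    if interval is None:
        return None  # a one-off test that has already run
    elapsed = now - first_run
    periods = -(-elapsed // interval)  # ceiling division on timedeltas
    return first_run + periods * interval


first = datetime(2010, 1, 4, 3, 0)   # a test first scheduled for 03:00
now = datetime(2010, 1, 6, 12, 0)
print(next_run(first, "daily", now))
print(next_run(first, "weekly", now))
```

With these example dates, the daily test next runs at 03:00 on 7 January and the weekly one at 03:00 on 11 January.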

Scheduled tests are accessed through a new option in the main menu

 

The test scheduling screen shows your currently scheduled tests, and lets you schedule new ones. Note that if you schedule repeating tests, your account will accumulate test results over time. As an account can only store a certain number of test results, you may choose to have new scheduled tests delete old test results, if necessary, to make room on your account.

Test scheduling screen

 

Infinite slashdotting

How many times can you get slashdotted?

Slashdotting (or "the slashdot effect") is a term named after the site slashdot.org. It means that some big website/blog/news site writes an article about you, causing tons of their visitors to take a sudden interest in your website, and causing your site to get overloaded with all the new visitors.

Slashdot has been around for a long time, but now there are many other sites that are big enough to cause a slashdot effect when they publish an article about something. One such site is the Russian web developer site habrahabr.ru where they call the effect the "habr effect".

We have had extensive experience with the "habr effect". We were first mentioned on habrahabr.ru in February 2009, which caused some traffic to come our way. However, it seems habrahabr.ru has increased its number of readers substantially since then - they seem to have around 10x as much traffic today as they did in early 2009, according to Alexa.

So, on Wednesday they published an article about the importance of load testing as a way to avoid the habr effect. They suggested people use Load Impact for their load testing (which we think is a splendid idea, of course). This resulted in a lot of people coming to our site to try out our load testing service.

This wouldn't have been a problem under normal circumstances, but an unknown bug in our frontend code had made it possible to start an unlimited number of load tests. We never noticed this under normal traffic conditions, but when several hundred new visitors arrived at the same time from habrahabr, and many of them tried to start a free load test, we suddenly found ourselves executing close to 200 load tests at the same time!

Our system was having problems: We had been habr'd (slashdotted) because of an article about how to avoid getting habr'd.

The effect the habrahabr article had on our (concurrent) site visitors

After some frantic bug-hunting, we found and fixed the frontend bug, and things started working much better.

Then we thought "hey, this was somewhat funny". We decided to write a blog article about the load testing people who got overloaded because of an article saying you should use their load testing service to avoid getting overloaded.

I wrote and published that article on our blog yesterday (Thursday).

Today (Friday) habrahabr picked it up and published a link to it. Guess what happened?

Yes, same thing (but a longer spike this time, so more traffic)

 

So today we have achieved something remarkable:

  • Today, we were habr'd because of an article about being habr'd because of an article about how to avoid being habr'd!

 

All programmers out there will love the recursion, I bet.

 

Note: to be honest we didn't have much problem with the traffic today, even though it peaked at more visitors per hour than we normally get in a whole day.

 

 

Habrahabr перегрузка! (Habrahabr overload!)

Russian site habr's Load Impact!

Yesterday, the Russian site habrahabr.ru published an article warning people about the habr effect (see slashdot effect) and suggesting that it was prudent to use Load Impact to load test your site before getting swamped by traffic due to some popular blog or news site (like habrahabr.ru) publishing an article about you.

This, of course, caused us to get habr'd!

We found that our system was suddenly struggling to keep up, and even though we could see that we had a big traffic spike, we didn't at first understand why the machines were having such a hard time. This is what our concurrent (simultaneous) visitor graph for the past week looks like:

Now let's see, is it possible to determine when the habrahabr article was published? Tricky.

As can be seen, our average traffic this past week has been about 30 concurrent visitors, with a max of around 45 users on the site at the same time. When Habrahabr published the article we suddenly got close to 200 concurrent visitors.

Now, our system is designed to handle more visitors than that. We have had some 300 or so concurrent visitors in the past, when articles have been published about us, but it has not caused a very big problem for our servers. Yesterday, everything slowed down to a crawl, which was very strange.

It all turned out to be due to a malfunction in our test queueing system. As each load test can require quite a lot of system resources to run, we have a queueing system that makes sure we don't try to run too many load tests at the same time. Normally, we will only allow about a dozen load tests to run concurrently. But as it turned out, the queueing system was malfunctioning, and let visitors run as many load tests as they pleased. Under normal traffic conditions, we didn't notice the problem, but when 200 habrahabr.ru visitors all started load tests at the same time, our system suddenly got quite busy.
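
For those curious about what such a queue is supposed to do, here is a minimal sketch of the idea: run at most a dozen tests at once and make everyone else wait their turn. It is an illustration only. The class and method names are invented, and our real queueing system is not shown here.

```python
from collections import deque


class LoadTestQueue:
    """Toy admission queue: at most `max_running` tests run at a time."""

    def __init__(self, max_running=12):
        self.max_running = max_running
        self.running = set()
        self.waiting = deque()

    def submit(self, test_id):
        """Start the test if there is a free slot, otherwise queue it."""
        if len(self.running) < self.max_running:
            self.running.add(test_id)
            return True
        self.waiting.append(test_id)
        return False

    def finish(self, test_id):
        """Free a slot and start the test that has waited the longest."""
        self.running.discard(test_id)
        if self.waiting and len(self.running) < self.max_running:
            self.running.add(self.waiting.popleft())


queue = LoadTestQueue()
for visitor in range(200):           # 200 visitors all press "start"
    queue.submit(visitor)
print(len(queue.running), len(queue.waiting))
```

With the limit working, 200 simultaneous requests mean 12 running tests and 188 waiting ones. With the limit broken, as it was for us, all of them run at once.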

At one point there were 180 load tests running at the same time - we were load testing close to 200 sites at once! (That must be some kind of new record.)

Luckily, practically all of these were small (free) tests and our load generator nodes were actually almost idling despite the excessive number of tests running. The loadimpact.com website, however, had problems. The database in particular had trouble keeping up with all the writes caused by test results flowing in from so many concurrent load tests.

This situation went on between about 2 pm (Moscow time - no, we're not Russian, but most of the visitors from habrahabr.ru are) and 4 pm, when we found and fixed the problem with the queueing system, causing the number of running tests to go down to normal levels again. So to any of you out there who tried to use Load Impact or run a load test yesterday between 2 and 4 (noon and 2 pm UTC, or early morning in the US), please excuse us and please try again!

 

Another update

Some people seem to have misunderstood the numbers and what actually happened. I'll try to describe it in other words.

What we initially thought was that we just had a website visitor spike of about 200 concurrent visitors (don't confuse this with visitors per hour, HTTP requests, or visitors per day - see this article for an explanation). We couldn't understand why our system was so slow when it has been designed for up to 300 or maybe even 400 concurrent visitors (10-20x our normal traffic).

As it turned out, it wasn't the number of visitors on our website that caused our system to slow down. It was the number of load tests we were running.

People can run free load tests from our start page, and each free load test we execute means that we start up to 50 concurrent simulated users that access an external website that is to be load tested. Those 50 simulated users might load thousands of objects/resources from the external site, and the load test continuously updates our master database with information about how fast different objects on the external site are delivered to the simulated users. We can get hundreds of such load time results per second from a single load test, all of which go into the database.

Normally, we allow about a dozen concurrent load tests, but in this case a software bug made it possible to start an unlimited number of load tests. As most of the visitors were web developers interested in load testing, most of them started a free load test for their site. This meant that we at one point had about 200 load tests running at the same time, generating probably tens of thousands of database updates per second. This was more than our database server could comfortably handle.
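
The arithmetic is simple enough to sketch. The figure of 100 results per second per test below is an assumed stand-in for "hundreds"; we have not published an exact number.

```python
def db_updates_per_second(concurrent_tests, results_per_test_per_second):
    """Total write rate hitting the database from all running load tests."""
    return concurrent_tests * results_per_test_per_second


ASSUMED_RESULTS_PER_TEST = 100  # assumption: low end of "hundreds per second"

normal = db_updates_per_second(12, ASSUMED_RESULTS_PER_TEST)    # about a dozen tests
habred = db_updates_per_second(200, ASSUMED_RESULTS_PER_TEST)   # the habr spike
print(normal, habred)
```

Even with that conservative assumption, the spike means around 20,000 updates per second instead of about 1,200.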

Like everyone else, we have to judge what performance levels our systems should be able to handle, and build things accordingly. We try to make sure our system can handle at least 10 times the normal average traffic, which usually makes us able to handle a Habr or Slashdot effect, but in this case a silly little bug killed one of our most basic performance-protecting features, which kind of put a spanner in the works, so to speak.

 

 

 

 

 

 

 

 
