(This posting was written by Erik Torsner and edited by Ragnar)
Iterative performance tuning with automated load testing, a case study
Erik Torsner, December 2008
I am the CTO of an e-commerce website built using Ruby on Rails. The site was developed by someone else and I sort of inherited it. The business side of operations is planning a major campaign to drive traffic to the site, including TV commercials on the nation's largest TV network. At some point, I was asked the inevitable question: Will the site handle the load? Having been in contact with the people at loadimpact.com earlier, I decided to find out if I could use their automated load testing service to verify, and perhaps improve, the performance of our site. This is the story of how it went.
Where are we?
The first thing I did was to try to establish a baseline. To set up my first test, I created an account on Load Impact and just typed in the URL for the front page of our website. The automatic test analyzes the content on the first page, to find out what objects (images etc) are needed to fully load the page. Then it starts a load test with a number of simulated clients (users) that access the page and all its dependencies (images etc) just like a real user would. The automatic test starts simulating 10 concurrent users accessing the page, then increases the number to 20, 30, 40 and finally 50 users accessing the page at the same time. All the while, Load Impact records the average page load time for the clients at the different load levels. The test will automatically stop when the average page load time doubles compared to what it was when the test started, so you are to some extent protected from causing too much harm to a live site.
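The stepped ramp and the automatic stop can be sketched in a few lines of Python. This is only an illustration of the idea, not Load Impact's actual implementation: the function name run_ramp, its parameters, and the measure callback (which stands in for a real load generator) are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Tuple


def run_ramp(
    measure: Callable[[int], float],
    start: int = 10,
    stop: int = 50,
    step: int = 10,
    abort_factor: float = 2.0,
) -> Tuple[List[Tuple[int, float]], Optional[int]]:
    """Step the load from start to stop and abort once load time doubles.

    measure(clients) is a hypothetical callback that returns the average
    page load time in seconds at that number of concurrent clients.
    Returns the recorded (clients, seconds) pairs and the client count
    at which the run was aborted, or None if the ramp completed.
    """
    results: List[Tuple[int, float]] = []
    baseline: Optional[float] = None
    for clients in range(start, stop + 1, step):
        seconds = measure(clients)
        results.append((clients, seconds))
        if baseline is None:
            # The first load level defines the reference load time.
            baseline = seconds
        elif seconds >= abort_factor * baseline:
            # Load time has at least doubled: stop ramping to spare the site.
            return results, clients
    return results, None
```

Passing a function that actually fetches the page with the given number of parallel clients as measure would turn this into a crude load test. With a fake callback it is mostly useful for seeing how the abort rule behaves at different step sizes.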
The result of this test was very valuable but also disturbing to me. The first execution showed that with 20 concurrent users, it took so long to load the page that the test was aborted. Below is a screenshot of what the graph looked like at the time the test was aborted. The average page load times exceeded 20 seconds.
I decided to run the test again, with 10-20 simulated users that increased in smaller steps, so I could find out more exactly at what load level the system stalled. In order to do this I created a standard test profile that I could go back to and run over and over again, just saving the results from each test. This means I can run the exact same test at any time I want in the future. Creating your own test configuration is not hard - click “new test”, give your test a name, and enter the URL of the page that you want Load Impact to test. Then click “analyze page” and Load Impact will show you the actual code that is sent to their load engine. If you want to, you can edit this code and add or remove individual objects (URLs). Our page includes script code from Google, and to make sure I measured our site and not Google's, I removed it.
I set the load level for my test to start at 10 simulated clients and to go up to 20 clients, increasing by 2 clients at each step. This is what happened when I ran the test:
The test got as far as 18 clients; at 20 clients it took too long and the test was aborted. It was nice to now have a baseline, but the results were even worse than I had expected. Now that I knew I had some serious problems, the next step was to try to find the bottlenecks and do something about them, aka performance optimization. The overall cycle is to 1) run load tests to identify the bottlenecks, 2) figure out what can be done about them, 3) change something, and 4) go back to step 1 and test again, to see if the change had any impact on performance. When working with performance optimization it's not uncommon to find that the one or two worst bottlenecks have a negative performance impact that is an order of magnitude bigger than the rest. So finding and fixing the worst bottlenecks will often just reveal a number of smaller ones. It's an iterative process where you need to measure-fix-measure over and over again.
Iteration number one
I got ready to run the test again. I changed the load level used in the test, as my previous attempts had shown that my site went belly-up almost exactly at 20 concurrent users. For my initial testing I therefore set the load level to start at 10 clients (simulated users) and go up to 15 clients, increasing the load by 1 client at each step.
Before I ran this second test, I opened up an SSH session to our web server and started top. top is more or less the same as the Task Manager in Windows or Activity Monitor in MacOS. It updates every three seconds and tells you which processes are using the most CPU and memory.
Watching top while Load Impact started to put load on my site revealed almost immediately that the worst bottleneck was the MySQL database, which consumed 80-90% of all available CPU resources. The average page load time for 10 users was 2.9 seconds and it just about doubled with 15 concurrent users. The next step was to figure out why. There are numerous guides online that help you optimize database performance so I'm not going to go into details about it. I began by using the tool mytop, which is really the same as the built-in top described above except that it reports what the MySQL server is doing. The tool was able to pinpoint the problem quite fast. There were a lot of exact-match searches in a table that didn't have decent indices. By looking at the query I realized that there was one TEXT field used in a WHERE clause that didn't have an index. So I simply added an index on that field and got ready for the next load test iteration.
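The effect of such an index is easy to reproduce on a small scale. The sketch below uses Python's built-in sqlite3 module rather than MySQL, purely so that it runs without a database server; the table products and the column sku are hypothetical, since the real schema is not shown here. In MySQL itself, an index on a TEXT column also needs a prefix length, something like CREATE INDEX idx ON tbl (col(32)).

```python
import sqlite3
from typing import Tuple


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return SQLite's query plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


def demo() -> Tuple[str, str]:
    """Show the plan for an exact-match lookup before and after indexing."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table standing in for the unindexed one in the article.
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO products (sku, name) VALUES (?, ?)",
        [(f"sku-{i}", f"Product {i}") for i in range(1000)],
    )
    lookup = "SELECT * FROM products WHERE sku = ?"
    before = query_plan(conn, lookup, ("sku-500",))  # full table scan
    conn.execute("CREATE INDEX idx_products_sku ON products (sku)")
    after = query_plan(conn, lookup, ("sku-500",))  # index lookup
    conn.close()
    return before, after
```

Printing the two strings returned by demo() shows the plan changing from a scan of the whole table to a search that uses idx_products_sku. The exact wording of the plan differs between SQLite versions.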
Iteration number two
Running the test again in Load Impact is just a matter of clicking the “Run” button, which makes it extremely simple to repeat a test to see if something about the site performance has changed. I still had top and mytop running so all I had to do was run the test again and see if the MySQL server process still used as much CPU. The average page load time for 10 users had dropped to 1.59 seconds and didn't seem to increase a lot going up to 15 users. I therefore changed the test to go up to 20 users, where bad things had happened previously. It turned out that the page load time was only slightly higher at 20 users, so I had clearly made a major improvement with the new database index. I again changed the load level used in my test so that it started at 10 users and went all the way up to 50 users, then re-ran the test, and what do you know - the page load time for 50 users topped out at 1.8 seconds. The MySQL process hardly consumed any CPU at all, averaging around 2-5% of available CPU time. The missing index was absolutely the big problem. I went back and did some fine tuning of other tables and indices before starting the next iteration.
Iteration 3-4
I did some fix-test-repeat iterations where I tuned and evaluated the performance of different indices. This is what the page load time for the 10-15 client test looked like afterwards. The curve had straightened out, indicating that the load level used didn't stress the site at all - it was much faster now than it had been previously.
I then ran another 10-50 client test and got this curve:
The end result was that the average page load time for 10 users was down to 1.29 seconds, and later tests showed that it stayed well under 2 seconds up until roughly 100 concurrent users. Instead of the database process consuming a lot of CPU, the process consuming the most CPU was now FastCGI. MySQL was way down the list. I noted, however, that sometimes when the load was around 80-100 concurrent users, I got sudden spikes where the page load time would rise to well over 10 seconds, only to fall back to normal levels again.
Iteration 5
The next thing to examine was the reason behind the sudden spikes in page load time. I prepared to run another test, but instead of going from 10 to 100 in steps of 10, I started the load at 70 concurrent users. The sudden spikes reappeared almost immediately. Watching the output from top revealed that the problem wasn't CPU usage anymore: during these spikes, the CPU load on the server was well under 20%. Instead it was I/O wait time that was causing the problem. I/O wait time almost always means reading from or writing to disk. In my case, since the web site isn't reading or writing large amounts of data to disk, the prime suspect is that the server is out of physical RAM and the disk I/O is due to heavy swapping. The server currently has 1 GB of RAM, so the next fix is to increase it. Server memory is actually not that expensive so we decided to install 8 GB of RAM and do another round of tests.
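On a Linux server, the I/O wait figure that top displays can also be computed directly from two samples of the aggregate cpu line in /proc/stat, which is handy for logging it during a test run. The sketch below assumes a Linux host; the function names are made up for this example.

```python
import time
from typing import List


def cpu_times(stat_line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(first: str, second: str) -> float:
    """Share of CPU time spent waiting for I/O between two samples, in percent."""
    a, b = cpu_times(first), cpu_times(second)
    deltas = [after - before for before, after in zip(a, b)]
    # Field order: user nice system idle iowait irq softirq steal guest ...
    # Guest time is already counted in user/nice, so sum only the first eight.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as handle:
        return handle.readline()


def sample_iowait(interval: float = 3.0) -> float:
    """Measure I/O wait over interval seconds (Linux only)."""
    first = read_cpu_line()
    time.sleep(interval)
    return iowait_percent(first, read_cpu_line())
```

Calling sample_iowait() in a loop while the load test runs gives a simple record of when the server is waiting on disk. High values combined with low CPU usage and little free memory point towards swapping.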
What happened after?
Fast-forward in time and we have now upgraded our server to a quad-core machine with 8 GB of RAM:
The curves now look OK with up to 400 simulated users (the page load time climbs in a fairly straight line from 1.5 seconds to a little below 4 seconds at 400 users). The server seems to be idling at peak load and the memory usage goes up to about 1.6 GB, so we now suspect the network link to be our limiting factor. We only have a 10 Mbit/s connection but the test with 400 concurrent users should generate peak loads higher than that (maybe 40-50 Mbit/s).
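A back-of-the-envelope estimate shows how a figure in that range can come about. The sketch below assumes that every simulated user reloads the page back to back with no pause, and it uses a hypothetical page weight of 50 KB, which is not a measured value from our site.

```python
def estimated_mbit_per_s(
    users: int, page_kilobytes: float, seconds_per_page: float
) -> float:
    """Rough bandwidth needed if every user reloads the page back to back."""
    pages_per_second = users / seconds_per_page
    bits_per_page = page_kilobytes * 1000 * 8
    return pages_per_second * bits_per_page / 1_000_000


# 400 users, a hypothetical 50 KB page, about 4 seconds per page load.
print(estimated_mbit_per_s(400, 50, 4.0))
```

With these assumed inputs the estimate comes out at 40 Mbit/s, several times more than a 10 Mbit/s link can carry, which is consistent with the network being the limiting factor.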
Conclusions
Overall, the experience has been that an automated load testing service like Load Impact, which can be run instantly just by pressing a button, is a great tool that significantly simplifies performance optimization work. Normally, it would be a pain to install, set up and configure a load testing environment, and chances are you would go for some quick-and-dirty solution that requires no learning - a shell script using wget maybe - which might or might not do what you want it to and where tests are hard to repeat under the exact same circumstances.
Using an automated online service like Load Impact means you're up and running in no time at all, with precious little learning involved. You're also able to quickly and without hassle execute many identical load tests in between testing different optimization strategies and tricks. It is a tool that I can really recommend. I started out intending to use it only for performance verification but ended up using it also for troubleshooting and optimization. In the future I will probably use it during actual development also, to get immediate feedback on what performance impact newly written code has on the overall solution. This might enable us to detect performance bottlenecks early, and spend less time working on bad code tracks that don't perform well. Testing regularly while developing also means we will have a pretty good grasp of the system's general performance characteristics come release day, and will feel more confident about meeting our performance targets.
Erik Torsner



