The most critical event in a project's life is when users get their first glimpse of it in production. Though they may be seeing all of the functionality they specified up front, they are most often bothered by the performance of the application. As important as performance is, it is an aspect of functionality that is often not specified as part of the requirements, nor is it often fully considered in the QA process. As is the case with any untested aspect of an application, it may work or it may not. The only way to truly know is to measure. In this article, we will explore how we can understand the user experience so that performance doesn't spoil that first impression or, even worse, cause the users to reject the application.
It is only as release deadlines approach that most applications finally enter QA and user acceptance processes. The QA process necessarily focuses on making sure the functional requirements have been met and that as many bugs as possible can be identified. However, performance requirements are rarely specified and hence performance is an aspect of QA that is often ignored. It is far too common to see that performance is only understood once the application has been deployed into production. Once in production, development teams are often forced to react to a complex set of circumstances that are expensive in both time and political capital. Moreover, they often create a lot of friction within and between teams. However, most of these difficulties are easily avoided if the QA team takes the time to understand the user experience prior to deployment. This implies that QA has performance requirements. Failure to provide clear performance goals leads to vague sets of tests being defined (if any are defined at all) which leaves the entire acceptance process at risk. It doesn't have to be this way.
Benchmarking for Quality Assurance
When constructed with care, a benchmark can be the difference between identifying the aspects of an application that are performing poorly and having your users find them for you. The key phrase in that statement is "constructed with care." Let's eliminate the ambiguity in this phrase by defining some key components and their roles in the construction and execution of a successful benchmark. These components include the application, data, external data sources, the underlying hardware platform, a test harness and, last but not least, a set of performance requirements.
If performance is an important requirement of your application then it should be specified along with every other requirement. The importance of specifying performance requirements is twofold. First, it places the responsibility for specifying this requirement where it belongs, on the project sponsors. Second, it sets expectations regarding performance so that there should be no surprises for anyone.
Consider the case where the user has asked for a fairly complex calculation. What typically happens is that the development team delivers the functionality with a response time that does not meet user expectations. If the performance requirement had been stated up front, the developers might have been able to deliver a more satisfying result. At the very least, they could have told the users that they would not be able to meet their expectations or that to do so would be very expensive. By working together, the development team and users should be able to hash out a reasonable set of performance requirements. Quite often users will not know how to set a performance requirement. In this case it may be best to start by identifying response times that are clearly unacceptable and then work from there. The important thing is to start the process.
There is not much one can do to performance test an application before it is fully functional. However, one can and should performance test each of the components that will be used to build the final product. By taking this extra step during development, potential problems can be identified and dealt with very early on in the process. It goes without saying that the earlier problems can be identified, the better chance you have of dealing with them in your normal day-to-day activities.
To further illustrate the point, consider the scenario where you have a query that is required to complete within one second when receiving two requests per second. If a supporting component is taking one second to respond under this load in an isolated test then you can be sure that you will exceed your performance target in the integrated system.
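The arithmetic behind this scenario can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the one-second component time is the isolated measurement described above, and the concurrency figure comes from Little's Law (average requests in flight equals arrival rate times response time).

```python
def remaining_budget(target_s, component_times_s):
    """Time left in the response-time budget once measured components are counted."""
    return target_s - sum(component_times_s)


def requests_in_flight(arrival_rate_per_s, response_time_s):
    """Little's Law: average concurrency = arrival rate x response time."""
    return arrival_rate_per_s * response_time_s


target_s = 1.0           # requirement from the text: respond within one second
arrival_rate_per_s = 2   # requirement from the text: two requests per second
component_s = [1.0]      # isolated measurement of the supporting component

left = remaining_budget(target_s, component_s)
busy = requests_in_flight(arrival_rate_per_s, component_s[0])
print(f"budget left for the rest of the system: {left:.2f}s")
print(f"average requests inside the component: {busy:.1f}")
```

With nothing left in the budget, any time spent elsewhere in the integrated system pushes the query past its target.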
However useful this technique is, there are classes of performance problems that won't appear until the entire application can be tested. For example, you may have two components that are competing for a common resource. This competition may introduce contention that would not be present when either of the components was tested in isolation. The resource in contention may be something as simple as CPU. If this is the case, the solution could also be as simple as adding more CPU.
However useful performance testing is while the application is being developed, one must be aware that there is a level of granularity below which performance testing becomes less useful. For example, testing how particular coding features perform is often called micro-performance benchmarking. Micro-performance benchmarks are very difficult to get right and hardly ever provide more than a marginal gain in performance. However, they may prove to be quite useful in helping you resolve micro-level performance problems that have already been identified.
One of the difficulties that developers face when testing is having access to all of the external data sources that they may need. For example, an online shopping system may need access to a credit card charge service provided by a third party. Since that service is most likely not available to developers, they will need to fake it (using a mock object). The danger in performance testing a system whose external data sources and services have been mocked out is that the mock object will perform differently than the real service. The biggest danger is that the mock will return to the caller immediately, without accounting for timing and other issues.
Consider this sequence of events needed to charge a customer's credit card for a purchase.
1. The card holder's data is copied into a form
2. A connection to the credit card service is located
3. The charge service is invoked using a call via an (SSL) socket
4. The caller waits for confirmation or denial
5. The value is returned and the client is released
Steps 1, 4 and 5 should have the same performance consequences between a real and mocked system. Though it may not be the case, let us put step 2 into the same category as steps 1, 4 and 5.
In step 3, the real implementation may use a web service call to another service running on another part of the planet. In order to make the call, the data should be encrypted. We then have to wait for the response before deserializing it into a result. Unless the mock simulates all of this activity, your benchmark will be inaccurate. More importantly, performing a credit card transaction takes time, and unless the mock "waits," the query will run faster than it should. This can have the downstream effect of creating an artificial bottleneck in the application. It will also leave the network untouched, which may also skew results.
Differences between the mock and the real implementation can put the credibility of the entire benchmark at risk. Moreover, a bad mock can create false bottlenecks, which may cause developers to chase problems that don't really exist.
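One way to keep a mock from returning instantly is to have it wait for roughly as long as the real service would. The sketch below is a minimal illustration, not a real payment API: the class name, method signature and latency figures are all hypothetical, and the mean latency would have to come from measurements of the actual service. It also does not simulate encryption or network traffic, so those differences remain.

```python
import random
import time


class MockChargeService:
    """Hypothetical stand-in for a third-party charge service (names are illustrative)."""

    def __init__(self, mean_latency_s, jitter_s=0.0, sleep=time.sleep, seed=None):
        self.mean_latency_s = mean_latency_s
        self.jitter_s = jitter_s
        self._sleep = sleep          # injectable so unit tests need not really wait
        self._rng = random.Random(seed)

    def charge(self, card_number, amount):
        # Draw a delay around the configured mean; never sleep a negative time.
        delay = max(0.0, self._rng.gauss(self.mean_latency_s, self.jitter_s))
        self._sleep(delay)           # the caller blocks, as it would on the real service
        return {"approved": True, "simulated_latency_s": delay}


# Assumed figures for illustration only: 50 ms mean, 10 ms jitter.
service = MockChargeService(mean_latency_s=0.05, jitter_s=0.01, seed=7)
start = time.perf_counter()
result = service.charge("TEST-CARD", 19.99)
elapsed = time.perf_counter() - start
print(f"approved={result['approved']} after {elapsed:.3f}s")
```

Passing the sleep function in as a parameter is a design choice that lets the same mock be used in fast functional tests and in timing-sensitive benchmarks.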
If you are doing a production level test on a system then the database must be configured and populated to a production level. Anything less and there is a strong chance that the many optimizations that have been worked into database technology, such as caching, will hide problems that would appear at production scale.
The two biggest hardware impacts on database performance are the amount of available memory and disk performance. More memory means that databases can avoid making trips to disk. Because of its relatively slow access speeds, a trip to the disk is the last thing that a database wants to do. And this brings us to the second concern, disk access speeds. Whatever disk configuration (size, speed, transfer rates, number of controllers, etc.) is in production, it should be the same in your test environment. Anything less and you will need to take care if you are dealing with a system that is bound by I/O.
As we've just stated, the amount of memory that is available to the database can have a major impact on performance. Simply stated, more memory means that more data can be kept in memory which should translate into fewer trips to the disk. The danger is that if the database contains too little data then after an initial warm-up, the database will never have to read from disk again. Even if there is enough data to induce some disk activity, it may not be enough to induce problems related to disk activity.
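The effect of an under-populated database can be illustrated with a toy simulation. This is a sketch under loud assumptions: a simple least-recently-used cache stands in for the database's memory, rows are requested uniformly at random, and the row counts are arbitrary. Real databases use far more sophisticated caching, so only the shape of the result matters.

```python
import random
from collections import OrderedDict


def disk_reads_after_warmup(dataset_rows, cache_rows, requests=10_000, seed=1):
    """Count simulated disk reads once every row has been touched at least once."""
    rng = random.Random(seed)
    cache = OrderedDict()  # least recently used entries sit at the front

    def read(row):
        if row in cache:
            cache.move_to_end(row)     # cache hit: no trip to disk
            return False
        cache[row] = True              # cache miss: row is fetched from "disk"
        if len(cache) > cache_rows:
            cache.popitem(last=False)  # evict the least recently used row
        return True

    for row in range(dataset_rows):    # warm-up pass over the whole data set
        read(row)
    return sum(read(rng.randrange(dataset_rows)) for _ in range(requests))


small = disk_reads_after_warmup(dataset_rows=1_000, cache_rows=5_000)
large = disk_reads_after_warmup(dataset_rows=50_000, cache_rows=5_000)
print(f"test-sized data: {small} disk reads, production-sized data: {large} disk reads")
```

When the data set fits entirely in the cache, the simulated disk is never touched after warm-up, which is exactly the situation that makes a small test database misleading.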
The perfect hardware to use is that which is in your production system. The reason is that it has exactly the capacity that you need to test against and it is configured exactly the way you need it to be. With any deviation from the hardware in your production system, you run the risk of either creating an artificial bottleneck or moving an existing one somewhere else. Any way you slice it, you are in danger of chasing phantom performance issues while the real ones remain hidden.
Why this is so is a question of how system capacity is utilized. All hardware has a finite capacity. The CPU processes a set number of instructions per second. The network can move a finite quantity of data per second. Trouble begins when you saturate a system beyond its capacity. Take the following scenarios for example.
Have you ever run into the situation where you have a performance problem in production but you can't reproduce it in QA, or vice versa? It is common practice to have test environments that do not have the same capacity as the production system. This can result in some strange and confusing observations. For example, add an extra CPU to the production system and you could see longer response times than what you see in QA. This may seem like a paradox but is in fact a normal reaction for applications that are network bound. More CPU capacity allows the application to put more pressure on the network. Once that pressure crosses a certain threshold, the whole network will start thrashing and in the process kill response times. In this instance less CPU is actually better.
Another much simpler example is having Gigabit network capacity in production and Megabit in QA. If your application's network utilization is more than what the Megabit network can handle but under the limits of the Gigabit then you've just created an artificial bottleneck. But it is worse than that as this artificial bottleneck will work to hide all others.
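A quick utilization check makes the artificial bottleneck visible. The numbers here are assumptions chosen purely for illustration: a 100 Mbit/s link in QA, a 1 Gbit/s link in production, and an application that needs 300 Mbit/s.

```python
def utilization(demand_mbit_s, capacity_mbit_s):
    """Fraction of link capacity the application asks for (1.0 or more means saturated)."""
    return demand_mbit_s / capacity_mbit_s


demand_mbit_s = 300  # hypothetical application demand
links = {"QA": 100, "production": 1000}  # assumed capacities in Mbit/s

for name, capacity_mbit_s in links.items():
    u = utilization(demand_mbit_s, capacity_mbit_s)
    status = "saturated" if u >= 1.0 else "ok"
    print(f"{name}: {u:.0%} of capacity ({status})")
```

Under these assumptions the QA network is asked for three times what it can carry, so every measurement taken there reflects the network rather than the application.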
The lesson here is that it is very difficult to take benchmarking results from one environment and extrapolate them to another. The only way to be sure is to test in the environment that you intend to deploy to.
It goes without saying that the reason for benchmarking is to measure some aspect of a system's performance. However, we can't escape the fact that the simple act of taking a measurement will impact performance and hence impact our measurement. Just how big this impact is depends on what, how much, and how frequently we are measuring. What we need to do is minimize the effects of measuring during the run. This can be achieved by following a few simple rules.
- Use tools that impose minimal overhead
- Take measurements from the edge
- Process measurements "out of band"
- Understand what you are measuring for
- Measure only what you need to and no more
- If you need to measure many things measure them one at a time
- Don't use tools that compete for the resources your application is dependent upon
The Unix command line utility vmstat is a lightweight tool that can be used to monitor the health of the Unix kernel. To use it you simply type something like "vmstat 5" on the command line and it will happily dump the current state of the performance counters to a console window every five seconds. While vmstat itself imposes little performance penalty, writing to a console window is time consuming and should be avoided. The penalty is hardly noticeable when running vmstat, but writing logging information to a console window can harm your application's performance. In this case it is better to write the data to a file and then run "tail -f" to view the messages. This is also an example of processing the messages "out of band."
The "measure only what you need" and "measure one thing at a time" rules are tied to the "impose minimal overhead" rule. Taking too many measurements will result in a cumulative drag on system performance. Another way to minimize the drag due to monitoring is to take our measurements on the outer edges of the application. For example, taking a user response time in the test harness should not interfere with the inner workings of the application and as such should impose no performance penalty whatsoever. Moreover, if you are interested in user response times, then you should not be logging database response times until you find evidence that says you should be.
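Measuring from the edge and processing out of band can be combined in a harness-side timer. The following is a minimal sketch, not a full harness: `fake_request` is a hypothetical placeholder for a real call into the system under test, and the report is written to an in-memory stream where a real run would more likely use a file.

```python
import io
import time


def timed(request_fn, *args):
    """Time one request from the caller's side; the application is not instrumented."""
    start = time.perf_counter()
    response = request_fn(*args)
    return response, time.perf_counter() - start


def run_benchmark(request_fn, request_count):
    samples = []  # kept in memory; nothing is written while the run is in progress
    for i in range(request_count):
        _, elapsed_s = timed(request_fn, i)
        samples.append(elapsed_s)
    return samples


def write_report(samples, stream):
    """Out-of-band step: runs only after the benchmark has finished."""
    for elapsed_s in samples:
        stream.write(f"{elapsed_s:.6f}\n")


def fake_request(i):
    # Hypothetical placeholder for a real call into the system under test.
    return i * 2


samples = run_benchmark(fake_request, 100)
report = io.StringIO()  # a real run would more likely open a file here
write_report(samples, report)
print(f"recorded {len(samples)} samples")
```

Appending to a list during the run keeps the measurement cost small, and all formatting and I/O is deferred until the load has stopped.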
A test harness is a piece of software that can be configured to drive an application either as a user or an external application would. The most common test harnesses are those used to drive HTTP requests at a webserver.
In addition to making requests, a good test harness will also offer a number of essential services. These services include the ability to:
- Emulate hundreds if not thousands of users
- Throttle the rate at which requests are made
- Randomize the rates at which requests are made
- Perform client side tasks
- Randomize the parameters of the request
- Measure and report on the time needed to respond to requests
- Monitor other aspects of the system
Surprisingly, a misconfigured test harness can be the bottleneck in the system. What makes this particularly hideous is that bad results will almost always be attributed to the application and not the harness itself. Of course, looking in the wrong place makes the diagnosis particularly difficult. The first step in any benchmarking exercise is to make sure that the harness can deliver the specified load.
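One simple check is to record when the harness actually sent each request and compare the achieved rate with the target. This sketch assumes the harness can expose its send timestamps in seconds. The function names and the 5% tolerance are illustrative choices, and the timestamps below are synthetic rather than measured.

```python
def achieved_rate(send_times_s):
    """Requests per second actually delivered, computed from send timestamps."""
    if len(send_times_s) < 2:
        raise ValueError("need at least two timestamps")
    span_s = send_times_s[-1] - send_times_s[0]
    return (len(send_times_s) - 1) / span_s


def delivers_load(send_times_s, target_rate_per_s, tolerance=0.05):
    """True when the achieved rate is within the given fraction of the target."""
    deviation = abs(achieved_rate(send_times_s) - target_rate_per_s)
    return deviation <= tolerance * target_rate_per_s


# Synthetic timestamps: one harness keeps pace, the other falls behind.
on_pace = [i * 0.1 for i in range(101)]  # a request every 100 ms
lagging = [i * 0.2 for i in range(101)]  # a request every 200 ms

print(f"on pace: {achieved_rate(on_pace):.1f} req/s, ok={delivers_load(on_pace, 10)}")
print(f"lagging: {achieved_rate(lagging):.1f} req/s, ok={delivers_load(lagging, 10)}")
```

Running a check like this before looking at any application numbers helps rule out the harness as the bottleneck.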
Though it may not seem like it at times, computers are highly deterministic machines: if you give a computer the exact same instructions, it should produce a result in the same amount of time. However, we know that the computer is going to execute many other things and that these things will happen at random. Just how much other activity is taking place is what introduces variability in response times. If too many other things are happening during a benchmark, then our response times will become too corrupted (or noisy) to be meaningful. The question is, how do we know when our results are too corrupt to be meaningful? The answer lies in statistics.
It is said that statistics allow us to see that which would otherwise remain unseen. When applied to benchmarks, nothing could be closer to the truth. In benchmarking exercises the relevant statistics are the average, variance, median, minimum and maximum, and some like to also view the 90th percentile. In a perfect benchmark with no interference, the variance should be 0 and all other values should be identical. The fact that you will never see this doesn't mean that you shouldn't strive to achieve it. If you can't, then you need to determine why and then work to eliminate the interference.
For example, in a recent benchmarking exercise we were told that we were on an isolated network. Unfortunately this was not the case and yet we passed through almost the entire exercise (with the help of time zones) without seeing any interference. However there was one series of runs where the variance was high. It was this single statistic that resulted in our re-asking the question of network isolation. The answer allowed us to invalidate the results from that test and validate the results from all the other runs.
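The statistics named above are straightforward to compute from a list of response times. The sketch below uses Python's standard `statistics` module and a nearest-rank 90th percentile, which is one of several common percentile definitions. The two sample runs are made-up numbers used only to show how interference shows up in the variance; they are not results from the exercise described above.

```python
import statistics


def summarize(samples):
    """Summary statistics for one benchmark run (response times in seconds)."""
    ordered = sorted(samples)
    rank = (90 * len(ordered) + 99) // 100  # nearest-rank 90th percentile, 1-based
    return {
        "mean": statistics.mean(ordered),
        "variance": statistics.pvariance(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p90": ordered[rank - 1],
    }


# Made-up samples for illustration only.
quiet_run = [0.2] * 10          # the ideal: every response takes the same time
noisy_run = [0.2] * 9 + [2.0]   # one response disturbed by outside interference

for name, run in (("quiet", quiet_run), ("noisy", noisy_run)):
    stats = summarize(run)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

In the noisy run the median is unchanged while the variance and maximum jump, which is the kind of signal that prompts a closer look at isolation.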
Putting it all together
Now that we've looked at the different aspects of benchmarking, it is time to put them all together. There are two basic types of benchmarking exercises, one to determine response times under a given load and another to determine throughput. Each of these benchmarks needs to be configured differently.
If we want to measure throughput, then there will be a load that we can place on the server at which it produces an optimal value. The trick in conducting this type of benchmark is to find the point where the throughput starts falling off by adjusting the number of requests per second that you throw at the server. In normal system benchmarking we are interested in user response times given a specified load. From these descriptions we can see that whereas load is variable in the former benchmark, it is constant in the latter. The consequence is that we need to sacrifice user response time for throughput and throughput for user response time.
Another important aspect of putting it all together is the ability to maintain a server-centric view of the test. In a server-centric view we look at throughput as a function of the number of requests received within a fixed time frame. For example, we may want a load of 10 requests per second. Ultimately, how the server is loaded will be configured in the test harness, so we need to take care to ensure that the harness can consistently deliver the desired load. If the harness fails to perform then it will become the bottleneck in the system. Equally important is to ensure that requests arrive somewhat randomly. In other words, we may want 10 requests per second but we certainly don't want exactly one request every 100ms. This regular heartbeat of traffic can create anomalies that skew results. It is better to use a random distribution that has a mean inter-request arrival time of 100ms. This allows requests to bunch up and empty out in a more natural fashion.
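Randomized arrivals with a 100ms mean can be generated with an exponential distribution. That distribution is an assumption made here (the text only asks for a random distribution with the right mean), though it is a common choice for modeling independent arrivals. The function name is hypothetical.

```python
import random


def inter_arrival_gaps(rate_per_s, count, seed=None):
    """Exponentially distributed gaps whose mean is 1 / rate_per_s seconds."""
    rng = random.Random(seed)
    return [rng.expovariate(rate_per_s) for _ in range(count)]


# 10 requests per second on average, so the mean gap is about 100 ms.
gaps = inter_arrival_gaps(rate_per_s=10, count=20_000, seed=42)
mean_gap_ms = 1000 * sum(gaps) / len(gaps)
print(f"mean gap: {mean_gap_ms:.1f} ms, shortest: {1000 * min(gaps):.2f} ms, "
      f"longest: {1000 * max(gaps):.1f} ms")
```

A harness would sleep for each gap in turn before sending the next request, so some requests bunch together while others are spread out, yet the long-run rate stays at the target.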
Once everything is installed, isolated, and configured, you can run tests. While some may prefer to ramp up to the specified load, others find it useful to hit the server with the full load right up front. The advantage of the latter approach is that you will get an immediate reading on how close you are to your requirements. If requirements haven't been met, then one can use the results of the full benchmark to devise other tests that can help further isolate and identify trouble spots. Whatever approach you decide to use, you need to make sure that it is quickly moving you towards your goal of understanding what the users will experience when they start using the system.
Though this document only touches on the different aspects of benchmarking, it exposes deep issues that must be addressed. It also demonstrates the difficulties in creating a realistic schedule. Most often, back-of-the-envelope estimates are drawn up in a meeting and it is only after the first attempt that teams realize that there is more to it than they first anticipated. Benchmarks almost always produce surprising results and it is very difficult to create a proper schedule around surprises. Teams that consider all of the issues addressed in this article may still run into some surprises. However, it is unlikely that the number and severity of those surprises will be close to what is experienced by teams that do not take the time to understand the user experience.