How to identify the breaking point using stress testing in LoadRunner?
Identifying the breaking point of a system helps us understand when the application starts behaving erratically or when it starts breaching the SLA objectives. In this article, I am going to share how I usually approach studying the breaking point of a system.
I used HPE LoadRunner and Dynatrace to study the system's behavior under unexpected load. Before we start with the approach, you need to understand what stress testing is and why it is required.
What is Stress Testing?
Stress testing helps us study the behavior of a system under unexpected load and identify its breaking point by injecting more than the normal load. Think of it as asking how much weight you can lift, or how fast you can run.
As the load is increased gradually beyond the normal level, the system eventually starts behaving erratically. Here, "erratically" means noticeable changes in response time, resource utilization, hits per second, and so on.
Stress Test Approach
After a successful load test at 250,000 transactions per hour, I started designing the stress test. I will explain the approach using web services as the example, where my objective was to go beyond that load and identify the breaking point.
Before starting the stress test, I performed the following activities:
- Recycled the environment
- Executed a warm-up test for a few minutes
- Made sure that the environment was isolated and that no one else was using it during the test
In LoadRunner, I designed a scenario with the following run-time settings:
- No think time
- No pacing time
- Simulate new user for each iteration
- Clear cache
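With no think time and no pacing, each VUser sends its next request as soon as the previous one completes, so the scenario is a closed loop. The interactive response time law, X = N / (R + Z), gives a rough estimate of the throughput such a loop produces. The sketch below is a minimal illustration under that assumption; the function name and the example numbers are hypothetical, not taken from this test.

```python
def closed_loop_throughput(vusers: int, response_time_s: float, think_time_s: float = 0.0) -> float:
    """Estimate requests per second for a closed workload.

    Interactive response time law: X = N / (R + Z), where N is the number of
    concurrent VUsers, R the mean response time and Z the think time.
    With no think time (Z = 0) every VUser issues requests back to back.
    """
    if vusers < 0:
        raise ValueError("vusers must be non-negative")
    cycle_s = response_time_s + think_time_s
    if cycle_s <= 0:
        raise ValueError("response time plus think time must be positive")
    return vusers / cycle_s


# Hypothetical example: 50 VUsers, 0.25 s mean response time, no think time.
print(closed_loop_throughput(50, 0.25))  # 200.0
```

The same relation shows why, once response time starts to climb, adding more VUsers stops raising throughput.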
The scenario ran for 4 hours, adding 50 VUsers every 30 minutes (8 ramp-ups in total). After starting the test, I prefer to monitor it, whereas many testers start the test and then go for a break or leave it running overnight. The first 120 minutes went fine, with hits per second above 250.
In Dynatrace, CPU utilization stayed below 30% and memory utilization below 40%, so all was well. From the third hour onwards, right after a ramp-up, hits per second suddenly dropped below 100. Response time also went up, and CPU and memory utilization rose slightly. No errors showed up while I was monitoring.
After the test, however, I noticed two errors on the run-time screen. I downloaded the raw results and launched HPE Analysis. Here is where it gets interesting:
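The schedule can be written down as a step function, which helps when lining up the Running VUsers graph with other metrics later. A minimal sketch, assuming each ramp-up adds its 50 VUsers instantly at the start of its 30-minute window (the real scenario may ramp up gradually within the window); `running_vusers` is a hypothetical helper, not a LoadRunner API.

```python
def running_vusers(minute: float, step_vusers: int = 50, step_minutes: int = 30, steps: int = 8) -> int:
    """Return how many VUsers are running at a given minute of a stepped scenario.

    Assumes each ramp-up adds step_vusers instantly at minute 0, 30, 60, ...
    and that no VUsers stop until the scenario ends after steps * step_minutes.
    """
    duration_minutes = step_minutes * steps
    if minute < 0 or minute >= duration_minutes:
        return 0
    started_steps = int(minute // step_minutes) + 1
    return started_steps * step_vusers


for m in (0, 29, 30, 120, 180, 239):
    print(m, running_vusers(m))
```

Under this assumption, 250 VUsers are running at the 120-minute mark and 400 during the final step.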
I plotted the following graphs in HPE Analysis:
- Running Vusers Vs Time
- Hits per second Vs Time
- Errors per second Vs Time
- Host Utilization Vs Time
Merge the graphs above into one: take Running VUsers vs. Time as the parent graph, right-click it, choose Merge Graphs, and select the graphs to merge.
Next, right-click the merged graph, choose Display Options, and select Absolute Time. Also set the granularity to 30 seconds (right-click > Set Granularity).
Then open Graph Data, click to retrieve the graph data, and copy all rows into an Excel sheet. I prefer working in Excel because I find the Analysis UI dated and uncomfortable to use.
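If you prefer scripting to Excel, the 30-second granularity and the merge can be reproduced from exported data. A minimal sketch, assuming each exported series is a list of (seconds since test start, value) pairs; the export layout and the function names are hypothetical, and Analysis may aggregate some metrics differently.

```python
from collections import defaultdict
from statistics import mean


def apply_granularity(samples, granularity_s=30):
    """Average (timestamp_s, value) samples into fixed-width time buckets.

    Each sample goes into the bucket starting at
    floor(timestamp / granularity) * granularity.
    Returns a time-sorted list of (bucket_start_s, mean_value).
    """
    buckets = defaultdict(list)
    for timestamp_s, value in samples:
        bucket_start = int(timestamp_s // granularity_s) * granularity_s
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def merge_series(**series):
    """Join several bucketed series on bucket start time, one dict per row."""
    merged = defaultdict(dict)
    for name, points in series.items():
        for start, value in points:
            merged[start][name] = value
    return [dict(time_s=start, **metrics) for start, metrics in sorted(merged.items())]


# Hypothetical hits-per-second samples.
hits = [(0, 240), (10, 260), (35, 250), (59, 270), (61, 100)]
print(merge_series(hits=apply_granularity(hits), vusers=[(0, 50), (30, 50), (60, 50)]))
```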
If you go through the data, you will see that every ramp-up changes all the parameters: hits per second, resource utilization, response time, and so on.
Suppose your SLA objectives are as follows:
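One way to read those step-by-step changes is to look for the first ramp-up after which hits per second stop growing even though more VUsers were added, which is typically where the system begins to saturate. A minimal sketch, assuming you have already averaged hits per second for each ramp-up step; the 5% tolerance and the sample numbers are hypothetical, not results from this test.

```python
def find_throughput_knee(step_throughput, min_gain=0.05):
    """Return the index of the first ramp-up step where throughput stops scaling.

    step_throughput holds the average hits per second of each ramp-up step,
    in order. A step is flagged when it gains less than min_gain (a
    hypothetical 5% tolerance) over the previous step, which includes any
    step where throughput falls. Returns None if every step scales.
    """
    for i in range(1, len(step_throughput)):
        previous, current = step_throughput[i - 1], step_throughput[i]
        if previous > 0 and (current - previous) / previous < min_gain:
            return i
    return None


# Hypothetical per-step averages: growth flattens at step 4, then collapses.
print(find_throughput_knee([120, 200, 240, 260, 265, 90]))  # 4
```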
- 95th percentile response time of 0.5 seconds
- CPU and memory utilization of at most 30%
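To check an interval against objectives like these, you need the 95th percentile of the response times in that interval plus the matching CPU and memory readings. A minimal sketch using the nearest-rank percentile definition (LoadRunner Analysis may compute percentiles slightly differently); the 0.5-second and 30% defaults are example values, and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # pct * n is computed before dividing so integer pct gives an exact rank.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_sla(response_times_s, cpu_pct, memory_pct, rt_limit_s=0.5, util_limit_pct=30.0):
    """Check one interval against a p95 response time limit and a utilization limit."""
    p95 = percentile_nearest_rank(response_times_s, 95)
    return p95 <= rt_limit_s and cpu_pct <= util_limit_pct and memory_pct <= util_limit_pct


print(meets_sla([0.2] * 19 + [2.0], cpu_pct=25, memory_pct=28))  # True
```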
Highlight the rows where the response time reaches 0.1 seconds, 0.25 seconds, 0.5 seconds, and 1 second.
| Time | Response time (s) | CPU (%) | Memory (%) | Hits per second | Running VUsers |
|---|---|---|---|---|---|
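In a long sheet these rows are easy to miss, so a small script can pick out the first interval in which the response time reaches each threshold. A minimal sketch, assuming the merged data is a time-ordered list of rows with hypothetical `time` and `response_time` keys; the thresholds default to the four values listed above.

```python
def first_crossings(rows, thresholds=(0.1, 0.25, 0.5, 1.0)):
    """Map each response time threshold to the first row that reaches it.

    rows is a time-ordered list of dicts with at least 'time' and
    'response_time' keys (a hypothetical layout for the exported sheet).
    Thresholds that are never reached map to None.
    """
    return {
        threshold: next((row for row in rows if row["response_time"] >= threshold), None)
        for threshold in thresholds
    }


# Hypothetical 30-second rows.
rows = [
    {"time": "10:00:00", "response_time": 0.05},
    {"time": "10:00:30", "response_time": 0.12},
    {"time": "10:01:00", "response_time": 0.30},
    {"time": "10:01:30", "response_time": 0.55},
]
for threshold, row in first_crossings(rows).items():
    print(threshold, row["time"] if row else "not reached")
```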
For each of the identified times A to E, filter the results in Analysis to find the number of transactions. As mentioned earlier, there were two errors; correlate their timestamps with the Dynatrace data to identify the root cause. The failures were caused by connection timeout errors.
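Counting transactions between two marked times and lining up error timestamps with another tool's timeline are both simple window filters. A minimal sketch, assuming transaction and error times are available as Python `datetime` values (for example, parsed from an export); the helper names and the 60-second tolerance are hypothetical. It is worth confirming that the load generator and the monitored hosts report times in the same time zone before correlating.

```python
from datetime import datetime, timedelta


def count_in_window(event_times, start, end):
    """Count events whose timestamp falls in the half-open window [start, end)."""
    return sum(1 for ts in event_times if start <= ts < end)


def errors_near(error_times, moment, tolerance_s=60):
    """Return error timestamps within tolerance_s seconds of a moment of interest."""
    window = timedelta(seconds=tolerance_s)
    return [ts for ts in error_times if abs(ts - moment) <= window]


# Hypothetical timestamps.
events = [datetime(2024, 1, 1, 10, 0, s) for s in (0, 10, 20, 30, 40)]
print(count_in_window(events, datetime(2024, 1, 1, 10, 0, 10), datetime(2024, 1, 1, 10, 0, 30)))
```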
Analyzing the results from LoadRunner and Dynatrace shows that the system can handle 286 hits per second without breaching the SLA objective of 0.5 seconds. At this load, system resource utilization stays within normal working limits.
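A capacity figure of this kind is the highest throughput observed while every objective still holds. A minimal sketch, assuming per-interval rows that already carry the 95th percentile response time and the utilization readings; the key names and default limits are hypothetical.

```python
def max_sustainable_hits(intervals, rt_limit_s=0.5, util_limit_pct=30.0):
    """Return the highest hits per second among intervals that meet the SLA.

    Each interval is a dict with 'hits_per_second', 'p95_response_time',
    'cpu' and 'memory' keys (hypothetical names). Returns None when no
    interval meets the SLA.
    """
    compliant = [
        iv["hits_per_second"]
        for iv in intervals
        if iv["p95_response_time"] <= rt_limit_s
        and iv["cpu"] <= util_limit_pct
        and iv["memory"] <= util_limit_pct
    ]
    return max(compliant, default=None)


# Hypothetical intervals: the third breaches the response time and CPU limits.
intervals = [
    {"hits_per_second": 200, "p95_response_time": 0.20, "cpu": 20, "memory": 25},
    {"hits_per_second": 280, "p95_response_time": 0.45, "cpu": 28, "memory": 29},
    {"hits_per_second": 300, "p95_response_time": 0.70, "cpu": 35, "memory": 30},
]
print(max_sustainable_hits(intervals))  # 280
```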
A single test is not enough to conclude stress testing; repeat the test at least twice to confirm that the behavior is consistent. There are multiple approaches you can follow, and what I have described is just one of them. Please share your approach in the comments section.
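Whether repeated runs agree can be judged with a simple spread measure on the breaking point each run found. A minimal sketch using the coefficient of variation; the 10% tolerance and the helper name are hypothetical choices, not a standard threshold.

```python
from statistics import mean, pstdev


def runs_are_consistent(breaking_points, max_cv=0.10):
    """Check whether repeated runs found a similar breaking point.

    breaking_points holds one value per run (for example, the maximum
    SLA-compliant hits per second). Runs count as consistent when the
    coefficient of variation (population std dev / mean) is at most max_cv.
    """
    if len(breaking_points) < 2:
        raise ValueError("need at least two runs to judge consistency")
    average = mean(breaking_points)
    if average == 0:
        return False
    return pstdev(breaking_points) / average <= max_cv


# Hypothetical results from three repeated runs.
print(runs_are_consistent([300, 295, 305]))  # True
```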