Friday, December 16, 2011

Debugging a performance problem


As with any typical Application development, performance is mostly conveniently ignored in all the phases of the development life cycle. In spite of it being a key non functional requirement it mostly remains undocumented. It is more so, as the development, test and UAT environments may not really represent the real world production usage of the application as some of the performance problems could not be spotted earlier. Even if the application is put to load test, there are certain in the production environment, like data growth, user load, etc, which may lead to performance degradation over a period of time.

While most performance problems could easily be spotted and resolved, some could be a challenge and may require sleepless nights to resolve. A structured approach may help addressing such issues within reasonably quicker time frame. Here is a step by step approach which should work in most cases.

1.       Understand the production environment

It is important to understand the production environment thoroughly so as to identify the various hardware & networking resources and the middleware components involved in the application delivery. In a typical n-tiered application, it is possible that there could be multiple appliances and servers through which a requested passes through and get processed before responding back to the user with response. Also understand which of these components are capable of collecting logs / metrics or capable of being monitored in real time.

2.       Understand the specific feedback from the end users

Gather details like who noticed the performance degradation, at what time frame, whether it is repeating at pattern or just pulling the system down. Also understand if the entire application is slowing down or some specific application components are not performing. Also try to experience the problem first hand, sitting alongside an end user or if possible use an appropriate user credentials to experience the performance issue. The ‘who’ also matters as in certain circumstances, the application slow down may be for a user associated with some specific role as the amount of data to be processed and transmitted may differ based on the user role.

3.       Review available logs and metrics

Gather available logs and metrics data collected by various hardware and software components and look for information that could be relevant to the specific application, or more specifically the set of requests that could demonstrate the performance issue. As Logging itself could be performance overkill, it would be ideal to switch off the logs or to set it to collect only minimal logs. If that be the case, configure or effect necessary code change to achieve appropriate level of logging and then try to collect the required details by re-deploying the application on to a production equivalent environment.

4.       Isolate the problem area

This step is very important and could be very challenging too. Take the help of developers and performance and load testing tools, to simulate the problem and in the meanwhile monitor for key measurement data as the request and response pass through various hardware and software components.

By analyzing the data gathered from the application end user or out of the first hand experience, and with the available logs and metrics try to isolate the issue to a specific hardware or software component. This is best done by doing the following step by step:

a.       Trace the request from the UI to the final destination, which typically may be the Database.

b.      If the request could reach the final destination, then measure the time taken for the request to cross various physical and logical layers and look for any information that could cause the slow down. If a hardware resource is over utilized, it could so happen that the requests would be queued up or rejected after a time out. Look for such information in the logs.

c.       Then review the response cycle and try to spot the delays in the return path.

d.      Try the elimination technique whereby, the involved component one after the other from the bottom is cleared of performance bottleneck.

Experience and expertise on the application and the infrastructure architecture could come in handy to spot the problem area quickly. It is possible that there could be multiple problems whether contributing to the problem on hand or not. This situation may lead to shift in focus on different areas resulting in longer time to resolve the problem. It is important to always stay in focus and proceeding in the right direction.

5.       Simulate the problem in Test /UAT environment

Make sure that the findings are correct by simulating the problem multiple times. This will reveal much more data and help characterize the problem better.

6.       Perform reviews

If the problem area has already been isolated in any of the steps above, then narrow the scope of the review to the components involved in the isolated problem area. If not, then the scope of review is little wider and look for problem areas in every component involved in the request response cycle. Code reviews to debug performance issues require unique skills. For instance, looping blocks, disk usage, processor intensive operations could be the candidates for a detailed review. Similarly, in case of distributed application, look for too many back and forth calls to different physical tiers could easily contribute to performance problem. Good knowledge on the various third party components and Operating System APIs consumed in the application may sometimes be helpful.

When the problem is isolated to a server and the application components seem to have no issues, then it might be possible that any other services or components running on the server might cause load on the server resources there by impacting the application being reviewed. If the problem is isolated to Database server, then look for dead locks, appropriate indexes etc. Sometimes, lack of archival / data retention policies could result in the database tables growing in a much faster pace leading to performance degradation.

7.       Identify the root cause

By now one should have identified the specific application procedure or function that could be the cause of the problem on hand. Have it validated by doing more simulations and tests in environments equivalent to production.

8.       Come up with solution

It is just not over yet, as root cause identification should be followed by a solution. Sometimes, the solution to the problem may require change in the architecture and might have a larger impact on the entire application. An ideal solution should prevent the problem from recurring and at the same time it should not introduce newer problems and should require minimal efforts. Alternatively if the ideal solution is not a possibility with various constraints, a break-fix solution should be offered so that the business continues and also plan for having the ideal solution implemented in the longer term.

Hope this one is useful read for those of you in production support. Feel free to share your thoughts on this subject in the form of comments.