Tuesday, October 26, 2010

Robust Root Cause Analysis – Why So Elusive?

By Alain Cohen

Working with many IT organizations as clients, I frequently hear IT personnel state that they would like to become more “proactive” and less “reactive”. This is definitely a worthwhile goal to pursue in order to reduce the incidence of problems in the production environment. However, we have to face the reality that proactive approaches will never be perfect enough to prevent all performance and availability problems from emerging in production. In addition to the numerous bugs or inefficiencies that can escape the QA process, the myriad changes and unpredictable conditions that occur in production can trigger new performance problems for an application that has otherwise been running smoothly. Such changes may include: infrastructure and configuration modifications on the servers or in the network; deployment of new apps that have problematic interactions with each other; and initiation of new processes that interfere with an application’s execution (e.g., a scheduled backup process that consumes significant network resources).

Another fact we must face is that there is often significant business pressure associated with solving performance problems – quickly! It’s not surprising, therefore, that many IT organizations cite root cause analysis (RCA) as their most important objective in deploying application performance management (APM). So why does successfully implementing such a capability remain so elusive for IT organizations? One reason is that reliable root cause analysis requires a sophisticated APM solution with a number of key properties that work in combination. In an earlier post, I wrote about this set of capabilities under the heading of “High Definition APM”. I want to offer a few reasons and examples showing why these capabilities -- breadth, depth, integration, analytics, and low overhead -- are necessary for RCA. I’ll start by discussing what I consider the two most central qualities: breadth and depth.

Depth is probably the most fundamental property for successful RCA. Depth is what provides you with a “smoking gun” – highly specific information that enables you to conclusively identify the cause of a problem. An APM solution with significant depth must provide more than just performance metrics. Forensic data, reflecting the fine-grained timing and behavior of application transactions, is generally the most useful information for RCA.

Deep instrumentation is obviously only useful if it is performed in the parts of the application infrastructure where the problems are actually happening. Since we can’t know in advance where the problems are, a solid RCA capability requires breadth in addition to depth. Breadth simply means that we can cover all the parts of the infrastructure where problems are likely to originate. The more coverage we have, the more likely the deep instrumentation is to pay off.

I’d like to illustrate the role of these APM properties with an example of breadth and depth working in conjunction to deliver RCA. My company, OPNET, recently worked with a large bank in Asia to implement two popular components of our APM solution in order to reduce the mean time to resolution for performance problems that were persisting in their environment. The bank’s IT organization implemented AppResponse Xpert for its End-User-Experience (EUE) and deep packet inspection capabilities, and AppInternals Xpert for back-end server monitoring and troubleshooting. Using the EUE capability, the bank was able to observe every transaction for applications of interest and to isolate specific instances of poorly performing transactions. Using its network-based perspective, AppResponse Xpert was also able to show that the majority of transaction delay was associated with back-end processing, setting the stage for use of AppInternals Xpert. This latter solution was able to trace problematic user-submitted transactions as they entered the application server and passed through the application’s Java business logic, exposing method calls with long execution times. The ability to trace at the code level gave application developers the information they needed to improve the code’s performance and reduce the overall transaction time significantly. The breadth provided by the combination of solutions, spanning network and back-end server, was important first to support triage and then, in a second step, to pinpoint the root cause. In a similar way, deep collection and analytics are valuable to deploy in other parts of the production environment, including the database and client machines. The broader the net you can cast, the more robust your APM capability.
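
For readers who think in code, the two-step workflow in this example (triage across tiers, then drilling into the slow tier) can be sketched in a few lines of Python. This is only an illustration of the idea, not how AppResponse Xpert or AppInternals Xpert work internally. The record layouts, field names, method names, and timings below are all hypothetical and were invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """Hypothetical per-transaction timing as seen from the network (seconds)."""
    name: str
    network_s: float
    backend_s: float


@dataclass
class MethodCall:
    """Hypothetical code-level trace entry for one back-end method call."""
    method: str
    duration_s: float


def triage(tx):
    """Step 1 (breadth): name the tier that accounts for most of the delay."""
    return "back-end" if tx.backend_s > tx.network_s else "network"


def slowest_methods(trace, top_n=1):
    """Step 2 (depth): rank traced method calls by execution time."""
    return sorted(trace, key=lambda call: call.duration_s, reverse=True)[:top_n]


# Made-up data standing in for one poorly performing transaction.
tx = Transaction("transfer_funds", network_s=0.4, backend_s=3.1)
trace = [
    MethodCall("AccountService.validate", 0.2),
    MethodCall("LedgerDao.lookupHistory", 2.6),
    MethodCall("AuditLog.write", 0.3),
]

tier = triage(tx)
# Only drill into the code-level trace when triage points at the back end.
suspects = slowest_methods(trace) if tier == "back-end" else []
print(tier, [call.method for call in suspects])
```

The point of the sketch is the ordering: the cheap, broad measurement decides where to look, and the deep trace is consulted only for the tier that triage implicates.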