Thursday, January 20, 2011

From the Trenches: It's not just one thing...

By Doug

I recently had the pleasure of working with a major news organization to help resolve some serious problems with an important in-house application. They shared the usual complaint with me: transactions were running very slowly, and they had no visibility into where "the problem" was or why "it" occurred.

The usual steps of checking PerfMon (it was a Windows-based application), event logs, and network throughput had provided no actionable insights over many weeks, which was very frustrating for everyone involved. They even tried the “reboot the entire thing and hope for the best” approach, to no avail. And when they did that, hundreds of people at the company had to wait while the system bounced. A complete waste of time and resources.


Compounding the problem was the application’s dependency on various web services to complete key transactions. These added more variables to an already confusing situation. The IT department was losing patience and credibility every day. They desperately needed to find “the problem.”

I believe it was Peter Drucker who famously wrote, “What gets measured gets managed.” But it’s pretty difficult to measure every part of an application, given how complicated modern applications are. PerfMon, traditional network probes, and event management systems don’t provide sufficient granularity for the tricky troubleshooting exercises I often get pulled into. However, monitoring every web service, metric, method, class, and memory allocation is not practical either – at least not with primitive tools.

Fortunately, I brought OPNET AppInternals Xpert to the party. It dramatically speeds up troubleshooting by simultaneously looking at thousands of metrics, methods, classes, and memory allocations in production, and then automatically identifying patterns that point to the most important problems. By instrumenting each application component, I was able to effectively turn the lights on in what had been a dark room.

As we find with many applications when using the APM suite from OPNET, this application didn’t suffer from a single root cause. Instead, it was plagued by a series of problems, each of which had a deleterious impact on performance. I pinpointed the worst-offending .NET classes, those with the worst response times. I isolated more than one production system with capacity bottlenecks. I also identified problematic SQL queries that had not been written with performance in mind.

Armed with several actionable insights from AppInternals Xpert, the IT team had productive discussions with their developers about improving the application’s performance. They also left with a clear method for measuring the application components on an ongoing basis.
