Friday, March 4, 2011

From the trenches: “The network is painfully slow.” Really!?!

By Archana

The employees of a department at a certain government agency I met with recently were extremely frustrated about “the slow network.” After years of conditioning, they had become accustomed to key applications hanging and responding slowly without warning or explanation from the IT department. As at many enterprises I visit that still use primitive tools to monitor and care for their applications, the IT department had become nearly dysfunctional due to endless finger-pointing between the application and network teams. The problem of the day was that transferring very large, confidential files from DC to Missouri took several hours to complete.

The application team was responsible for the transfers during non-business hours and they were burning the midnight oil each time due to the painfully slow process. Naturally, the frustration boiled over and everyone conveniently concluded that the network was slow and not doing its job properly. I visited them one morning, and by taking a different approach, we were able to solve the problem in time for lunch.

Before my visit, the application team had checked the vital stats of the servers during the file transfer operation and made sure that system performance metrics were normal. The network team had checked router configurations and looked for packet drops, duplex mismatches, congestion on links, etc. They had even taken packet captures using a free tool and gone through the daunting task of analyzing them line by line. For weeks no one could figure out the solution, and the classic finger-pointing battle continued.

The easiest way to avoid situations like this is to implement a solid application performance management (APM) process. It starts with monitoring end user experience, follows with triage techniques to isolate the problem to a particular IT domain, and then diagnoses the root cause using the right subset of forensic data. Unfortunately, this particular division didn’t have a good APM process, even though another division within the same agency did. Different teams in this division were using separate tools that didn’t speak the same language and that were either too high level to find root causes of problems or so low level that they were hard to use.

My mission was to educate them on best practices. My challenges were:
  1. The right kind of end user experience monitoring was not in place;
  2. I only had a few hours to spend on-site that morning;
  3. I only brought a single, software-based tool with me.
I used OPNET’s AppTransaction Xpert to attack the problem. It automatically determines the components of delay making up the overall response time of individual user transactions and presents them in a nice pie chart. Ideally, an AppResponse Xpert appliance is already installed in the data center to monitor all applications 24x7, do triage, and store forensic information so AppTransaction Xpert can do root cause analysis on already-reported problems. Absent access to this kind of appliance, I used AppTransaction Xpert to automate fresh transaction captures at network end points from my workstation.

With a few clicks, I was staring at the answer -- a system configuration issue that is often overlooked. AppTransaction Xpert automatically performed the analysis and graphically isolated the problem. The TCP receive window was set to the "default" configuration on the server in Missouri, which was too small for a long-distance bulk file transfer like this one. I used the QuickPredict feature to determine and demonstrate the effect of a larger window size, which the application team tested shortly after I shared my report with them.

The larger window provided a 50% improvement in response time! I did this entire analysis in less than 2 hours (!!) to the astonishment of the IT staff.
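
To see why the receive window matters so much on a long-distance transfer, here is a minimal Python sketch of the standard rule of thumb: a single TCP connection cannot move more than one receive window of data per round trip. The round-trip time, link speed, file size, and window sizes below are illustrative assumptions only; the actual values from this engagement are not stated in this post, and the function names are hypothetical.

```python
def max_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on single-connection TCP throughput (bits per second):
    at most one receive window can be in flight per round trip."""
    if window_bytes <= 0 or rtt_seconds <= 0:
        raise ValueError("window and RTT must be positive")
    return window_bytes * 8 / rtt_seconds


def bandwidth_delay_product_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Window size (bytes) needed to keep a link of the given speed full."""
    return link_bps * rtt_seconds / 8


def transfer_time_seconds(file_bytes: float, window_bytes: float,
                          rtt_seconds: float, link_bps: float) -> float:
    """Idealized transfer time: limited by the window or the link, whichever
    is slower. Ignores slow start, packet loss, and protocol overhead."""
    rate = min(max_throughput_bps(window_bytes, rtt_seconds), link_bps)
    return file_bytes * 8 / rate


if __name__ == "__main__":
    # All values below are assumptions for illustration, not measurements.
    rtt = 0.030          # assumed 30 ms round trip
    link = 100e6         # assumed 100 Mbps WAN link
    file_size = 10e9     # assumed 10 GB file
    small = 65535        # largest window possible without TCP window scaling
    large = 2 * small    # a doubled window

    print(f"Window needed to fill the link: "
          f"{bandwidth_delay_product_bytes(link, rtt):,.0f} bytes")
    for window in (small, large):
        rate_mbps = max_throughput_bps(window, rtt) / 1e6
        minutes = transfer_time_seconds(file_size, window, rtt, link) / 60
        print(f"window={window:>7,} bytes -> at most {rate_mbps:5.1f} Mbps, "
              f"about {minutes:5.1f} min for the file")
```

Under these assumed numbers the small window caps throughput far below the link speed, no matter how healthy the routers and servers look. As long as the transfer stays window-limited, doubling the window roughly halves the transfer time, which is the kind of effect a what-if prediction is meant to show before anyone changes the server configuration.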

If they had had AppResponse Xpert monitoring 24x7 and automatically recording forensic information, the same analysis would have taken closer to 20 minutes by leveraging the one-click data harvesting workflow with AppTransaction Xpert. Either way, 2 hours or 20 minutes, this particular IT team is ready to start working collaboratively again. If they choose to work past dinner time in the future, it should be for something more important than finger-pointing parties and file transfer baby-sitting duty!

1 comment:

Akshay Seshadrinathan said...

Great post Archana!

I wanted to let readers know that they can tune in to a free web briefing on April 21st to learn more - Quickly pinpoint end-user experience issues and resolve them using OPNET's APM solutions. Includes a live demo!