Wednesday, September 12, 2012

Defending the Network from Application Performance Problems (part III)

by Terry Slattery

In my prior blog posts, I wrote about different network problems and non-networking problems that impact application performance. In this post, I’ll tie everything together by describing how to determine if the network is at fault and how to get the other organizations to understand more about application performance.

Application Performance Management (APM) is the science of understanding all the factors that impact application performance. As I discussed in the earlier posts, these include network factors such as errors (link errors or duplex mismatch), congestion (which causes discards), and high network latency (due to network topology or slow firewalls). Server problems include high CPU utilization, high memory utilization, and high I/O loads, and underpowered client workstations can be a bottleneck too. Increased latency between components in a multi-tier application could be causing the slowdown. Finally, the growth of the application database could be the limiting factor in application performance. That is nine factors and I’m sure that there are more.

The presence of more than one factor typically results in extreme application slowness. Individually, each factor may look marginally OK, but in combination, the effect is compounded. The complex interaction between performance factors makes it challenging to determine the causes of poor application performance.

Application Performance Management

For every variable in an application, a valid test methodology has to be developed, the test executed, the data collected, and analysis performed. If the test is conclusive, then you will have either found a problem or ruled out a variable. This process can be as short as a few hours or as long as a few weeks, depending on the tools that are available and the type of analysis that must be performed.

For basic network statistics, it may be possible to use a packet capture tool to look at a few transactions to determine if there are really obvious network problems. TCP retransmissions indicate packet loss somewhere. Duplicate ACKs indicate that the receiver is seeing out-of-order or duplicate segments, which can follow packet loss or happen when too much buffering somewhere delays packets long enough to trigger unnecessary retransmissions. But what if it isn’t that simple?
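As a rough illustration of what such a check looks for, here is a minimal Python sketch. It assumes the capture has already been reduced to two simple lists for one TCP connection: the (sequence number, length) of each data segment from the sender, and the ACK numbers returned by the receiver. The function names and the sample values are hypothetical, and real analyzers use more careful heuristics (window updates, SACK, keep-alives) than this.

```python
def count_retransmissions(data_segments):
    """Count data segments whose (seq, length) pair was already sent.

    data_segments: list of (seq, length) tuples in capture order.
    """
    seen = set()
    retransmissions = 0
    for seq, length in data_segments:
        if (seq, length) in seen:
            retransmissions += 1
        else:
            seen.add((seq, length))
    return retransmissions


def count_duplicate_acks(ack_numbers):
    """Count ACKs that repeat the immediately preceding ACK number.

    ack_numbers: list of ACK numbers in capture order.
    """
    duplicates = 0
    previous = None
    for ack in ack_numbers:
        if ack == previous:
            duplicates += 1
        previous = ack
    return duplicates


# Made-up sample data: the segment starting at 1500 is sent twice,
# and the receiver repeats ACK 1500 while it waits for that segment.
data = [(1000, 500), (1500, 500), (2000, 500), (1500, 500)]
acks = [1500, 1500, 1500, 1500, 2500]

retransmitted = count_retransmissions(data)
duplicate_acks = count_duplicate_acks(acks)
print(retransmitted, duplicate_acks)
```

A non-zero count from either function is only a hint about where to look next, not a diagnosis.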

We frequently see applications where it isn’t a clear networking problem. While we could use packet capture tools to eventually diagnose such problems, it would take a long, long time. That’s where APM is valuable.

Simply collecting and sorting through the volume of data necessary to properly analyze an application is difficult for humans and is easy for APM. The APM system will look for obvious factors like the timing of requests and replies as they transit the network. It will also look for network factors, such as packet loss, jitter, and latency. One of the cool things it will do is to look at the number and size of the packets that transit between clients and servers or between servers. Is a lot of data being moved? What is the direction of the big data flows? Is an application “chatty”, sending a lot of small packets? A chatty application will work well in a LAN environment where latency is small but won’t work well on a WAN connection where latency becomes significant, because every request/reply exchange pays the round-trip time. This is why an application that is designed to work in the corporate environment often fails when its deployment is expanded to remote offices.
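The effect of chattiness can be estimated with simple arithmetic. The sketch below assumes a transaction's time is roughly the number of sequential round trips multiplied by the round-trip time, plus the time to push the payload through the link. It ignores TCP slow start, server think time, and queuing, and every number in it (500 round trips, 100 KB, the LAN and WAN figures) is an assumption chosen only for illustration.

```python
def transaction_time(round_trips, rtt_seconds, payload_bytes, bandwidth_bps):
    """Rough transaction time: latency cost plus serialization cost."""
    latency_cost = round_trips * rtt_seconds
    transfer_cost = payload_bytes * 8 / bandwidth_bps
    return latency_cost + transfer_cost


# Hypothetical chatty transaction: 500 sequential request/reply
# exchanges moving 100 KB in total.
ROUND_TRIPS = 500
PAYLOAD_BYTES = 100_000

# Assumed LAN: 0.5 ms round trip, 1 Gbps.
lan_seconds = transaction_time(ROUND_TRIPS, 0.0005, PAYLOAD_BYTES, 1_000_000_000)

# Assumed WAN: 80 ms round trip, 10 Mbps.
wan_seconds = transaction_time(ROUND_TRIPS, 0.080, PAYLOAD_BYTES, 10_000_000)

print(f"LAN: {lan_seconds:.2f} s, WAN: {wan_seconds:.2f} s")
```

With these assumed numbers the same transaction takes about a quarter of a second on the LAN and about 40 seconds on the WAN, and almost all of the WAN time is round-trip latency rather than bandwidth.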

APM In The Data Center

One of the fundamental requirements of a good APM deployment is to give it visibility into all parts of an application. A multi-tiered application will need to have each tier’s network communications visible to the APM system. In the deployments we’ve done, spanning server VLANs to the APM server best provides the connectivity. The server seldom has enough interfaces to handle all the links that are needed, so we use a SPAN port aggregation system to provide the necessary connectivity. Examples include Gigamon and Anue. The advantage of using these devices in the data center is that you can connect a bunch of network connections to data center switches and filter the data before it is sent to the APM server. This reduces the load on the APM server so that it is more responsive. The tradeoff is that filtering too aggressively may leave too little data for post-mortem analysis of unexpected problems. Another advantage is that the key switches in the data center can be pre-configured with SPAN ports and links to the SPAN port aggregation system, allowing the IT organization to quickly begin analyzing a potential problem without having to run cables. This is a big advantage in businesses where the applications are a key component in revenue generation.

Once the raw data is being collected, it is possible for the APM system to create an Application Dependency Map that shows which systems in the application architecture are talking to and depend on other systems. Understanding the dependency map is critical to understanding the application. I’ve seen examples of applications where the IT team thought the application was three tiers. When they used the APM dependency mapping technology, they found that the application had evolved to incorporate several additional tiers, each of which introduced additional processing latency. The result was a slow application.
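To make the idea concrete, here is a minimal Python sketch of how a dependency map could be derived from observed flows. It is not how any particular APM product works. It assumes the flows have already been reduced to (client, server) pairs, and the host names are hypothetical.

```python
from collections import defaultdict


def build_dependency_map(flows):
    """Map each host to the sorted list of hosts it initiates connections to.

    flows: iterable of (client, server) pairs observed on the network.
    """
    deps = defaultdict(set)
    for client, server in flows:
        deps[client].add(server)
    return {host: sorted(servers) for host, servers in deps.items()}


def tier_depth(dep_map, start):
    """Number of hosts on the longest dependency chain beginning at start."""
    def depth(host, visiting):
        if host in visiting:
            return 0  # guard against dependency cycles
        downstream = dep_map.get(host, [])
        if not downstream:
            return 1
        return 1 + max(depth(d, visiting | {host}) for d in downstream)

    return depth(start, frozenset())


# Hypothetical flows for an application believed to be web -> app -> db.
observed_flows = [
    ("web", "app"),
    ("app", "db"),
    ("app", "auth"),
    ("auth", "ldap"),
    ("web", "app"),  # repeated flows collapse into one dependency
]

dependency_map = build_dependency_map(observed_flows)
depth_from_web = tier_depth(dependency_map, "web")
print(dependency_map)
print(depth_from_web)
```

In this made-up data the longest chain is web, app, auth, ldap, so the application is four tiers deep even though the team expected three.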

Stop Blaming Each Other; Find the Root Cause

In the deployments that we’ve done, we find that after the APM system is used a few times to diagnose a slow application, the application and server teams start to appreciate the capability and begin asking for assistance in determining why an application is slow. The big advantage of the cooperation is that it creates a less adversarial communications channel between the networking and applications teams. Historically, the network team gets the blame for slow applications and must work to demonstrate where the problem lies. We’ve often joked at conferences that the real metric in IT systems is “Mean Time to Innocence”, not Mean Time to Repair. A difficult problem may get tossed back and forth between the network team and the application team as they each work on it from their own perspective. The network team is often forced to become familiar with an application to be able to diagnose problems with it. That’s very inefficient. With APM tools, the groups focus on improving the overall application performance, not blaming each other for something that they rarely have data to substantiate. When members of both the systems group and networking group start using APM tools together, they create a more cooperative work environment in which the causes of problems are more quickly identified and resolved. And that’s where APM tools deliver their return on investment.
