In my prior blog posts, I wrote about different network
problems and non-networking
problems that impact application performance. In this post, I’ll tie
everything together by describing how to determine whether the network is at fault
and how to help the other IT organizations better understand application
performance.
Application Performance Management (APM) is the science of
understanding all the factors that impact application performance. As I
discussed in the earlier posts, the cause could be network factors such as errors
(link errors or a duplex mismatch), congestion (which causes discards), or high
network latency (due to network topology or slow firewalls). Server problems
include high
CPU utilization, high server memory utilization, and heavy I/O loads, and underpowered client workstations slow things down as well. Increased latency between components in a multi-tier application could be causing the slowdown. Finally, the growth of the application database could be the limiting factor in application performance. That is nine factors, and I’m sure that there are more.
The presence of more than one factor typically results in extreme application slowness. Individually, each factor may look marginally acceptable,
but in combination, the effects compound. The complex interaction between
performance factors makes it challenging to determine the causes of poor
application performance.
Application Performance Management
For every variable in an application, a valid test
methodology has to be developed, the test executed, the data collected, and
the analysis performed. If the test is conclusive, then you will have either found
a problem or ruled out a variable. This process can be as short as a few hours
or as long as a few weeks, depending on the tools that are available and the
type of analysis that must be performed.
For basic network statistics, it may be possible to use a
packet capture tool to look at a few transactions and determine whether there are really
obvious network problems. TCP retransmissions indicate packet loss somewhere
along the path. Duplicate ACKs indicate that the receiver is seeing out-of-order
or missing segments, which usually means packets are being lost or reordered
somewhere between the client and the server. But what if it isn’t that simple?
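The kind of quick check a packet capture tool performs can be sketched in a few lines. This is a minimal illustration, not a real capture parser: the packet records are hypothetical (seq, ack, payload_len) tuples, and the capture is assumed to be taken at the sender, so a lost segment appears once and then again as a retransmission.

```python
# Minimal sketch: spotting retransmissions and duplicate ACKs in a
# simplified TCP trace. Records are hypothetical (seq, ack, payload_len)
# tuples, not a real capture format.

def analyze_trace(packets):
    """Count retransmitted data segments and duplicate ACKs."""
    seen_seqs = set()
    retransmissions = 0
    dup_acks = 0
    last_ack = None
    for seq, ack, payload_len in packets:
        if payload_len > 0:
            if seq in seen_seqs:
                retransmissions += 1  # same data segment sent again
            seen_seqs.add(seq)
        else:
            # A pure ACK repeating the previous ACK number is a duplicate ACK.
            if ack == last_ack:
                dup_acks += 1
            last_ack = ack
    return retransmissions, dup_acks

# Example: the segment at seq=2000 is lost in transit, the receiver keeps
# ACKing 2000 (duplicate ACKs), and the sender finally retransmits it.
trace = [
    (1000, 0, 1000),
    (2000, 0, 1000),   # lost after the capture point
    (3000, 0, 1000),
    (0, 2000, 0),      # receiver ACKs 2000, still waiting for that segment
    (0, 2000, 0),      # duplicate ACK
    (0, 2000, 0),      # duplicate ACK
    (2000, 0, 1000),   # retransmission
]

print(analyze_trace(trace))  # → (1, 2): one retransmission, two duplicate ACKs
```

Real tools apply far more careful heuristics (they track both directions, window sizes, and SACK blocks), but the principle is the same: count the symptoms, then reason about the cause.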
We frequently see applications where the cause isn’t a clear
networking problem. While we could use packet capture tools to eventually
diagnose such problems, it would take a
long, long time. That’s where APM is
valuable.
Simply collecting and sorting through the volume of data
necessary to properly analyze an application is difficult for humans but
easy for an APM system. The APM system will look for obvious factors like the timing of
requests and replies as they transit the network. It will also look for network
factors, such as packet loss, jitter, and latency. One of the cool things it
will do is to look at the number and size of the packets that transit between
clients and servers or between servers. Is a lot of data being moved? What is
the direction of the big data flows? Is an application “chatty”, sending a lot
of small packets? A chatty application will work well in a LAN environment
where latency is small but won’t work well on a WAN connection where latency
becomes significant. This is why an application that is designed to work in the
corporate environment often fails when its deployment is expanded to remote
offices.
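A back-of-the-envelope calculation shows why chattiness matters so much. The request counts and latencies below are illustrative assumptions, not measurements from any particular application.

```python
# Why chatty applications suffer on a WAN: every request/reply exchange
# pays one round-trip time (RTT). The numbers below are assumptions
# chosen for illustration.

def response_time_ms(round_trips, rtt_ms, server_time_ms=50):
    """Rough transaction time: one RTT per exchange plus server work."""
    return round_trips * rtt_ms + server_time_ms

LAN_RTT = 1     # ~1 ms on a local network
WAN_RTT = 60    # ~60 ms to a distant office, for example

chatty = 500    # 500 small request/reply exchanges per transaction
bulk = 5        # 5 larger exchanges moving the same data

print(response_time_ms(chatty, LAN_RTT))  # 550 ms — fine on the LAN
print(response_time_ms(chatty, WAN_RTT))  # 30050 ms — 30 seconds on the WAN
print(response_time_ms(bulk, WAN_RTT))    # 350 ms — few round trips hide latency
```

The same application goes from half a second to half a minute purely because of latency, which is exactly what remote-office users experience when a LAN-designed application is deployed over a WAN.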
APM In The Data Center
One of the fundamental requirements of a good APM deployment
is giving it visibility into all parts of an application. A multi-tiered
application needs each tier’s network communications visible to the
APM system. In the deployments we’ve done, spanning the server VLANs to the APM
server provides the best connectivity. The server seldom has enough interfaces
to handle all the links that are needed, so we use a Span Port Aggregation
system to provide the necessary connectivity. Examples include Gigamon and
Anue. The advantage of using these devices in the data center is that you can
connect a bunch of network connections to data center switches and filter the
data before it is sent to the APM server. This reduces the load on the APM
server so that it is more responsive. The tradeoff is having enough data to do
post-mortem analysis of unexpected problems. Another advantage is that the key
switches in the data center can be pre-configured with SPAN ports and links to
the span port aggregation system, allowing the IT organization to quickly begin
analyzing a potential problem without having to run cables. This is a big
advantage in businesses where the applications are a key component in revenue
generation.
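The filtering that an aggregation device performs can be illustrated with a simple sketch: forward only traffic that involves the monitored server VLANs to the APM system and drop everything else. The subnet values here are hypothetical examples, not a recommended addressing plan.

```python
# Sketch of the kind of filtering a span-port aggregation device applies:
# only traffic touching the monitored server subnets is forwarded to the
# APM system, reducing its load. Subnets are hypothetical examples.
import ipaddress

SERVER_VLANS = [
    ipaddress.ip_network("10.10.20.0/24"),  # assumed web-tier VLAN
    ipaddress.ip_network("10.10.30.0/24"),  # assumed app-tier VLAN
]

def forward_to_apm(src, dst):
    """Keep a packet only if either endpoint is in a monitored server VLAN."""
    src_ip, dst_ip = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    return any(src_ip in net or dst_ip in net for net in SERVER_VLANS)

print(forward_to_apm("10.10.20.5", "192.168.1.10"))    # True — server traffic
print(forward_to_apm("192.168.1.10", "192.168.1.20"))  # False — filtered out
```

In practice the aggregation hardware does this in silicon on VLAN tags, ports, and addresses, but the tradeoff is the same one described above: the more aggressively you filter, the less data you have for post-mortem analysis.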
Once the raw data is being collected, it is possible for the
APM system to create an Application Dependency Map that shows which systems in
the application architecture are talking to and depend on other systems.
Understanding the dependency map is critical to understanding the application.
I’ve seen examples of applications where the IT team thought the application
was three tiers. When they used the APM dependency mapping technology, they
found that the application had evolved to incorporate several additional tiers,
each of which introduced additional processing latency. The result was a slow
application.
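At its core, dependency mapping is just aggregation over observed flows: record who talks to whom, on which ports, and the extra tiers reveal themselves. The flow records below are hypothetical examples, not data from a real capture.

```python
# Minimal sketch of building an application dependency map from observed
# flows. The (client_ip, server_ip, server_port) records are hypothetical.
from collections import defaultdict

flows = [
    ("10.1.1.5",    "10.10.20.10", 443),   # user -> web tier
    ("10.10.20.10", "10.10.30.4",  8080),  # web -> app tier
    ("10.10.30.4",  "10.10.40.2",  1433),  # app -> database
    ("10.10.30.4",  "10.10.50.9",  389),   # app -> directory (a surprise tier)
]

def dependency_map(flows):
    """Map each system to the set of (server, port) services it depends on."""
    deps = defaultdict(set)
    for client, server, port in flows:
        deps[client].add((server, port))
    return deps

deps = dependency_map(flows)
for system in sorted(deps):
    print(system, "->", sorted(deps[system]))
```

Even this toy version would have exposed the "surprise" directory dependency: the app tier depends on two back-end services, not one, and each extra hop adds processing latency to every transaction.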
Stop Blaming Each Other; Find the Root Cause
In the deployments that we’ve done, we find that after the
APM system is used a few times to diagnose a slow application, the application
and server teams start to appreciate the capability and begin asking for
assistance in determining why an application is slow. The big advantage of the
cooperation is that it creates a less adversarial communications channel between
the networking and applications teams. Historically, the network team gets the
blame for slow applications and must work to demonstrate where the problem
lies. We’ve often joked at conferences that the real metric in IT systems is
“Mean Time to Innocence”, not Mean Time to Repair. A difficult problem may get
tossed back and forth between the network team and the application team as they
each work on it from their own perspective. The network team is often forced to
become familiar with an application to be able to diagnose problems with it.
That’s very inefficient. With APM tools, the groups focus on improving the overall
application performance, not blaming each other for something that they rarely
have data to substantiate. When members of both the systems group and networking group start using APM tools together, they create a more cooperative work environment
in which the causes of problems are more quickly identified and resolved. And
that’s where the return on investment of APM tools generates its payback.