By Terry Slattery
Welcome to my first blog post on apmmatters.com. I am a consultant
at Chesapeake Netcraftsmen and I’ve been writing blogs for some time at
Netcraftsmen about topics related to network operations and network management. For this article, I’ll focus on the network
problems that impact applications. These are problems that are relatively
common, but that few people running networks seem to acknowledge as having a
significant impact on applications.
I want to start by
taking a look at basic application performance and causes of slow performance. Measuring application performance as users experience it is a fundamental APM best practice, and goes beyond monitoring network performance metrics. Assuming this is done correctly, and we verify a true end-user performance issue, how does the support team determine the root cause? Let’s assume a modern, multi-tier application that includes an application user interface server, a database server, a SAN for data storage, VMotion to move the server images among several possible server hardware systems, multiple network interfaces, a multi-tier network infrastructure, and dependencies on other services like WINS or DNS for server name resolution.
It is often difficult to know where all the components are
and which components are talking with which other components. For this reason,
application dependency mapping is also a fundamental component of APM. The SAN team may move the disk image from one
storage system to another. There may be network contention at critical times on
an important network interface. A duplex mismatch or an incorrect network
teaming configuration may exist at the server’s connection to the network. Or
the database queries made by the application server may be inefficient, causing
large delays for some operations. A server configuration that references the
address of a decommissioned DNS or WINS server may cause application slowness
whenever the server attempts to use the decommissioned name server.
Network Problems That Affect Applications
Unfortunately, the IT and server teams rarely have the tools
that allow them to easily determine what component of a complex application is
not working correctly. There could indeed be network problems. I find that a
lot of IT staff think that 1% packet loss is a small number and should not
impact network traffic. So they ignore common sources of packet loss, thinking
that the applications using that path won’t be adversely affected.
Unfortunately, a very small amount of packet loss will have a big impact on TCP
throughput, which in turn will affect the applications that depend on TCP. I
recommend investigating any interface that has more than 0.0001% packet loss.
The chart below shows how packet loss limits TCP throughput on a 1Gbps link,
starting from 0.0001% packet loss on the left. The other significant impact on
throughput is the round trip time of the connection, which I’ve plotted as
three separate curves.
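The post does not say which model produced the chart. One widely cited approximation for loss-limited TCP throughput is the Mathis et al. formula, throughput <= (MSS / RTT) * (C / sqrt(p)), and the sketch below uses it as an assumption. The 1460-byte MSS, the constant C = sqrt(3/2), and the three round trip times are illustrative choices, not values read from the chart.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate ceiling on one TCP flow's throughput, in bits per second.

    Uses the Mathis et al. approximation (an assumption, not the post's model):
        throughput <= (MSS / RTT) * (C / sqrt(p)),  with C = sqrt(3/2)
    loss_rate is a fraction (0.01 means 1% loss).
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


LINK_BPS = 1_000_000_000  # 1Gbps link, as in the post
MSS_BYTES = 1460          # assumed: typical Ethernet MSS

for rtt_ms in (1, 10, 100):                  # assumed RTTs for the three curves
    for loss_pct in (0.0001, 0.01, 1.0):     # loss as a percentage
        bound = mathis_throughput_bps(MSS_BYTES, rtt_ms / 1000, loss_pct / 100)
        usable = min(bound, LINK_BPS)        # a flow cannot exceed the link speed
        print(f"RTT {rtt_ms:>3} ms, loss {loss_pct:>6}%: {usable / 1e6:10.2f} Mbps")
```

Under these assumptions, 1% loss with a 100 ms round trip caps a single TCP flow at roughly 1.4 Mbps, no matter how fast the link is.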
Error Loss
Duplex mismatch is the source of packet loss that I most
frequently encounter. Many organizations still hard-code speed and duplex
settings because they were burned by problems back when the standards were new
and devices did not correctly auto-negotiate duplex settings. A duplex mismatch
will work for low traffic volumes, but the packet loss increases significantly
as the volume increases. These errors are easy to spot because the interfaces
will show high FCS errors and runts on the full-duplex interface and late
collisions on the half-duplex interface.
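As a rough sketch, the counter signature described above can be turned into a simple check. The `InterfaceCounters` record, its field names, and the `min_errors` threshold are all hypothetical; in practice the values would come from whatever your switches and servers expose through SNMP or the CLI.

```python
from dataclasses import dataclass


@dataclass
class InterfaceCounters:
    """Hypothetical snapshot of one interface's error counters."""
    name: str
    duplex: str          # "full" or "half", as configured or negotiated
    fcs_errors: int
    runts: int
    late_collisions: int


def duplex_mismatch_hint(c: InterfaceCounters, min_errors: int = 100) -> bool:
    """Return True if the counters match the duplex-mismatch pattern.

    min_errors is an assumed threshold for "high"; tune it to your polling
    interval and traffic volume.
    """
    if c.duplex == "full":
        # Full-duplex side: FCS errors and runts pile up.
        return c.fcs_errors >= min_errors and c.runts >= min_errors
    if c.duplex == "half":
        # Half-duplex side: late collisions pile up.
        return c.late_collisions >= min_errors
    return False


# Example values are invented for illustration.
switch_port = InterfaceCounters("Gi1/0/1", "full", fcs_errors=1500, runts=300, late_collisions=0)
server_nic = InterfaceCounters("eth0", "half", fcs_errors=0, runts=0, late_collisions=900)

for iface in (switch_port, server_nic):
    if duplex_mismatch_hint(iface):
        print(f"{iface.name}: counters suggest a duplex mismatch, check both ends")
```

A hit on either end is only a hint. Confirming it means comparing the speed and duplex settings on both sides of the link.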
I’ve also seen bad optical patch cables cause error loss.
Alcohol swabs should be used on connections to remove dust and dirt from the
ends of cables. Optical cable inspection microscopes should be used on
questionable cables before putting them into use. Remember to practice safe
optical networking and make sure that there is no laser present when you check
a connector.
Note that UDP has no built-in flow control or congestion control and will
continue to send packets at whatever rate the application sends them. In many
cases, this makes the problem worse because the extra packets add to network
congestion and packet loss.
Congestion Loss
Interface congestion is another significant source of packet loss. Congestion is typically caused by multiple high-speed interfaces that are trying to send data over one egress interface. The egress interface may run with little congestion during off-peak hours, reducing the daily average packet loss to a percentage that makes it look like it isn’t a significant problem. However, looking at the statistics during peak hours shows packet loss that affects the applications.
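To make the averaging effect concrete, here is a small sketch using invented hourly counters for one egress interface. The sample data and function names are hypothetical; the 0.0001% threshold is the investigation level recommended earlier in this post.

```python
LOSS_THRESHOLD_PCT = 0.0001  # investigation threshold recommended in the post


def loss_pct(sent: int, dropped: int) -> float:
    """Packet loss as a percentage of packets sent."""
    return 100.0 * dropped / sent if sent else 0.0


def daily_loss_pct(samples):
    """Loss over the whole day, computed from the summed counters."""
    total_sent = sum(sent for _, sent, _ in samples)
    total_dropped = sum(dropped for _, _, dropped in samples)
    return loss_pct(total_sent, total_dropped)


def hours_over_threshold(samples, threshold_pct=LOSS_THRESHOLD_PCT):
    """Return (hour, loss_pct) for every hour above the threshold."""
    return [
        (hour, loss_pct(sent, dropped))
        for hour, sent, dropped in samples
        if loss_pct(sent, dropped) > threshold_pct
    ]


# Invented data: (hour, packets_sent, packets_dropped).
# Quiet all day except two busy hours that each drop 1% of their packets.
samples = [(hour, 1_000_000, 0) for hour in range(24)]
for peak_hour in (10, 15):
    samples[peak_hour] = (peak_hour, 5_000_000, 50_000)

print(f"daily average loss: {daily_loss_pct(samples):.4f}%")
for hour, pct in hours_over_threshold(samples):
    print(f"hour {hour:02d}: {pct:.4f}% loss")
```

With this data the daily figure is about a third of what users see during the two busy hours, which is why per-interval statistics are worth checking.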
Another source of packet loss is congestion within the network hardware
itself. In a recent consulting engagement, I found a set of servers with 1Gbps
NICs that were clustered on consecutive ports of one switch interface card. The
blade was reasonably old and the server traffic was congesting the ASIC that
serviced that set of ports. The result was 0.1% average packet loss with much
higher peaks. The applications on these servers were TCP-based, so this was a
key source of performance problems for every application running on those
servers. The solution was either to upgrade the switch interface card or to
distribute the servers to other ports on the blade so that ASIC congestion
didn’t occur.
Latency
Of course, excessive latency can also have a big impact on
application performance. Latency typically becomes a factor in poorly written
applications that perform many back-and-forth operations. An application that
requires 100 round trips to query a database for the data that it needs would
work fine in the development environment where the round trip latency would be
a few milliseconds. However, when the application is deployed over an MPLS WAN
with 100ms round trip latency, the same function would require 10,000ms (10
seconds) to execute. If several of these actions need to be performed in
sequence, then we see a poorly performing application.
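The arithmetic is simple enough to sketch. The function name is made up for illustration, and the 2 ms development latency is an assumed stand-in for "a few milliseconds". The estimate ignores server processing time and any overlap between requests.

```python
def chatty_operation_ms(round_trips: int, rtt_ms: float) -> float:
    """Time spent waiting on the network for strictly sequential round trips."""
    return round_trips * rtt_ms


ROUND_TRIPS = 100  # from the example in the post

dev_ms = chatty_operation_ms(ROUND_TRIPS, 2)    # assumed 2 ms RTT in development
wan_ms = chatty_operation_ms(ROUND_TRIPS, 100)  # 100 ms RTT over the MPLS WAN

print(f"development: {dev_ms:.0f} ms")
print(f"WAN:         {wan_ms:.0f} ms ({wan_ms / 1000:.0f} seconds)")
```

Plugging in the round trip time of a planned deployment gives a quick lower bound on how long the same operation will take there.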
A streaming application may also take a big performance hit just from an
increase in round trip time when there is also a small amount of packet loss,
perhaps due to a congested WAN link. So understanding how the network operates
and how it affects the application is important.
Detecting Network Problems
There are several ways to identify network problems. Legacy
tools could be used to identify interfaces that have high errors, but do they
identify the interfaces that are truly impacting applications? By looking at
application performance itself, it is possible to identify the paths and
interfaces that are having the greatest impact on an end-user’s experience.
That’s where APM technologies with application dependency
mapping and network analytics become important. They allow you to identify the
interfaces that are having problems that affect the applications. They can
automate the diagnosis of user-level problems because they understand the
transport protocols and the impact of error loss, congestion, latency, and
other factors on user transactions. Some systems can even show how a change to
any of these parameters will impact specific transactions. For example, if
you’re planning to deploy an application to remote offices across the US from
an east coast data center, you can run a test case with high latency and see
what kind of application response your customers will experience.
In my next post, I’ll discuss non-network sources of
application performance problems.
