Wednesday, July 18, 2012

The Impact of Network Problems on Application Performance


By Terry Slattery

Welcome to my first blog post on apmmatters.com. I am a consultant at Chesapeake Netcraftsmen, and I've been blogging for some time at Netcraftsmen about topics related to network operations and network management. In this article, I'll focus on the network problems that impact applications: problems that are relatively common, but that few people running networks seem to acknowledge as having a significant impact on applications.

I want to start by taking a look at basic application performance and the causes of slow performance. Measuring application performance as users experience it is a fundamental APM best practice, and it goes beyond monitoring network performance metrics. Assuming this is done correctly and we verify a true end-user performance issue, how does the support team determine the root cause? Let's assume a modern, multi-tier application that includes an application user interface server, a database server, a SAN for data storage, vMotion to move the server images among several possible server hardware systems, multiple network interfaces, a multi-tier network infrastructure, and dependencies on other services like WINS or DNS for server name resolution.

It is often difficult to know where all the components are and which components are talking with which other components. For this reason, application dependency mapping is also a fundamental component of APM.  The SAN team may move the disk image from one storage system to another. There may be network contention at critical times on an important network interface. A duplex mismatch or an incorrect network teaming configuration may exist at the server’s connection to the network. Or the database queries made by the application server may be inefficient, causing large delays for some operations. A server configuration that references the address of a decommissioned DNS or WINS server may cause application slowness whenever the server attempts to use the decommissioned name server.

Network Problems That Affect Applications

Unfortunately, the IT and server teams rarely have the tools that allow them to easily determine which component of a complex application is not working correctly. There could indeed be network problems. I find that a lot of IT staff assume that 1% packet loss is too small to matter, so they ignore common sources of packet loss, expecting that the applications using that path won't be adversely affected. In fact, even a very small amount of packet loss has a big impact on TCP throughput, which in turn affects every application that depends on TCP. I recommend investigating any interface that has more than 0.0001% packet loss. The chart below shows the impact of 0.0001% packet loss on a 1Gbps link on the left. The other significant influence on throughput is the round trip time of the connection, which I've plotted as three separate curves.
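If you want to reproduce curves like these yourself, the usual back-of-the-envelope model is the Mathis et al. approximation for steady-state TCP throughput: rate ≈ (MSS / RTT) × 1 / sqrt(p). Here's a quick Python sketch; the 1460-byte MSS and the loss rate are just example inputs, and the model only holds for small, random loss:

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state TCP throughput (Mathis model):
    rate <= (MSS / RTT) * (1 / sqrt(p)). Returns bits per second."""
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))

# 1460-byte MSS at 0.0001% loss (p = 1e-6), three round-trip times
for rtt_ms in (1, 10, 100):
    bps = mathis_throughput_bps(1460, rtt_ms / 1000.0, 1e-6)
    print(f"RTT {rtt_ms:3d} ms -> {bps / 1e6:9.1f} Mbps")
```

Note what falls out of the math: at 1ms RTT the loss doesn't limit a 1Gbps link, but at 100ms RTT the same 0.0001% loss caps a single TCP flow at roughly 117 Mbps, well below the link rate.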


Error Loss

Duplex mismatch is the source of packet loss that I most frequently encounter. Many organizations still hard-code speed and duplex settings because they were burned by problems back when the standards were new and devices did not correctly auto-negotiate duplex settings. A duplex mismatch will work for low traffic volumes, but the packet loss increases significantly as the volume increases. These errors are easy to spot because the interfaces will show high FCS errors and runts on the full-duplex interface and late collisions on the half-duplex interface.
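As a quick illustration, here's a sketch of the kind of check a monitoring script could run against those counters. The counter field names are hypothetical, not any vendor's API; you'd populate them from however you scrape "show interfaces" output:

```python
def duplex_mismatch_suspect(counters):
    """Flag the classic duplex-mismatch symptom pair: FCS errors and
    runts on the full-duplex side, late collisions on the half-duplex
    side. Counter names are illustrative, not a vendor API."""
    full_duplex_symptoms = counters["fcs_errors"] > 0 and counters["runts"] > 0
    half_duplex_symptoms = counters["late_collisions"] > 0
    return full_duplex_symptoms or half_duplex_symptoms

# Example: counters typical of the full-duplex side of a mismatch
print(duplex_mismatch_suspect(
    {"fcs_errors": 812, "runts": 440, "late_collisions": 0}))
```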

I’ve also seen bad optical patch cables cause error loss. Alcohol swabs should be used on connections to remove dust and dirt from the ends of cables. Optical cable inspection microscopes should be used on questionable cables before putting them into use. Remember to practice safe optical networking and make sure that there is no laser present when you check a connector.

Note that UDP doesn’t incorporate flow control and will continue to send packets at whatever rate the application sends them. In many cases, this makes the problem worse because more packets add to network congestion and packet loss.

Congestion Loss

Interface congestion is another significant source of packet loss. Congestion is typically caused by multiple high-speed interfaces that are trying to send data over one egress interface. The egress interface may run with little congestion during off-peak hours, reducing the daily average packet loss to a percentage that makes it look like it isn’t a significant problem. However, looking at the statistics during peak hours shows packet loss that affects the applications.
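Here's a quick illustration of how daily averaging hides peak-hour loss, using made-up hourly samples:

```python
# 24 hourly packet-loss samples: near-zero off-peak, heavy loss during
# a 3-hour business peak (illustrative numbers, not real measurements)
hourly_loss_pct = [0.001] * 21 + [2.0, 2.5, 2.0]

daily_avg = sum(hourly_loss_pct) / len(hourly_loss_pct)
print(f"daily average: {daily_avg:.2f}%   worst hour: {max(hourly_loss_pct):.1f}%")
```

The daily average comes out around 0.27%, which looks ignorable, while users during the peak hours are seeing 2% or more.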


Another source of packet errors is congestion within the network hardware. In a recent consulting engagement, I found a set of servers with 1Gbps NICs that were clustered on consecutive ports of one switch interface card. The blade happened to be reasonably old, and the server traffic was congesting the ASIC that serviced that set of ports. The result was 0.1% average packet loss with much higher peaks. The applications on these servers were TCP-based, so the ASIC congestion was a key source of performance problems for every application running on those servers. The solution was either to upgrade the switch interface card or to distribute the servers across other ports on the blade so that the ASIC congestion didn't occur.

Latency

Of course, excessive latency can also have a big impact on application performance. Latency typically becomes a factor in poorly written applications that perform many back-and-forth operations. An application that requires 100 round trips to query a database for the data that it needs would work fine in the development environment where the round trip latency would be a few milliseconds. However, when the application is deployed over an MPLS WAN with 100ms round trip latency, the same function would require 10,000ms to execute. If several of these actions need to be performed in sequence, then we see a poorly performing application.
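The arithmetic is simple enough to sketch; the round-trip counts and RTT values below are the illustrative ones from the example:

```python
def chatty_app_time_ms(round_trips, rtt_ms):
    """Total time for serialized request/response pairs; each round
    trip must complete before the next begins, so elapsed time scales
    linearly with RTT."""
    return round_trips * rtt_ms

print(chatty_app_time_ms(100, 2))    # dev LAN, ~2 ms RTT
print(chatty_app_time_ms(100, 100))  # MPLS WAN, 100 ms RTT
```

The same 100-query function that takes 200ms in development takes 10 seconds over the WAN, and nothing about the code changed.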

A streaming application may also see a big performance hit when an increase in round trip time is combined with even a small amount of packet loss, perhaps due to a congested WAN link. So understanding how the network operates, and its impact on the application, is important.

Detecting Network Problems

There are several ways to identify network problems. Legacy tools can identify interfaces with high error counts, but do they identify the interfaces that are truly impacting applications? By looking at application performance itself, it is possible to identify the paths and interfaces that are having the greatest impact on the end-user's experience.

That's where APM technologies with application dependency mapping and network analytics become important. They allow you to identify the interfaces whose problems are affecting the applications. They can automate the diagnosis of user-level problems because they understand the transport protocols and the impact of error loss, congestion, latency, and other factors on user transactions. Some systems can even show how a change to any of these parameters will impact specific transactions. For example, suppose you're planning to deploy an application from the east coast data center to remote offices across the US. Run a test case with high latency and see what kind of application response your customers will experience.
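Even without a full APM suite, you can run that kind of test yourself on Linux with the netem queueing discipline. This is a sketch; the interface name is an assumption for your environment, and the commands require root:

```shell
# Inject 100 ms of delay and 0.1% random loss on eth0 (interface name
# is an assumption; adjust for your test host). Requires root.
tc qdisc add dev eth0 root netem delay 100ms loss 0.1%

# ... exercise the application and measure response times here ...

# Remove the impairment when the test is done
tc qdisc del dev eth0 root
```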

In my next post, I’ll discuss non-network sources of application performance problems.
