Wednesday, September 5, 2012

Defending the Network from Application Performance problems (part II)



In my prior blog post, I wrote about different network problems that negatively impact application performance. In this post, I’ll follow up with non-network problems that impact application performance, but for which the network provides a unique vantage point from which such problems can be identified and solved. In the next post, I’ll tie everything together by describing how to determine whether the network is at fault and how to help the other organizations understand more about application performance.

Slow Client

Many modern web-based applications push much of the user interaction work to the client workstation. Sometimes this means sending a lot of data to the workstation, where JavaScript code processes it. I’ve seen applications with long, multi-second pauses because the JavaScript had to handle hundreds or thousands of rows of data before the client display could be updated.

A good Application Performance Management (APM) system identifies clients that have these types of delays. It requires looking at the client-to-server transactions and identifying when the client is paused due to internal processing. The analysis needs to differentiate between the client workstation application pauses and the “think time” of the human who is interacting with the application.

Slow Server

The server teams don’t like to hear it, but the most common causes of slow application performance are the applications or the servers themselves. I’ve found that it frequently is not the network that is the cause, even though the network often gets the blame.

Modern applications are typically deployed on a multi-tiered infrastructure. There often is a front-end web server that talks with an application server. The application server in turn talks with a middleware server that queries one or more database servers for the data it needs. These servers may all talk with DNS servers to look up IP addresses or to map IP addresses back to server names. It only takes one of these servers having performance problems for the whole application to run slowly. The problem then becomes one of identifying the slow server among the set of servers that implement the application.

Understanding the interactions between multiple components in an application is an essential part of understanding the root cause of performance problems. This process, called Application Dependency Mapping, is typically part of an integrated APM approach, and ideally leverages information from monitoring solutions that are already in place to draw a dependency map between system components. The network provides a unique vantage point from which to derive these relationships, and as such the network team can provide strong value to the application and server teams.
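To make the idea of deriving dependencies from the network concrete, here is a minimal sketch in Python. It is not how any particular APM product works. It assumes you already have flow records reduced to (client IP, server IP, server port) tuples; the addresses, ports, and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical flow records: (client_ip, server_ip, server_port).
# In practice these would come from flow exports or a packet capture.
FLOWS = [
    ("10.0.1.10", "10.0.2.20", 8080),  # web server -> application server
    ("10.0.1.10", "10.0.2.20", 8080),
    ("10.0.2.20", "10.0.3.30", 1521),  # application server -> database
    ("10.0.2.20", "10.0.9.53", 53),    # application server -> DNS
    ("10.0.1.10", "10.0.9.53", 53),    # web server -> DNS
]


def build_dependency_map(flows):
    """Map each client to the (server, port) pairs it talks to, with a flow count."""
    deps = defaultdict(lambda: defaultdict(int))
    for client, server, port in flows:
        deps[client][(server, port)] += 1
    return {client: dict(servers) for client, servers in deps.items()}


def downstream(dep_map, host, seen=None):
    """Return every host that `host` depends on, directly or indirectly."""
    seen = set() if seen is None else seen
    for server, _port in dep_map.get(host, {}):
        if server not in seen:
            seen.add(server)
            downstream(dep_map, server, seen)
    return seen


dep_map = build_dependency_map(FLOWS)
for host in sorted(downstream(dep_map, "10.0.1.10")):
    print(host)
```

Walking the map from the front-end server lists every system whose slowness could show up as a slow application, which is the set of servers you would then need to examine.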

Although we can collect a lot of very rich information from the network, using packet capture tools to answer the question of “Is it the network or the application?” could take many, many hours of work. All the while, the application is running slow, affecting the productivity of anyone using that application.

I’ve used AppResponse Xpert to significantly reduce the time it takes to identify why an application is slow. Once you have set up the proper monitoring points and some basic configurations, it is very easy to use and provides immediate value for “the network is slow” fire drills. The information gathered by AppResponse Xpert also provides input to AppMapper Xpert, which automatically draws dependency maps of critical applications.

Identifying Database Scaling Problems

A common cause of application slowness is that the application was developed with a small data set in a development environment on a fast LAN. Then the application is rolled out to production. It may initially run with acceptable performance. But over time, as the database grows, it becomes slower and slower. In such cases, a quick analysis with AppResponse Xpert typically shows that one of the key middleware servers is making a lot of requests to a database server. One client request can result in many database requests, or perhaps in the transfer of a significant volume of data. Changing the database query to be more efficient typically solves the problem.
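One rough way to see this fan-out in captured data is to count how many database requests fall inside the time window of each client request. The sketch below assumes you have already extracted request windows and database query timestamps from a capture; the names and sample values are hypothetical, and time-window matching is only an approximation when several client requests overlap.

```python
def queries_per_request(client_requests, db_query_times):
    """Count database queries seen during each client request.

    client_requests: list of (request_id, start_seconds, end_seconds)
    db_query_times:  list of timestamps (seconds) of queries sent to the database
    """
    counts = {}
    for request_id, start, end in client_requests:
        counts[request_id] = sum(1 for t in db_query_times if start <= t <= end)
    return counts


# Hypothetical sample: a quick search and a report that hammers the database.
requests = [("search", 0.0, 1.0), ("report", 2.0, 6.0)]
db_times = [0.1, 0.2, 0.3] + [2.0 + 0.01 * i for i in range(350)]

counts = queries_per_request(requests, db_times)
for request_id, n in counts.items():
    print(f"{request_id}: {n} database queries")
```

A request that triggers hundreds of queries is a good candidate for a closer look at how the application builds its queries.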

I’ve also seen cases where a database server takes many seconds to return data to the middleware or application server. The application team can use AppResponse Xpert’s Database Monitoring module to identify the offending query. Sometimes a good development team can look at the user transaction and quickly determine which queries are likely to be the culprit. Other times, the application is making so many database queries that a SQL query analysis tool is really what is needed. In the cases I’ve seen, the queries were poorly structured, sometimes joining large tables in ways that resulted in extremely long query times on production data sets. Simply rewriting the queries dropped the query times by several orders of magnitude. This is where these tools pay off. The advantage of using deep packet inspection on the network to identify problems with SQL queries is that no overhead is added to the database. This is another example of how the network team can provide value to other IT teams.

Chatty Conversation

Another typical example of problems within the application is the chatty conversation. One application server, or perhaps the client itself, will make many, many small requests to execute one transaction on behalf of the person running the application. It runs fine as long as the network latency between the client and server is low. However, with the advent of virtualization, the server team may have configured automatic migration of the server image to a lightly loaded host. This might move a server image to a location that puts it several milliseconds further away from other servers or from its disk storage system. A few milliseconds may not seem like much, but it adds up when the application makes hundreds or perhaps thousands of small requests to complete one transaction. Suddenly, the application goes from an acceptable level of performance to unacceptable performance. Of course, database size also affects the performance because the number of small requests goes up with the database size.
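The arithmetic behind this is simple enough to sketch. The model below assumes the requests are strictly serialized, so each one waits a full round trip before the next is sent; the request count and latency figures are hypothetical, not measurements from a real application.

```python
def transaction_time_ms(requests, rtt_ms, service_ms_per_request=0.0):
    """Estimate transaction time when each request waits for the previous reply."""
    return requests * (rtt_ms + service_ms_per_request)


# Hypothetical transaction that needs 2,000 small requests.
before = transaction_time_ms(requests=2000, rtt_ms=0.5)  # servers close together
after = transaction_time_ms(requests=2000, rtt_ms=3.5)   # after a 3 ms latency increase

print(f"before migration: {before:.0f} ms")
print(f"after migration:  {after:.0f} ms")
```

Under these assumptions, adding only 3 ms of round-trip latency turns a one-second transaction into a seven-second one, even though nothing in the application or the database changed.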

You need visibility into the number of requests between systems, where the systems are connected to the network, and the delays between requests. Getting a baseline of system performance against which you can measure future performance is extremely useful for identifying whether a given application is performing as expected and possibly identifying which server needs to be examined.
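A baseline comparison can be as simple as the following sketch, which flags an application when its current median response time is well above the baseline median. The 1.5x tolerance and the sample values are arbitrary assumptions for illustration; a real deployment would pick thresholds from its own history.

```python
import statistics


def is_degraded(baseline_ms, current_ms, tolerance=1.5):
    """Return True if the current median exceeds the baseline median by `tolerance`x."""
    baseline = statistics.median(baseline_ms)
    current = statistics.median(current_ms)
    return current > tolerance * baseline


# Hypothetical response times in milliseconds.
baseline_samples = [100, 110, 105, 95, 102]
normal_day = [104, 99, 110]
bad_day = [300, 280, 310]

print(is_degraded(baseline_samples, normal_day))
print(is_degraded(baseline_samples, bad_day))
```

Using the median rather than the mean keeps a single outlier transaction from triggering a false alarm.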

This kind of examination can be automated by AppTransaction Xpert, which can capture baseline transactions from the packet store of AppResponse Xpert and predict the change in their response times given different network parameters such as latency, bandwidth, and loss rate.

Slow Network Services

Finally, the problem may be due to slow network services. These aren’t the network itself, but services that most network-based applications depend upon for proper operation. Consider an application that queries a DNS server, but the primary DNS server no longer exists, so the application must wait for the first request to time out before it tries the second DNS server. I’ve seen applications that had a 30-60 second delay upon first execution but then ran fine for a while. Periodically the application would be very slow again, and the rest of the time it would run fine. Intermittent problems are very challenging to diagnose, so this is where having something like AppResponse Xpert watching and recording all the transactions is extremely helpful. Just note the time of the slow performance and look through the recorded data around it. In this case, you would find an unanswered DNS request to the primary server, followed by a successful request to the secondary DNS server.
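If you have recorded DNS traffic reduced to simple records, finding the unanswered requests takes only a few lines. This sketch assumes each record is a dictionary with a timestamp, a kind ("query" or "response"), the DNS transaction ID, and the server address; that record layout and the sample addresses are hypothetical.

```python
def unanswered_queries(packets):
    """Return DNS queries for which no response from the same server was recorded."""
    answered = {
        (p["server"], p["txid"]) for p in packets if p["kind"] == "response"
    }
    return [
        p for p in packets
        if p["kind"] == "query" and (p["server"], p["txid"]) not in answered
    ]


# Hypothetical capture: the primary server never answers, the secondary does.
capture = [
    {"time": 0.0, "kind": "query", "txid": 4711, "server": "10.0.9.53"},
    {"time": 5.0, "kind": "query", "txid": 4711, "server": "10.0.9.53"},
    {"time": 10.0, "kind": "query", "txid": 4712, "server": "10.0.9.54"},
    {"time": 10.02, "kind": "response", "txid": 4712, "server": "10.0.9.54"},
]

for p in unanswered_queries(capture):
    print(f"t={p['time']}s: no answer from {p['server']} (txid {p['txid']})")
```

Grouping the unanswered queries by server address points directly at the DNS server that needs to be fixed or removed from the resolver configuration.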

Summary

Accurately diagnosing application performance problems can be impossible or very time consuming with the wrong tools. With the right tools and a good installation, where the tools capture the necessary data, the analysis and diagnosis can proceed very quickly. In addition, these tools not only help to defend or troubleshoot the network, but also provide value to other IT teams in the organization. I know of one site that went from not being able to help diagnose slow applications to being able to provide deep visibility into what an application is doing from the network perspective, giving the application teams real help in solving the problem.
