I recently completed an exciting engagement with a client. The story below is based directly on that experience, and anyone involved in solving application performance problems should find a lesson in it.
I’m sitting in the CEO’s conference room, along with the CIO
and the heads of all the IT departments: networking, network management,
applications, and servers, all nervously making
small talk. The CEO enters the room and starts.
“We’ve had a serious problem that looks like a network
problem, but no one has been able to determine its cause.”
The CEO looks around the room and I note the worried looks
that are exchanged among the attendees.
“We know that it exists because our customers tell us about
the slow application response. It isn’t just one application; it seems to be
all our applications and it hits hardest when we have more customers trying to
use the system. It is affecting our business and we need to fix it before I
have to report to the Board and divulge it in an analyst call.”
She continues, “If it isn’t fixed this week, I’m going to
start replacing people until we find someone who can find and fix it.”
The worried looks get more intense.
This is the type of case I enjoy. Everyone who has looked at it has failed. The stakes are high and there is a lot of finger-pointing going on. There are few facts and many possibilities. The CIO had heard from another CIO that Netcraftsmen was a company that could solve impossible problems.
I start gathering as much information as I can. “What network management tools do you have?” The answer isn’t reassuring. They have several simple tools, but nothing that can tackle this problem. It would be like using a small flashlight to find a golf ball in a stadium: it could be done, but it would take a long time. They need something more powerful, and I hope that I have the necessary tools with me.
“I’ll need a few SPAN ports set up where I can capture the application traffic,” I reply, thinking of the tools in my backpack.
The CEO looks at me coldly. “We’ll get you whatever you need. When do you think you’ll have an answer?”
“I should know something tomorrow afternoon,” I reply, hoping that I will have enough data to at least provide an update, if not an answer.
“Good,” she continues, “Ted will be your contact. He will get you anything you need. We will meet again here tomorrow afternoon for your update.”
On the walk to the computer room, Ted confides, “Our tools aren’t that great. I’ve suspected a problem for some time, but my boss won’t let me investigate. Maybe he thinks it is a network problem that will make us look bad.”
“We’ll see. I have some tools that should help identify the
problem or at least gather enough information that we can begin to narrow it
down.”
We get the SPAN ports set up and connect OPNET’s AppResponse Xpert. It starts capturing traffic, which we use to identify groups of IP addresses: one for the other servers, one for customer subnets, and one for infrastructure subnets like DNS and NTP. Grouping the traffic this way lets us classify it by business function and focus on which groups are having specific problems.
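If you want to try a similar first pass without a commercial analyzer, the idea is simple enough to sketch. Here is a minimal Python example, assuming a pcap exported from the SPAN capture; the filename and subnets are illustrative placeholders, not the client’s actual addressing.

```python
# Group captured traffic by business subnet. "capture.pcap" and the
# subnets below are illustrative assumptions, not the client's values.
from collections import Counter
from ipaddress import ip_address, ip_network

from scapy.all import IP, rdpcap  # pip install scapy

GROUPS = {
    "servers":        ip_network("10.1.10.0/24"),
    "customers":      ip_network("203.0.113.0/24"),
    "infrastructure": ip_network("10.1.53.0/24"),  # DNS, NTP, etc.
}

def classify(addr: str) -> str:
    """Return the business group an IP address belongs to."""
    ip = ip_address(addr)
    for name, net in GROUPS.items():
        if ip in net:
            return name
    return "other"

counts = Counter()
for pkt in rdpcap("capture.pcap"):
    if IP in pkt:
        counts[classify(pkt[IP].src)] += 1

for group, n in counts.most_common():
    print(f"{group}: {n} packets")
```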
As we work, I build up a picture of the network topology and configuration to help identify where problems might exist.
It doesn’t take long to gather enough data to identify a problem. All TCP sessions are exhibiting high retransmissions, an indication of packet loss. The retransmissions appear on TCP sessions to customers and on sessions to other servers. If the problem were outside the data center, it would affect only the customer connections. Since the intra-data-center connections are also affected, the problem has to be within the data center.
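The same pcap can be checked for retransmissions directly. The sketch below uses a deliberately simplified heuristic, flagging any data-carrying TCP segment whose sequence number has already been seen on that flow; real analyzers are more careful, but it approximates the per-flow retransmission rates that pointed us at packet loss.

```python
# Simplified TCP retransmission heuristic over the same capture file.
from collections import Counter, defaultdict

from scapy.all import IP, TCP, rdpcap

seen = defaultdict(set)   # flow -> sequence numbers already observed
retrans = Counter()       # flow -> retransmitted data segments
totals = Counter()        # flow -> total data segments

for pkt in rdpcap("capture.pcap"):
    if IP in pkt and TCP in pkt and len(pkt[TCP].payload) > 0:
        flow = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
        totals[flow] += 1
        if pkt[TCP].seq in seen[flow]:
            retrans[flow] += 1   # repeated sequence number: likely a retransmit
        else:
            seen[flow].add(pkt[TCP].seq)

for flow, total in totals.most_common(10):
    pct = 100.0 * retrans[flow] / total
    print(f"{flow[0]}:{flow[1]} -> {flow[2]}:{flow[3]}  {pct:.1f}% retransmitted")
```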
“Ted, the problem seems to be inside the data center. Let’s start by pulling some information from the switches. It might be an interface problem or congestion, since it always happens under high load.”
The ‘show interfaces’ output doesn’t look interesting. There are no signs of drops that would indicate congestion, and no output errors on any interface. Then I spot something that I’ve not seen before: an ingress overrun counter with a high count. The interfaces with ingress overruns are the ones connected to the servers in question. A web search tells us that an ingress overrun occurs when an incoming Layer 2 frame is dropped because the previous Layer 2 frame has not yet been transferred from the interface buffer to the system buffers. It looks like our smoking gun, but what is causing it?
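When you are hunting a counter like this across many interfaces, it is faster to scan saved CLI output than to eyeball it. Here is a small sketch that scans a saved ‘show interfaces’ capture for non-zero overrun counts; the filename is an assumption, and the regex targets the standard IOS error line (“... 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored”), so adjust it for your platform’s output format.

```python
# Scan saved 'show interfaces' output for non-zero overrun counters.
# "show_interfaces.txt" is an assumed filename for the saved output.
import re

iface = None
with open("show_interfaces.txt") as f:
    for line in f:
        m = re.match(r"^(\S+) is ", line)   # e.g. "GigabitEthernet1/2 is up, ..."
        if m:
            iface = m.group(1)
        m = re.search(r"(\d+) overrun", line)
        if m and int(m.group(1)) > 0 and iface:
            print(f"{iface}: {m.group(1)} ingress overruns")
```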
We gather more data. The servers with problems are on ports 2, 3, 6, and 7 of one blade in the switch. I look at Ted, “Are you thinking what I’m thinking?”
“Yes, I think I am. Let’s see, this is a 6148 blade. We’ve
wanted to replace these blades for some time, but we can’t get a maintenance
window to do the change because all these servers are on the same blade.
Someone always has a reason to deny the maintenance request.”
My reply is preceded by a groan, “I’ll bet that the blade
upgrade happens now. Let’s see what else is on this switch. Maybe we can
recommend that one server be moved at a time, until you can get to the point
where the blade can be replaced.”
Analysis over the rest of the day confirmed our findings. Four
high-volume servers were overrunning the ASIC that serviced the first 8 ports
on the blade. It was an old blade that should have been replaced years ago. It
was only a problem when the servers were running at high utilization. The QA
team couldn’t find a problem because they were only testing one application at
a time. Only with production traffic would all four servers be simultaneously
running at high utilization.
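The arithmetic behind the failure is worth a moment. If, as is commonly cited for these older line cards, a group of eight ports shares roughly 1 Gb/s toward the switch fabric (an assumption here, not a measurement from this switch), four busy gigabit servers can easily exceed the shared capacity:

```python
# Back-of-the-envelope oversubscription check. The shared-uplink figure
# and per-server load are assumptions for illustration, not measurements.
ports_per_asic_group = 8
shared_capacity_gbps = 1.0
busy_servers = 4
per_server_load_gbps = 0.6  # illustrative high-utilization figure

offered = busy_servers * per_server_load_gbps
print(f"Offered load: {offered:.1f} Gb/s vs {shared_capacity_gbps:.1f} Gb/s shared capacity")
print("Overruns likely" if offered > shared_capacity_gbps else "Within capacity")
```

Under light load the oversubscription is invisible, which is exactly why one-application-at-a-time testing never reproduced it.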
The meeting the next afternoon went smoothly and everyone
was relieved. It wasn’t really anyone’s fault because the networking team was
never given a maintenance window to replace the blade. The CEO was able to get
the stakeholders to determine an acceptable maintenance window so that the
upgrade could happen. Of course, we were heroes for finding the problem so
quickly.
At the meeting, the CEO asked, “What tools did you use to
find the problem so quickly?”
“It is a combination of experience and OPNET AppResponse
Xpert. We use it frequently in our network assessments and in troubleshooting
challenging problems like the one here.”
She turned to the CIO, “Bob, put that tool on your purchase
list. I never want to be in this situation again. I’ll set aside funds for the
purchase, just let me know how much it will be.”
Another case successfully solved with OPNET AppResponse
Xpert!