I recently completed an exciting engagement with a client. The story below is based directly on that experience, and anyone involved in solving application performance problems should find a lesson in it.
I’m sitting in the CEO’s conference room, along with the CIO
and the heads of all the IT departments: networking, network management,
applications, and servers, all nervously making
small talk. The CEO enters the room and starts.
“We’ve had a serious problem that looks like a network
problem, but no one has been able to determine its cause.”
The CEO looks around the room and I note the worried looks
that are exchanged among the attendees.
“We know that it exists because our customers tell us about
the slow application response. It isn’t just one application; it seems to be
all our applications and it hits hardest when we have more customers trying to
use the system. It is affecting our business and we need to fix it before I
have to report to the Board and divulge it in an analyst call.”
She continues, “If it isn’t fixed this week, I’m going to
start replacing people until we find someone who can find and fix it.”
The worried looks get more intense.
This is the type of case I enjoy. Everyone who has looked at it has failed. The stakes are high and there is a lot of finger-pointing going on. There are few facts and many possibilities. The CIO had heard from another CIO that Netcraftsmen was a company that could solve impossible problems.
I start gathering as much information as I can. “What network management tools do you have?” The answer isn’t reassuring. They have several simple tools, but nothing that can tackle this problem. It would be like using a small flashlight to find a golf ball in a stadium: it could be done, but it would take a long time. They need something more powerful, and I hope that I have the necessary tools with me.
“I’ll need a few SPAN ports set up where I can capture the application traffic,” I reply, thinking of the tools in my backpack.
The CEO looks at me coldly. “We’ll get you whatever you need. When do you think you’ll have an answer?”
“I should know something tomorrow afternoon,” I reply, hoping that I will have enough data to at least provide an update, if not an answer.
“Good,” she continues, “Ted will be your contact. He will get you anything you need. We will meet again here tomorrow afternoon for your update.”
On the walk to the computer room, Ted confides, “Our tools aren’t that great. I’ve suspected a problem for some time, but my boss won’t let me investigate. Maybe he thinks it is a network problem that will make us look bad.”
“We’ll see. I have some tools that should help identify the
problem or at least gather enough information that we can begin to narrow it
down.”
We get the SPAN ports set up and connect OPNET’s AppResponse Xpert. It starts capturing traffic, which we use to identify groups of IP addresses: one for the other servers, one for customer subnets, and one for infrastructure subnets like DNS and NTP. Grouping the traffic this way lets us classify it by business function and focus on which groups are having specific problems.
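If you want to try a similar first pass without a commercial analyzer, the idea is simple enough to sketch. Here is a minimal Python example, assuming a pcap exported from the SPAN capture; the filename and subnets are illustrative placeholders, not the client’s actual addressing.

```python
# Group captured traffic by business subnet. "capture.pcap" and the
# subnets below are illustrative assumptions, not the client's values.
from collections import Counter
from ipaddress import ip_address, ip_network

from scapy.all import IP, rdpcap  # pip install scapy

GROUPS = {
    "servers":        ip_network("10.1.10.0/24"),
    "customers":      ip_network("203.0.113.0/24"),
    "infrastructure": ip_network("10.1.53.0/24"),  # DNS, NTP, etc.
}

def classify(addr: str) -> str:
    """Return the business group an IP address belongs to."""
    ip = ip_address(addr)
    for name, net in GROUPS.items():
        if ip in net:
            return name
    return "other"

counts = Counter()
for pkt in rdpcap("capture.pcap"):
    if IP in pkt:
        counts[classify(pkt[IP].src)] += 1

for group, n in counts.most_common():
    print(f"{group}: {n} packets")
```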
As we work, I build up a picture of the network topology and configuration to help identify where problems might exist.
It doesn’t take long to gather enough data to identify a problem. All TCP sessions are exhibiting high retransmissions, an indication of packet loss. The retransmissions appear on TCP sessions to customers and on sessions to other servers. If the problem were outside the data center, it would affect only the customer connections. Since the intra-data-center connections are also affected, the problem has to be within the data center.
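The same pcap can be checked for retransmissions directly. The sketch below uses a deliberately simplified heuristic, flagging any data-carrying TCP segment whose sequence number has already been seen on that flow; real analyzers are more careful, but it approximates the per-flow retransmission rates that pointed us at packet loss.

```python
# Simplified TCP retransmission heuristic over the same capture file.
from collections import Counter, defaultdict

from scapy.all import IP, TCP, rdpcap

seen = defaultdict(set)   # flow -> sequence numbers already observed
retrans = Counter()       # flow -> retransmitted data segments
totals = Counter()        # flow -> total data segments

for pkt in rdpcap("capture.pcap"):
    if IP in pkt and TCP in pkt and len(pkt[TCP].payload) > 0:
        flow = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
        totals[flow] += 1
        if pkt[TCP].seq in seen[flow]:
            retrans[flow] += 1   # repeated sequence number: likely a retransmit
        else:
            seen[flow].add(pkt[TCP].seq)

for flow, total in totals.most_common(10):
    pct = 100.0 * retrans[flow] / total
    print(f"{flow[0]}:{flow[1]} -> {flow[2]}:{flow[3]}  {pct:.1f}% retransmitted")
```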
“Ted, the problem seems to be inside the data center. Let’s start by pulling some information from the switches. It might be an interface problem or congestion, since it always happens under high load.”
The ‘show interfaces’ output doesn’t look interesting. There are no signs of drops that would indicate congestion, and no output errors on any interface. Then I spot something that I’ve not seen before: an ingress overrun counter with a high count. The interfaces with ingress overruns are the ones connected to the servers in question. A web search tells us that an ingress overrun occurs when an incoming Layer 2 frame is dropped because the previous Layer 2 frame has not yet been transferred from the interface buffer to the system buffers. It looks like our smoking gun, but what is causing it?
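When you are hunting a counter like this across many interfaces, it is faster to scan saved CLI output than to eyeball it. Here is a small sketch that scans a saved ‘show interfaces’ capture for non-zero overrun counts; the filename is an assumption, and the regex targets the standard IOS error line (“... 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored”), so adjust it for your platform’s output format.

```python
# Scan saved 'show interfaces' output for non-zero overrun counters.
# "show_interfaces.txt" is an assumed filename for the saved output.
import re

iface = None
with open("show_interfaces.txt") as f:
    for line in f:
        m = re.match(r"^(\S+) is ", line)   # e.g. "GigabitEthernet1/2 is up, ..."
        if m:
            iface = m.group(1)
        m = re.search(r"(\d+) overrun", line)
        if m and int(m.group(1)) > 0 and iface:
            print(f"{iface}: {m.group(1)} ingress overruns")
```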
We gather more data. The servers with problems are on ports 2, 3, 6, and 7 of one blade in the switch. I look at Ted, “Are you thinking what I’m thinking?”
“Yes, I think I am. Let’s see, this is a 6148 blade. We’ve
wanted to replace these blades for some time, but we can’t get a maintenance
window to do the change because all these servers are on the same blade.
Someone always has a reason to deny the maintenance request.”
My reply is preceded by a groan, “I’ll bet that the blade
upgrade happens now. Let’s see what else is on this switch. Maybe we can
recommend that one server be moved at a time, until you can get to the point
where the blade can be replaced.”
Analysis over the rest of the day confirmed our findings. Four
high-volume servers were overrunning the ASIC that serviced the first 8 ports
on the blade. It was an old blade that should have been replaced years ago. It
was only a problem when the servers were running at high utilization. The QA
team couldn’t find a problem because they were only testing one application at
a time. Only with production traffic would all four servers be simultaneously
running at high utilization.
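The arithmetic behind the failure is worth a moment. If, as is commonly cited for these older line cards, a group of eight ports shares roughly 1 Gb/s toward the switch fabric (an assumption here, not a measurement from this switch), four busy gigabit servers can easily exceed the shared capacity:

```python
# Back-of-the-envelope oversubscription check. The shared-uplink figure
# and per-server load are assumptions for illustration, not measurements.
ports_per_asic_group = 8
shared_capacity_gbps = 1.0
busy_servers = 4
per_server_load_gbps = 0.6  # illustrative high-utilization figure

offered = busy_servers * per_server_load_gbps
print(f"Offered load: {offered:.1f} Gb/s vs {shared_capacity_gbps:.1f} Gb/s shared capacity")
print("Overruns likely" if offered > shared_capacity_gbps else "Within capacity")
```

Under light load the oversubscription is invisible, which is exactly why one-application-at-a-time testing never reproduced it.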
The meeting the next afternoon went smoothly and everyone
was relieved. It wasn’t really anyone’s fault because the networking team was
never given a maintenance window to replace the blade. The CEO was able to get
the stakeholders to determine an acceptable maintenance window so that the
upgrade could happen. Of course, we were heroes for finding the problem so
quickly.
At the meeting, the CEO asked, “What tools did you use to
find the problem so quickly?”
“It is a combination of experience and OPNET AppResponse
Xpert. We use it frequently in our network assessments and in troubleshooting
challenging problems like the one here.”
She turned to the CIO, “Bob, put that tool on your purchase
list. I never want to be in this situation again. I’ll set aside funds for the
purchase, just let me know how much it will be.”
Another case successfully solved with OPNET AppResponse
Xpert!