by Jon
I am often asked how a general APM practitioner can solve very specific problems that require deep domain expertise outside his or her comfort zone. My recent experience at a national household appliance manufacturer is a great example to illustrate the answer. They reported intermittent yet extreme slowness on a critical IIS/.NET application. After months of unsuccessful troubleshooting on each supporting server, their business could not afford to wait any longer, so they called my employer to apply APM methodologies, and I was assigned.
I deployed OPNET AppInternals Xpert to monitor thousands of performance metrics per second on each server related to the application, collecting metrics from IIS, .NET classes and methods, DB queries, SQL Server, and the Windows and Solaris operating systems. I collected data for a couple of hours during their peak usage time.
Although most people want to immediately jump to “the solution”, the first step is to clearly identify “the problem” because intermittent slowness is a bit too vague to be useful. AppInternals Xpert had measured every transaction from every user for the last several hours, and reported not only the average response times, but the 95th percentiles that allowed me to identify the transactions that were “OK most of the time, but intermittently much slower”. Only the ViewReport.aspx transaction stood out as being problematic. Trending ViewReport.aspx’s performance over time clearly showed intermittent performance spikes. We now had a very clear problem definition, so the next step was to identify the root cause.
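To illustrate why percentiles matter here, the following is a minimal sketch (not AppInternals Xpert's own logic, which is not described in this post) of flagging transactions that are "OK most of the time, but intermittently much slower". It compares each transaction's 95th percentile with its median. The sample data, the `Home.aspx` transaction name, the function name and the ratio threshold of 3 are all assumptions made for illustration.

```python
from statistics import mean, quantiles

def flag_intermittent(samples_by_txn, ratio_threshold=3.0):
    """Return transactions whose 95th percentile far exceeds their median.

    ratio_threshold is an arbitrary, illustrative cut-off.
    """
    flagged = {}
    for name, samples in samples_by_txn.items():
        # quantiles(n=100) returns 99 cut points:
        # index 49 is the median, index 94 is the 95th percentile.
        cuts = quantiles(samples, n=100, method="inclusive")
        p50, p95 = cuts[49], cuts[94]
        if p50 > 0 and p95 / p50 >= ratio_threshold:
            flagged[name] = {"mean": mean(samples), "p50": p50, "p95": p95}
    return flagged

# Hypothetical response times in seconds (not the customer's data).
response_times = {
    "Home.aspx": [0.4, 0.5, 0.6, 0.5] * 5,
    "ViewReport.aspx": [1.0] * 18 + [30.0, 30.0],
}

flagged = flag_intermittent(response_times)
for name, stats in flagged.items():
    print(name, stats)
```

In this made-up data set the mean of ViewReport.aspx is under 4 seconds, which hides the fact that some requests took 30 seconds; the 95th percentile makes those outliers visible.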
I started by looking at the “usual suspects”, the metrics most frequently associated with performance issues (CPU load, CPU and I/O queuing, etc.), but none of these showed any problems. I had exhausted my personal ability to solve this issue… unless I had guidance from someone, or something, else.
I then leveraged a very powerful statistical correlation feature in AppInternals Xpert to identify the “unusual suspects” – those things that I did not know to look for and those I knew to look for but forgot. The feature allowed me to very easily right-click on the response time chart of ViewReport.aspx and simply choose “Show Correlated Metrics”. AppInternals Xpert then mined through the tens of thousands of metrics it collected to identify those that had a behavior pattern similar to the response time of ViewReport.aspx. Only one metric stood out as being highly correlated: “DNLC cache misses” on the Solaris fileserver. I had never heard of this metric, yet it was clear that it was related to our problem. Every single time ViewReport.aspx on the Windows IIS server was slow, DNLC cache misses on the Solaris fileserver increased.
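The general idea of ranking metrics by how closely they track a response-time series can be sketched with a plain Pearson correlation. This is only an illustration of the concept: the post does not say which statistical method AppInternals Xpert actually uses, and the metric names and values below are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(vx * vy)

def rank_correlated(target, metrics, top=3):
    """Rank metrics by absolute correlation with the target series."""
    scores = {name: pearson(target, series) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

# Hypothetical samples taken at the same eight points in time.
view_report_seconds = [1, 1, 30, 1, 1, 28, 1, 1]
metrics = {
    "iis.cpu_pct": [40, 42, 41, 39, 40, 41, 42, 40],
    "sql.io_queue": [1, 2, 1, 2, 1, 2, 1, 2],
    "solaris.dnlc_misses": [5, 6, 900, 4, 5, 850, 6, 5],
}

ranked = rank_correlated(view_report_seconds, metrics)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With tens of thousands of metrics, a ranking like this surfaces candidates that nobody thought to chart by hand. Correlation alone does not prove causation, so the top hits still need the kind of follow-up research described next.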
Some quick research on the Internet about Solaris’ DNLC (Directory Name Lookup Cache) helped me get closer to finding the root cause. I learned that it’s an index to the location of files, similar to a DB index. However, the index is built as files are accessed, so there’s always a delay the first time a file is opened, especially if the directory contains thousands of files. This is very similar to a “Table Scan” on a database.
Armed with this information, I asked the customer if they had any directories on the Solaris fileserver that contained thousands of files. They informed me that their report directory contained over 90,000 PDF files: we had found our root cause! The solution was to use a nested directory structure where the directory names corresponded to part of the filename (e.g. /abc/xyz/abcxyz123.pdf), thus reducing the number of files in any single directory. This simple solution eliminated the DNLC delay, and response time returned to a very consistent and acceptable level.
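A nested layout like this can be derived from the filename alone. The sketch below is one possible way to do it, not the customer's actual implementation: the `/reports` root, the function name, and the choice of two levels of three characters each are assumptions chosen to match the /abc/xyz/abcxyz123.pdf example.

```python
from pathlib import PurePosixPath

def nested_report_path(root, filename, levels=2, width=3):
    """Build a nested path whose directory names come from the filename.

    Each directory level uses the next `width` characters of the
    filename (without its extension).
    """
    stem = PurePosixPath(filename).stem
    parts = [stem[i * width:(i + 1) * width] for i in range(levels)]
    parts = [p for p in parts if p]  # very short names get fewer levels
    return PurePosixPath(root, *parts, filename)

# "/reports" is a hypothetical root directory.
print(nested_report_path("/reports", "abcxyz123.pdf"))
# /reports/abc/xyz/abcxyz123.pdf
```

Because the path is computed from the name, the application can locate any report without searching, and no single directory has to hold all 90,000 files.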
With just a few hours of effort using OPNET AppInternals Xpert we were able to solve an issue that had hurt the customer’s business for months!
With other general monitoring & troubleshooting tools, your success at solving problems is often heavily dependent on the subject matter expertise of the person doing the troubleshooting. Unless you know what questions to ask, you won’t get the correct answer. Perhaps more important: the right kind of domain expert is often not even engaged until many others try and fail. More automated solutions like OPNET AppInternals Xpert guide the troubleshooter to the right problem domain with the specific information required to fix the root cause. This is one of the main reasons we hear about dramatic APM success stories in which problems are solved in hours rather than days or weeks or, in this case, months.