Wednesday, July 20, 2011

Making Sense of “End User Experience Monitoring”


By Russ
While I can appreciate the need to adapt marketing to whatever the latest trends are, I also believe it’s important to avoid misleading marketing. Have you noticed the number of IT management vendors that claim to do APM and end user experience monitoring? Many of these vendors are jumping on a bandwagon they don’t belong on.
Even as an expert in the field, I find it difficult to differentiate the various products based on reading websites and brochures. Some of the claims I’ve read are fairly dubious. To organize the different approaches in my mind, I came up with a taxonomy. Hopefully you find it as useful as I do.
In my taxonomy, there are 5 levels of user response time measurement, plus a "Level 0" that doesn't really qualify as one. There may be different approaches to collecting the data for each level, but it's the output that really matters. What is really being measured? The levels progress from least to most sophisticated.

  • Level 0 : Resource Utilization. This really shouldn't be listed as a level but I see many vendors trying to sneak this through. They claim that knowing when links or CPUs are heavily utilized is somehow indicative of user response time problems. I agree that resource utilization is important for troubleshooting, but it is a bad indicator of user experience. Everyone knows 100% is bad. But is 90% OK? What about 70%? Does 30% mean everyone is happy? Don’t plenty of application problems occur when link and CPU are both below 50%? Don't mistake SNMP or NetFlow tools for user response time monitoring solutions.
  • Level 1 : Network Round-Trip Time. Many of the legacy network monitoring vendors are now claiming that their round-trip time measurements are actually the same as user experience measurements. You can generally recognize these tools by the numbers they report. The vendors show demos in which response time jumps from 100ms to 140ms. They claim that this change indicates a problem with user experience. On its surface, this doesn't make sense. Do you think you would notice if a web page response time changed by 4/100ths of a second? Round-trip time is effectively "ping", which can be very useful when troubleshooting network problems, but it really does not represent user experience.
  • Level 2 : Round Trip with Server Processing. Some network monitoring vendors go one step further. They calculate both network round-trip and server delay for the specific request. This can be useful to differentiate between network and server slowdowns but it really isn't reporting on user experience either. Let me give you an example. You download a 10GB file. The network round-trip is 100ms. And the server processes the request for 2 seconds before streaming the file. A level 2 approach will result in a 2.1 second measurement even though the entire file may take 10 minutes to download.
  • Level 3 : Object-Level Response time. This is really the first level at which a vendor can credibly claim to focus on end-user experience. At Level 3, a tool is measuring an entire transaction from start to finish: from the time the client initiates the request to the time the final piece of data is delivered. For the example in Level 2 above, this method would correctly report a response time of 10 minutes because it incorporates the network round trip, the server delay, and the data delivery. For many applications (email, file transfer, etc.) this may be all you need.
  • Level 4 : Page-Level Response Time. Web applications, which we see more of every day, usually cause multiple object-level transactions for a single user transaction. Go to almost any web page, and you will notice it is made up of dozens of smaller objects (pictures, text, icons, menus, etc.). To measure the user experience of web pages, a tool must be able to stitch all of the individual objects into the larger page. So the ultimate measurement of the user experience is from the start of the first object to the final delivery of the last object. If the people you support are clicking on important web pages, this is the level you need to be at to most directly manage performance.
  • Level 5 : Business-Level Transactions. As you use a web application, you don't just randomly hit individual pages (unless you’ve had too many martinis or are just bored). Most people step through specific pages in a sequence to complete a specific task. Think about checking out of an online store (view shopping cart, submit order, billing info, shipping info, confirmation, etc.). Each one of these steps may involve one or more web pages. Level 5 follows users through these business transaction sequences to report on user activity as well as user performance. It is important to know if any particular kind of business transaction is slow or is causing many people to abort. Business people, not just IT people, are often interested in the results from this level because business transactions and user activity are more directly tied to business metrics.
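To make the difference between Levels 2 through 5 a bit more concrete, here is a rough Python sketch of what each level would report given the same raw timings. Everything in it is my own illustration and not any vendor's method: the `ObjectTiming` record, its field names, the function names, and the sample numbers are all hypothetical, and a real tool would have to derive these values from packet captures or browser instrumentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ObjectTiming:
    """Hypothetical timing record for one request/response pair (seconds)."""
    start: float         # when the client sent the request
    network_rtt: float   # network round-trip time
    server_delay: float  # server processing time before the first byte
    transfer: float      # time to deliver the data after the first byte


def level2_time(obj: ObjectTiming) -> float:
    """Level 2: network round trip plus server processing only."""
    return obj.network_rtt + obj.server_delay


def level3_time(obj: ObjectTiming) -> float:
    """Level 3: the full object transaction, including data delivery."""
    return obj.network_rtt + obj.server_delay + obj.transfer


def level4_time(page: List[ObjectTiming]) -> float:
    """Level 4: start of the first object to final delivery of the last."""
    first_start = min(o.start for o in page)
    last_finish = max(o.start + level3_time(o) for o in page)
    return last_finish - first_start


def abort_rate(sessions: List[List[str]], final_step: str) -> float:
    """Level 5: fraction of sessions that never reached the final step."""
    aborted = sum(1 for steps in sessions if final_step not in steps)
    return aborted / len(sessions)


# The large file download from the Level 2 example (assumed numbers):
# 100ms round trip, 2 seconds of server time, roughly 10 minutes in total.
big_file = ObjectTiming(start=0.0, network_rtt=0.1, server_delay=2.0, transfer=597.9)
print(f"Level 2: {level2_time(big_file):.1f} s")
print(f"Level 3: {level3_time(big_file):.1f} s")

# A made-up web page built from three objects with overlapping downloads.
page = [
    ObjectTiming(start=0.0, network_rtt=0.1, server_delay=0.2, transfer=0.3),
    ObjectTiming(start=0.5, network_rtt=0.1, server_delay=0.1, transfer=1.0),
    ObjectTiming(start=0.5, network_rtt=0.1, server_delay=0.1, transfer=0.2),
]
print(f"Level 4: {level4_time(page):.1f} s")

# Made-up checkout sessions: the steps each user actually reached.
sessions = [
    ["cart", "order", "billing", "shipping", "confirmation"],
    ["cart", "order"],
    ["cart", "order", "billing", "shipping", "confirmation"],
    ["cart"],
]
print(f"Abort rate: {abort_rate(sessions, 'confirmation'):.0%}")
```

For the file download, the Level 2 number comes out at about 2.1 seconds while the Level 3 number is the full 10 minutes (600 seconds), which is exactly the gap described above. The page-level number is not the sum of the object times, because objects download in parallel; it is simply the span from the earliest start to the latest finish.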
In the end, all of these levels are useful at different times to different teams. They provide a wide range of data to enable high-level reporting as well as deep-dive troubleshooting. Some vendors focus only on the top levels (3-5). Some vendors focus only on the low levels (0-2), and these seem to be the ones that are adding the most confusion to the market. Hopefully, this taxonomy will help you differentiate for yourself.