Modern monitoring software makes it easy to plot a statistic like average latency every minute—too easy. Fancy dashboards of time series plots often lull us into a false sense of security. Underneath every point on those plots is a distribution, and underneath that distribution is a series of individuals: your customers. If you don’t take the time to look deeply at your data, you don’t truly understand your business.
John has been extracting value from large datasets for over 15 years at companies ranging from hedge funds to small data-driven startups to Amazon.com. He has deep experience in machine learning, data visualization, website performance, and real-time fault analysis. An empiricist at heart, John’s optimism and can-do attitude make “Just do the experiment!” his favorite call to arms.
Comments
Excellent talk. Simple, relevant to everybody in the room and very well delivered. A+
Nice session. However, I would like to add something about the particular problem John used to illustrate looking at the raw data. I don’t dispute that logs sometimes need to be pored over to find the exact issue, but I think that should happen at the very last stage, when you already know which log, or better, which part of a log, you want to look into. In John’s case, I would have liked to first get to the point of knowing which log I was looking for. The way to do that is to gather more graphical data sliced and diced by different contexts, such as the host in this particular example. With a graph that tracked latencies per host, you could easily have spotted the particular host or group of hosts as the outlier. This is precisely what we do at my company. There is a downside, though: how many contexts can you allow? You have to track latencies per host, per web service call, per partner, per customer, and so on, and the list can get very long. Then there is the issue of how many values a particular context can have. If you track latencies per host and there are a million hosts, your graph could have a million lines (assuming one line per host), which makes even rendering the graph a challenge. Overall, though, slicing and dicing the data by different contexts helps a lot. A single graph almost never gives you the full picture; it has to be read alongside other graphs that show the same data sliced by different contexts.
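The per-host slicing described in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the host names, the latency values, the `outlier_hosts` helper, and the `factor=3.0` threshold are all hypothetical and do not come from the talk or from the commenter's system.

```python
from collections import defaultdict
from statistics import median


def latencies_by_host(records):
    """Group (host, latency_ms) pairs into a dict of host -> list of latencies."""
    groups = defaultdict(list)
    for host, latency_ms in records:
        groups[host].append(latency_ms)
    return dict(groups)


def outlier_hosts(records, factor=3.0):
    """Return hosts whose median latency exceeds `factor` times the
    median of all per-host medians. The threshold is an arbitrary
    illustrative choice, not a recommended value."""
    per_host = {h: median(v) for h, v in latencies_by_host(records).items()}
    fleet_median = median(per_host.values())
    return sorted(h for h, m in per_host.items() if m > factor * fleet_median)


# Made-up sample data: one host is roughly ten times slower than the others.
records = [
    ("web-01", 100), ("web-01", 110), ("web-01", 90),
    ("web-02", 105), ("web-02", 95), ("web-02", 100),
    ("web-03", 900), ("web-03", 1100), ("web-03", 1000),
]

print(outlier_hosts(records))  # prints ['web-03']
```

Summarizing each host with a single number, rather than drawing one line per host, is one way to sidestep the rendering problem the comment raises when a context has very many values.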
Excellent session. I am always amazed by the things I find when I spend time delving more deeply into our own metrics or logs.
John delivered two AMAZING talks last year. I’m super excited to have him back again. I love the theme of this talk. I’ve always encouraged performance engineers to take a few hours digging into an anomaly to figure out “why?”. Even if it’s not what you expected, it’s usually illuminating – as is John.