The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and Applications

Dave Mangot (Salesforce.com), Peter Phaal (InMon Corp.)
Operations Mission City B1

This session should have broad appeal: it provides good background on the technology for beginners and will feel very familiar to experts running such applications at scale. If you’ve played with Ganglia gmetric and are familiar with tcpdump and a scripting language, you can get some amazing metrics and results from sFlow.

Subtopics

Background – Peter will present the history of, and the technical background behind, the sFlow standard. He will describe the advantages of the technology in terms of overhead, scalability, data structures, efficiency, and performance.

Implementation – Dave and Peter will describe what it takes to roll out Host sFlow-enabled applications across the data center or the cloud. Peter will discuss how sFlow-enabled applications communicate with the collectors. Dave will show examples of rolling out sFlow via Puppet, and how Tagged configured its architecture to feed real-time sFlow data into its various Ganglia clusters in a completely automated fashion, using both native Host sFlow and sflowtool (which works very much like tcpdump).

Example Architecture

Application level metrics – Dave will discuss Apache, memcached, and Java rollouts of sFlow monitoring at Tagged, a site with 350 million registered users and 5 billion page views a month, and how this gives us traffic metrics, hot and missed memcached keys, heap utilization, operation timings, and bandwidth utilization on a per-protocol basis, with absolutely no post-processing required.

Conclusion: By using metric collection based on the sFlow standard, you can quickly and easily develop an extremely detailed and granular dashboard of your network, system, and application performance, with minimal overhead, for free.

Examples of areas to be covered during the talk:

Apache

Apache stats would normally be calculated by processing the output of /server-status/. With mod_sflow (http://code.google.com/p/mod-sflow/) instead, this data is streamed directly into Ganglia at any interval we choose.

Realtime Ganglia graph of Apache Requests by Method
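
For the sflowtool route into Ganglia, a per-method request rate could be computed from sampled requests and handed to gmetric. The following is only a minimal sketch under stated assumptions: the metric names (`http_method_GET` and so on), the `interval_s` and `sampling_rate` parameters, and the helper name are hypothetical and not taken from the talk. It builds gmetric argument lists but does not run them.

```python
from collections import Counter


def gmetric_commands(methods, interval_s, sampling_rate=1):
    """Turn sampled HTTP methods into gmetric argv lists (requests/sec per method).

    methods       -- iterable of HTTP method strings, one per sampled request
    interval_s    -- length of the reporting interval in seconds (assumed)
    sampling_rate -- 1-in-N sampling rate used to scale counts (assumed)
    """
    counts = Counter(m.upper() for m in methods)
    commands = []
    for method, n in sorted(counts.items()):
        rate = n * sampling_rate / interval_s
        commands.append([
            "gmetric",
            "--name", f"http_method_{method}",  # hypothetical metric name
            "--value", f"{rate:.2f}",
            "--type", "float",
            "--units", "req/sec",
        ])
    return commands


cmds = gmetric_commands(["GET", "GET", "POST", "get"], interval_s=15)
for argv in cmds:
    print(" ".join(argv))
```

Each list could be passed to `subprocess.run` on a host where gmetric is installed; printing them first makes it easy to check names and values before anything reaches a Ganglia cluster.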

If we use the output from sflowtool (which works like tcpdump), we can even build a table of the URLs on our site that take the longest to serve, just as network sFlow can show us the top talkers on our network.

HTTP URLs by Duration

We can see in this chart that ‘upload a photo’ on our social networking site takes the longest amount of time, which makes perfect sense. Maybe we should investigate why two of our ‘Pets’ pages (Pets is our most popular game) take so long. In this case SREs can work with Engineering, show them what they’ve discovered, and even give them access to the data. A great opportunity for DevOps collaboration, and probably for some optimization!
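
A table like the one described above could be built by grouping sampled HTTP operations by URL and averaging their durations. This is a rough sketch, not the script used at Tagged: it assumes line-oriented "key value" text with each sample bracketed by `startSample` and `endSample` lines, and it assumes the field names `http_uri` and `http_duration_uS`. Check the output of your sflowtool version before relying on either assumption.

```python
from collections import defaultdict


def parse_samples(lines):
    """Yield one dict per sample from 'key value' lines.

    Assumes samples are bracketed by startSample / endSample marker lines.
    """
    sample = None
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if key == "startSample":
            sample = {}
        elif key == "endSample":
            if sample is not None:
                yield sample
            sample = None
        elif sample is not None:
            sample[key] = value


def top_urls_by_duration(samples, n=10):
    """Return the n URLs with the highest mean duration in microseconds."""
    totals = defaultdict(lambda: [0, 0])  # uri -> [sum of durations, count]
    for s in samples:
        if "http_uri" in s and "http_duration_uS" in s:
            entry = totals[s["http_uri"]]
            entry[0] += int(s["http_duration_uS"])
            entry[1] += 1
    means = [(uri, total / count) for uri, (total, count) in totals.items()]
    return sorted(means, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up input for illustration only.
EXAMPLE = """\
startSample ----------------------
http_uri /photo/upload
http_duration_uS 900000
endSample   ----------------------
startSample ----------------------
http_uri /home
http_duration_uS 20000
endSample   ----------------------
startSample ----------------------
http_uri /photo/upload
http_duration_uS 700000
endSample   ----------------------
"""

top = top_urls_by_duration(parse_samples(EXAMPLE.splitlines()))
for uri, mean_us in top:
    print(f"{uri}\t{mean_us / 1000:.1f} ms")
```

In practice the lines would come from a pipe attached to sflowtool rather than a string, and the totals would be reset on each reporting interval.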

Java

Here we can see a real-time graph of JVM heap allocation vs. utilization. It turns out that we periodically had to restart this app, but no one knew why. With sFlow-instrumented Java we can see fine-grained detail of heap utilization. The graph shows that the garbage collector is unable to bring the heap on this application back to a steady state; over time, it grows and grows until the machine starts swapping. After the restart, garbage collection is able to keep up again.
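
The pattern described above, a heap that never returns to its earlier floor after garbage collection, can also be checked numerically. The sketch below is an illustrative heuristic rather than anything from the talk: it treats local minima in a series of heap-used samples as post-GC floors and reports how much the floor rises per trough. The function names and sample values are made up.

```python
def gc_troughs(used):
    """Return heap-used values at local minima (approximate post-GC floors)."""
    return [
        used[i]
        for i in range(1, len(used) - 1)
        if used[i] < used[i - 1] and used[i] <= used[i + 1]
    ]


def floor_growth(used):
    """Average rise per trough between the first and last trough.

    Returns 0.0 when fewer than two troughs are found.
    """
    troughs = gc_troughs(used)
    if len(troughs) < 2:
        return 0.0
    return (troughs[-1] - troughs[0]) / (len(troughs) - 1)


# Heap used in MB at successive samples (made-up numbers).
healthy = [100, 300, 110, 310, 105, 305, 108, 300]
leaking = [100, 300, 150, 350, 210, 400, 280, 450]

print(floor_growth(healthy))  # -1.0: floor is flat
print(floor_growth(leaking))  # 65.0: floor keeps rising
```

A steadily positive value over a long window is the signal worth alerting on; a single noisy trough should not be.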

Before Host sFlow, Dave used to get these kinds of metrics by polling: running jstat (from the JDK) against every JVM, parsing its output, and feeding that information into rrdtool. The other way to solve this problem would be to set up a dedicated JMX poller that retrieves this information from a designated list of hosts and feeds it into Ganglia. Unfortunately, this doesn’t scale to any large environment; consider Netflix in EC2. When you have potentially thousands of machines disappearing and appearing on the network within minutes, you can only refresh the host list on your poller so often. With sFlow instrumentation of the JVM, all data is pushed to your collectors as often as you like, with no polling necessary.
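
The difference between the two models can be illustrated with a toy collector that learns about hosts only from the data they push, so there is no host list to refresh. This is a conceptual sketch, not how any real sFlow collector is implemented; the class name, the timeout, and the host names are hypothetical.

```python
class PushRegistry:
    """Tracks hosts by the time of their last pushed datagram.

    No host list is configured: a host exists once it sends data and
    is forgotten after staying silent for longer than timeout_s.
    """

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def receive(self, host, now):
        """Record that a datagram arrived from host at time now (seconds)."""
        self.last_seen[host] = now

    def active_hosts(self, now):
        """Drop hosts that have gone silent and return the remaining ones."""
        self.last_seen = {
            host: seen
            for host, seen in self.last_seen.items()
            if now - seen <= self.timeout_s
        }
        return sorted(self.last_seen)


registry = PushRegistry(timeout_s=60)
registry.receive("jvm-a", now=0)   # instance appears
registry.receive("jvm-b", now=10)  # another instance appears
registry.receive("jvm-b", now=70)  # jvm-a has since been terminated
print(registry.active_hosts(now=75))  # ['jvm-b']
```

A poller would need someone or something to add `jvm-b` and remove `jvm-a` from its configuration; here both happen as a side effect of the data flow.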

There are a number of other metrics that the sFlow Java standard collects, such as the number of open file descriptors and threads. These are good indicators of file activity, network activity, and so on, and can also be demonstrated.

Memcached

Memcached has been a black box for systems administrators and developers for many years. There is a limited number of statistics you can get using the ‘STATS’ interface to memcached, and those statistics are about the memcached instance as a whole, not about individual keys. Some of the STATS commands (like SIZES) will actually lock up the entire cache while they scan the items, leaving your memcached instance unusable for several seconds. With sFlow memcached we get non-invasive, granular instrumentation of our memcached instances, giving us the ability to see things that previously could only be seen by loading a kernel module such as Gear6 Advanced Reporter. sFlow memcached is a simple open source patch to the memcached source (hopefully to be included soon), with no kernel recompilation required.

As an example, here we can see the number of missed key operations per minute (truncated for this example) across all of our memcached hosts.

While we can also get things like hot keys, seeing where most of the misses occur can help us find even simple problems that have a major impact, such as a misspelled key name in our PHP code.
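
Ranking missed keys from sampled operations might look like the sketch below. It assumes each sampled operation has already been parsed into a dict with hypothetical `key` and `status` fields, and it scales the counts by the 1-in-N sampling rate to estimate totals; the field names, status values, and key names are illustrative, not the actual sFlow memcached record layout.

```python
from collections import Counter


def top_missed_keys(samples, sampling_rate, n=5):
    """Estimate the most frequently missed keys from sampled operations.

    samples       -- iterable of dicts with 'key' and 'status' (assumed fields)
    sampling_rate -- 1-in-N sampling rate; counts are multiplied by N
    """
    misses = Counter(s["key"] for s in samples if s["status"] == "miss")
    return [(key, count * sampling_rate) for key, count in misses.most_common(n)]


# Made-up samples: 'user:profle:42' is a misspelling that can never hit.
samples = [
    {"key": "user:profle:42", "status": "miss"},
    {"key": "user:profile:42", "status": "hit"},
    {"key": "user:profle:42", "status": "miss"},
    {"key": "pets:top", "status": "miss"},
    {"key": "user:profle:42", "status": "miss"},
    {"key": "pets:top", "status": "hit"},
]

for key, estimate in top_missed_keys(samples, sampling_rate=100):
    print(f"{key}\t~{estimate} misses")
```

A key that only ever appears with misses, as the misspelled one does here, stands out immediately at the top of such a list.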

We also have the regular metrics like hit rates that you would get from the STATS command, without having to telnet to the memcached instance every 15 seconds.

Memcached cache efficiency
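
Cache efficiency over an interval follows from the change in the hit and miss counters between two snapshots: hits divided by hits plus misses. A minimal sketch, assuming the counters are named `get_hits` and `get_misses` as in the memcached STATS output, and with made-up values:

```python
def hit_ratio(prev, curr):
    """Hit ratio for the interval between two cumulative counter snapshots.

    Returns None when no gets occurred in the interval.
    """
    hits = curr["get_hits"] - prev["get_hits"]
    misses = curr["get_misses"] - prev["get_misses"]
    total = hits + misses
    return hits / total if total > 0 else None


prev = {"get_hits": 1000, "get_misses": 200}
curr = {"get_hits": 1900, "get_misses": 300}
print(hit_ratio(prev, curr))  # 0.9
```

Using deltas rather than the raw cumulative counters keeps a long-running instance's history from masking a recent drop in efficiency.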

Other Metrics

Host sFlow gives us the ability to look into metrics for applications like Memcached, Apache HTTPd, and Java. It also gives us the same metrics that we would get with Ganglia’s gmond, without needing the gmond daemon and, as a benefit, with a much more efficient network footprint. During the coming months we plan on testing new sFlow-instrumented applications like Apache Tomcat and Node.js. These would be great examples to display at a conference in June as well.

Dave Mangot

Salesforce.com

Dave has over 15 years of experience in the field of systems administration. He is a senior systems engineer at Tagged Inc., responsible for monitoring and metrics on the Tagged server farm. Dave developed his interest in metrics while working at various ISPs over the years, and an appreciation for doing it at scale while leading the sysadmin team for the global CDN at Cable and Wireless and working at Terracotta and Tagged. Most recently he has teamed up with Peter Phaal to validate and enhance the sFlow approach to application monitoring for a variety of applications.

Peter Phaal

InMon Corp.

Peter is the original inventor and a co-author of the sFlow standard. Peter is President and a Founder of InMon Corp., a leading provider of performance analysis software based on sFlow. Before InMon, Peter worked at Hewlett-Packard Laboratories, where he created Hewlett-Packard’s Extended RMON technology. His book, LAN Traffic Management, describes the techniques of monitoring and managing traffic on local area networks. Peter has over 20 years of experience in the field of network performance monitoring.

Comments

Sophia DeMartini
06/28/2012 3:09pm PDT

The slides are now available in PDF format under the ‘Presentation’ link above.

Peter Phaal
06/28/2012 3:04pm PDT

The slides still aren’t on the Velocity site, so I also posted them on SlideShare.

Dave Mangot
06/27/2012 5:02pm PDT

We submitted them to the O’Reilly folks. They will post the slides. Cheers.

Ryan Schwartz
06/27/2012 1:42pm PDT

Dave, would you be willing to share your slide deck for this preso?
