A Wholistic Approach to Network Fault and Performance Management

*Disclaimer: I work for a company in the big data space.  The comments that follow are my own personal views and do not represent my employer or anyone else.

I always do my best to leave product names and company names out of my posts.  First, because I don’t want to misrepresent them, and second, because I don’t believe it is critically important which tools are used as long as the proper procedures and policies are in place.  There are definitely players within each area of IT monitoring that outperform their counterparts, but I will leave it to you to determine how you will go about choosing your tools.  What I do want to focus on in this post is my ideology of network monitoring.  When I use the term network monitoring, I am referring to monitoring of IT assets, network infrastructure, and all systems within the network.  With all of the assets the average organization is responsible for, it makes sense that we would want to choose the best processes and policies for monitoring.  An outage can be expensive, mission-ending, and career-ending.  So, what is the proper approach to monitoring for fault and performance?

Let me start off by saying that this article is not focused on security.  However, security in depth might very well follow some of these same ideas, with additional focus on monitoring for advanced persistent threats.  Outside of security, what is it that we need to monitor?  That is the real question, and it differs by organization and by organizational maturity.  For instance, an organization with less skill or funding might be more focused on individual hardware failures.  This might simply be because they have no real redundancy built into their network.  So, the failure of a single device could be the level of granularity they have to begin with.  For a more mature organization, it would behoove the operators to focus more on the services being offered than on the underlying components, or elements.  This is partly because the focus should be on what your customer sees, and the services are closer to the customer.  It is also partly because a mature organization should have more redundancy, meaning that the failure of an element should not create a disruption of service.
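To make the element-versus-service distinction concrete, here is a minimal Python sketch.  It assumes a hypothetical service that stays available as long as a minimum number of its redundant elements are healthy.  The names (Service, min_healthy, service_status) and the three status labels are illustrative assumptions, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    elements: dict    # element name -> True if the element is healthy
    min_healthy: int  # hypothetical: elements required to keep the service up


def service_status(svc: Service) -> str:
    """Roll element health up into a single service-level status."""
    healthy = sum(1 for ok in svc.elements.values() if ok)
    if healthy == len(svc.elements):
        return "ok"
    if healthy >= svc.min_healthy:
        # An element has failed, but redundancy is absorbing the loss.
        return "degraded"
    return "down"


# A redundant firewall pair: losing one element degrades the service
# without taking it down.
edge = Service("edge-firewall", {"fw-a": True, "fw-b": False}, min_healthy=1)
print(service_status(edge))  # prints "degraded"
```

With no redundancy (a single element and min_healthy set to 1), the same function can only ever report "ok" or "down", which is why a less mature network ends up watching individual devices.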

There are many avenues for gathering the health and performance of both elements and services.  The best options allow for a bubble-up approach, where the information is available from the lowest level up to a more concise view of services.  I have often heard people say that SNMP, or Simple Network Management Protocol, is the best method for monitoring.  I actually used to agree that SNMP polling, along with unidirectional SNMP traps, was enough and was the best way to get the information needed.  I have since grown in experience, and time has shown me the value in other things, such as logs.  System logs provide an invaluable source of information.  One key thing to remember here is that not all information about a piece of hardware, or the traffic it is generating or passing, is available in each method of output.  For instance, there are some things we can learn only from the logs because they are not passed through SNMP polls or traps.  This is where the wholistic approach comes into play.  The best strategy makes use of all of the available output options, reducing redundancy when possible.  This can be done by filtering logs or disabling certain SNMP information.
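As a rough illustration of combining sources, the Python sketch below groups events by device and condition and records which sources reported each one.  It assumes the events have already been normalized into a common shape.  The Event fields, the source labels, and the condition names are hypothetical and do not come from any specific product or protocol definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str     # hypothetical labels: "snmp_poll", "snmp_trap", "syslog"
    device: str
    condition: str  # normalized condition name, e.g. "link_down"


def merge_events(events):
    """Group events by (device, condition) and collect the reporting sources."""
    merged = {}
    for ev in events:
        merged.setdefault((ev.device, ev.condition), set()).add(ev.source)
    return merged


events = [
    Event("snmp_trap", "sw1", "link_down"),
    Event("syslog", "sw1", "link_down"),
    Event("syslog", "sw1", "config_changed"),
]

for (device, condition), sources in merge_events(events).items():
    print(device, condition, sorted(sources))
```

A condition reported by more than one source is a candidate for filtering at one of them.  A condition reported by only one source, like the configuration change that appears only in the logs here, is the kind of information that would be lost by relying on a single method.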

Some would question whether or not it is healthy for the network, or for the performance of the devices, to enable multiple forms of output.  There certainly is reason to be cautious.  The documentation for a lot of systems details the percentage of CPU and memory utilized by various services, such as logging and polling.  This should be reviewed prior to making changes.  My main goal in this post is to show that there are many options for monitoring and that we should not quickly dismiss those we are unfamiliar with or uneducated on.  Another benefit of using many sources of data is that sometimes our monitoring tools fail.  What if we lose our SNMP visibility, but we can still receive logs?  Just like we build redundancy into our systems, we should also build redundancy into our monitoring processes.

I look forward to your opinions.  Do you think there is a particular source of monitoring data that is useless, such as logs, SNMP, NetFlow, etc.?  Please share your thoughts with our readers.