Nov 24, 2016

Data Center Network Monitoring: Moving Beyond Box-by-Box and Packet-by-Packet

There is a big difference between the network operations center tools used to monitor a single box and those needed to monitor a complex system.

Boxes are relatively easy to instrument using simple telemetry: logs and streaming metrics. The telemetry is monitored, problems are detected, and engineers then fix them.

Applications and other complex systems, by contrast, are difficult to monitor. They require more than the observation of logs, metrics, and events: their complex logic must itself be instrumented. Application operators must be concerned with the intent of the application developer. Monitoring involves the continuous comparison of what actually happens in the system to what the developer intended to happen.
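The comparison of actual behavior to intended behavior can be sketched in a few lines of Python. This is a minimal illustration, not any product's implementation: the `Deviation` type, the `compare_to_intent` function, and the `checkout.*` keys are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deviation:
    """One place where the observed state differs from the intended state."""
    key: str
    intended: object
    observed: object


def compare_to_intent(intended: dict, observed: dict) -> list:
    """Return a Deviation for every intended key whose observed value differs."""
    deviations = []
    for key, want in intended.items():
        got = observed.get(key)  # None means the system reported nothing for this key
        if got != want:
            deviations.append(Deviation(key, want, got))
    return deviations


# Hypothetical intent declared by the developer, and what the system reports.
intended = {"checkout.replicas": 3, "checkout.tls_enabled": True}
observed = {"checkout.replicas": 2, "checkout.tls_enabled": True}

deviations = compare_to_intent(intended, observed)
for d in deviations:
    print(f"{d.key}: intended {d.intended}, observed {d.observed}")
```

In practice such a check would run continuously against fresh observations rather than once against a static dictionary.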

Advanced analytics combines intent-driven monitoring of applications with metrics from compute, storage, and network boxes and with observed events (such as a dropped user session). By correlating these sources, engineers may find the needle in the haystack: the root cause underlying the observed event.
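One simple way to picture this correlation is a time-window filter: given the time of an observed event, list the metrics that crossed a threshold shortly before it. The sketch below assumes that approach purely for illustration. Real analytics would be far richer, and the `Sample` type, the `suspects_for_event` function, and the device and metric names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """A single metric reading from a compute, storage, or network box."""
    timestamp: float  # seconds since an arbitrary start
    source: str
    metric: str
    value: float
    threshold: float


def suspects_for_event(event_time, samples, window_s=30.0):
    """Return samples that exceeded their threshold within window_s seconds before the event."""
    return [
        s for s in samples
        if 0 <= event_time - s.timestamp <= window_s and s.value > s.threshold
    ]


# Hypothetical readings surrounding a dropped user session at t=100 seconds.
samples = [
    Sample(20.0, "spine1", "packet_drops", 500.0, 10.0),      # anomalous, but too old
    Sample(90.0, "leaf2", "interface_errors", 120.0, 10.0),   # anomalous and recent
    Sample(95.0, "server7", "cpu_percent", 40.0, 90.0),       # recent, but normal
]

suspects = suspects_for_event(100.0, samples)
for s in suspects:
    print(f"{s.source} {s.metric}={s.value} at t={s.timestamp}")
```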

Monitoring networks is hard

Things go awry when working with networks. A network equipment designer develops metrics that give a network engineer insight into the behavior of a specific switch or router. However, this box-by-box insight may be insufficient to determine the root cause of complex yet very common problems. For example, networking metrics might convey whether an instance of BGP is operating as intended on a router. They may not, however, reveal a misconfiguration that spans multiple systems across a network.
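To make the distinction concrete, the following sketch checks BGP peering configuration across devices rather than on each device alone. Each entry in `configs` is well-formed in isolation. The error only appears when one device's view of its neighbor is compared with the neighbor's own configuration. The data layout, the device names, the AS numbers, and the `find_peering_errors` function are assumptions made for this example, not a vendor format or API.

```python
# Hypothetical per-device view: local ASN plus the remote ASN expected for each neighbor.
configs = {
    "spine1": {"asn": 65000, "neighbors": {"leaf1": 65001, "leaf2": 65002}},
    "leaf1": {"asn": 65001, "neighbors": {"spine1": 65000}},
    "leaf2": {"asn": 65012, "neighbors": {"spine1": 65000}},  # local ASN mistyped
}


def find_peering_errors(configs):
    """Check every configured peering against the configuration of the device on the other end."""
    errors = []
    for device, cfg in sorted(configs.items()):
        for peer, expected_asn in sorted(cfg["neighbors"].items()):
            peer_cfg = configs.get(peer)
            if peer_cfg is None:
                errors.append(f"{device}: neighbor {peer} is not a known device")
                continue
            if peer_cfg["asn"] != expected_asn:
                errors.append(
                    f"{device}: expects {peer} in AS {expected_asn}, "
                    f"but {peer} is configured as AS {peer_cfg['asn']}"
                )
            if device not in peer_cfg["neighbors"]:
                errors.append(f"{device}: peers with {peer}, but {peer} has no peering back")
    return errors


errors = find_peering_errors(configs)
for line in errors:
    print(line)
```

A check like this needs a system-wide view of the network, which is exactly what per-box metrics do not provide.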

Network design and configuration are as complex as those of an application, yet networks are still monitored with much poorer instrumentation at the system level. As networks are deployed and changed, instrumentation must be developed and changed along with the evolving configuration. A network architect has traditionally had no way to define custom metrics and expose deviations from intent to a network engineer over the network’s lifecycle. Network switches and routers are well-instrumented as boxes, but networks as a system are rarely instrumented at all.

Adding network instrumentation to monitor intent of the architect

Sophisticated Linux-based network operating systems (from Cisco, Arista, Cumulus, Juniper, Mellanox, etc.) may be remotely configured via an API to instrument behavior at any time. Using this push-based instrumentation, network engineers and architects can design the metrics that reflect their operational intent each time a network is deployed or modified. Apstra is delivering such a capability in Apstra Operating System (AOS®), a multi-vendor distributed operating system for the network.

Dave Butler

Vice President, Business Development