Nov 30, 2017

Mitigate Data Center Network Outages Caused by Human Error

80% of network outages are caused by device-level changes made through the CLI.

A recurring theme we hear from our customers is how they want to reduce the kind of data center network errors that come with manually configuring their networks using the device CLI.  From one device to the next, every command entered represents a potential outage.  Indeed, upwards of 80% of network outages are caused by changes made at the device level via the CLI.  The Apstra Operating System (AOS®) can help reduce change-related network outages by semantically validating changes before they are committed to the network.

A Confession

Like many network engineers, I have been the cause of large-scale data center outages.  Even network outages that made the newspaper.  If there’s one thing I have learned over the years, it’s that data center networks are fragile.  Worse, the devices that we use to build networks are also fragile.  It’s no secret that the network operating systems (NOS) that drive these devices are usually riddled with bugs.  If you want to see a network engineer get sweaty, just tell him or her to upgrade the NOS in the network.  Finding a stable NOS that supports your needs can be a stressful, outage-inducing event.

Bugs aside, as network engineers we have to keep an awful lot of stuff in our heads in order to successfully execute a change.  We have mental models of how the network is now, how the network will be, and any intermediate phases in between.  If our models are flawed in any way, or we just make a typo, we could cause an outage.

Personally, I’ve gotten to the point where there are only two possible outcomes when I press the enter key:  (1) The network works as expected or (2) The network explodes.  Even though I know there are states in between, I have been well seasoned to expect the worst as my right pinky finger descends upon the enter key.  It’s almost as if it has an evil mind all its own, just waiting for the right time to strike.

AOS to the rescue!

AOS is an intent-based networking operating system. With AOS, your network is modeled from a reference design.  This design has rules against which changes in the network can be validated before they are committed.
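To make the idea of validating a change against design rules concrete, here is a minimal sketch in Python. It is not AOS code and does not use any Apstra API. The `Link` type, the `validate` and `commit` functions, and the single rule (every leaf must have an uplink to every spine) are all hypothetical, chosen only to illustrate rejecting a bad change before it reaches the network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A hypothetical leaf-to-spine connection in the intended design."""
    leaf: str
    spine: str


def validate(links, leaves, spines):
    """Return a list of rule violations for a proposed topology.

    Illustrative rule only: every leaf must connect to every spine.
    """
    errors = []
    for leaf in sorted(leaves):
        connected = {link.spine for link in links if link.leaf == leaf}
        missing = sorted(set(spines) - connected)
        if missing:
            errors.append(f"{leaf} is missing uplinks to: {', '.join(missing)}")
    return errors


def commit(current, proposed, leaves, spines):
    """Apply the proposed design only if it passes validation.

    Returns the design that is in effect afterwards, plus any violations.
    """
    errors = validate(proposed, leaves, spines)
    if errors:
        # Reject the change: the current design stays in place.
        return current, errors
    return proposed, []
```

In this sketch, a change that removes one leaf's uplink is caught by `validate`, so `commit` returns the unchanged design along with a readable list of violations instead of pushing the change to devices.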

The reference design requires specific features from the various NOSes that run on network devices.  During development, AOS is tested against various NOS releases to validate that the network will remain stable and operate as expected across the engineering lifecycle.

Lastly, AOS supports Role-Based Access Control (RBAC) for the various activities it supports. From network design to operations, you can control the way users interact with AOS.  Contrast this with the blanket “enable” or “configure” mode that most organizations default to on the device CLI, giving unlimited access to the engineers making changes to those devices.
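The contrast with a blanket “enable” mode can be sketched in a few lines of Python. This is only an illustration of the RBAC idea, not how AOS implements it. The role names, the action names, and the `is_allowed` function are assumptions made up for the example.

```python
# Hypothetical roles mapped to the actions each one may perform.
ROLE_PERMISSIONS = {
    "designer": {"design.view", "design.edit"},
    "operator": {"design.view", "change.stage"},
    "approver": {"design.view", "change.stage", "change.commit"},
}


def is_allowed(role, action):
    """Return True only if the role explicitly grants the action.

    Unknown roles get no permissions at all (deny by default).
    """
    return action in ROLE_PERMISSIONS.get(role, set())
```

Here an operator can stage a change but cannot commit it, whereas device-level “configure” access would let the same person do both.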

Take the Next Step

There’s no question that the future of network implementation and operations will gravitate away from device-level CLI and the risk that comes with it.  Engineers will interface with the network at a higher, more intuitive level to achieve the outcomes that match their intent.  Apstra is leading the way on this journey.