Feb 3, 2017

Claw Back the Cost of Network Failure

The network is the underlying foundation of the data center. If that foundation becomes unstable, everything else, apps and all, is affected. The bummer is that most large data center networks are, in fact, unstable.

Complex networks experience rapid entropy and require constant human care. This entropy manifests itself as a lack of network agility and poor network availability. Whatever the reasons networks are not working as desired, the effects are the same: apps and services quickly break, impacting both operations and customers.

Of course, CIOs and IT departments would reduce network entropy and instability if they could. However, they can’t, and here’s why.

The data center network is always in some state of failure

Large data center network operators report a near-continuous state of network failure. These may be “gray” network failures within redundant infrastructure, so that some services continue. However, they are failures, they are expensive, they are impactful, and they are routinely accepted to a degree that would not be tolerated elsewhere. With gray failures this common, data center networks generally seem to work, but they do not seem to work very well, as Doug Gourlay highlights in the blog post “SDN: What it Should Have Been”.

Networks begin their operational lives in a prolonged failure state

When a network is still a pile of non-provisioned parts, it is not operating. And while we cannot expect a network to operate immediately upon purchase, there is a real opportunity cost paid each week it takes to spin it up. The onerous provisioning process and near continual reconfigurations that networks uniquely experience are failure states. We’ve just grown used to them and pretend this is normal.

Planned management operations often instigate unplanned network failures

Management operations that can result in network failures include altering data center interconnect, upgrading control planes or switch configuration, and draining and undraining workloads. Such operations can unexpectedly impair the network for hours or days. These tasks occur continuously within large data centers and are often accompanied by network failure.

Poor risk assessment and failure planning are important causes of failure

Organizations should project and test the blast radius of network failures associated with all network components. In practice, a failure’s predicted blast radius is often significantly smaller than its actual blast radius, resulting in unexpected outages or inadequate network capability that adversely affects the business.

After each network failure, degraded operations often leave less network capacity than the applications require. Such inadequate capacity or capability is itself yet another failure. This phenomenon is recursive and its effects are compounding.

The high mean-time-to-insight from network failure

A major indirect cost of network failure is the lengthy, arduous, manual effort required to root-cause mysterious problems (e.g. green light failures). Box-by-box network sleuthing with high-value engineers is hard enough, but operators may not even know that a problem originated from a network failure, so they may not know enough to unleash the engineers.

For example, application performance monitoring systems easily detect simple user experience problems. However, finding their root cause is generally time-consuming and difficult. Investigators may resort to creating hand-crafted map-reduce operations against various logs, metrics and events to gain insight. Eventually the underlying problem, the needle in the haystack, may emerge and point to some non-obvious network failure. The mean-time-to-insight associated with data center network failures is therefore very high.
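To make the idea of a hand-crafted map-reduce over logs concrete, here is a minimal sketch. The log layout ("timestamp device severity message"), the device names, and the function names are all hypothetical, chosen only for illustration; real investigations span many log formats and far larger volumes.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines in an assumed "timestamp device severity message" layout.
LOG_LINES = [
    "2017-02-03T10:00:01 leaf-1 ERROR link flap on port 12",
    "2017-02-03T10:00:02 spine-2 INFO bgp session established",
    "2017-02-03T10:00:05 leaf-1 ERROR link flap on port 12",
    "2017-02-03T10:00:09 leaf-3 WARN high buffer utilization",
    "2017-02-03T10:00:11 spine-2 ERROR bgp session reset",
]

def map_line(line):
    """Map step: emit a one-entry count keyed by device for ERROR lines, else nothing."""
    _timestamp, device, severity, _message = line.split(" ", 3)
    return Counter({device: 1}) if severity == "ERROR" else Counter()

def reduce_counts(left, right):
    """Reduce step: merge two partial per-device error counts."""
    return left + right

errors_by_device = reduce(reduce_counts, map(map_line, LOG_LINES), Counter())
print(errors_by_device.most_common())  # devices ranked by error count
```

Even in this toy form, someone has to know the log layout and decide which signal to count before any insight appears, and that manual effort is what drives mean-time-to-insight up.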

The blast radius of network failure

The direct blast radius of a failure refers to the extent of instability, unexpected behavior, or non-responsiveness of network components. The downstream impact of network failure to applications, departments, customers and the business forms a far more consequential indirect blast radius.

Recouping the cost of failure

The direct monetary cost of network failure, as we have described it, can be precisely known across the market. It is the cost of managing network entropy and the near-continuous state of failure; that is, it’s simply the cost of operating the network (opex). Opex is the cost of the direct blast radius of non-agile, non-available, manually operated networks. The cost of the indirect blast radius is the opportunity cost of lost or delayed business.

We can estimate opex for the entire market from Cisco’s often repeated assertion that 27% of network lifecycle costs are attributable to equipment costs (capex). If this is correct, the $11B enterprises spent on data center network products in 2016 implies a total lifecycle cost of about $41B, of which roughly $30B (the remaining 73%) goes to getting these networks to work (from some state of failure) and keeping them working. At Apstra, we want to annihilate that operating cost.
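As a rough check on these figures, the sketch below works through the arithmetic. It assumes the 27% equipment share applies directly to the $11B product spend, so that dividing one by the other gives the total lifecycle cost; that reading, and the function and variable names, are assumptions made for illustration.

```python
# Figures quoted in the text; how they combine is an assumption of this sketch.
CAPEX_SHARE = 0.27          # equipment share of network lifecycle cost
EQUIPMENT_SPEND_B = 11.0    # 2016 enterprise spend on data center network products, in $B

def lifecycle_split(equipment_spend_b, capex_share):
    """Return (total lifecycle cost, non-equipment cost) in $B."""
    total = equipment_spend_b / capex_share
    return total, total - equipment_spend_b

total_b, opex_b = lifecycle_split(EQUIPMENT_SPEND_B, CAPEX_SHARE)
print(f"total lifecycle cost: ${total_b:.1f}B")       # $40.7B
print(f"non-equipment (opex) cost: ${opex_b:.1f}B")   # $29.7B
```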

Addressing this $41B problem associated with manually operated networks is Apstra’s raison d’être. Our solution is to create vendor-agnostic, self-operating networks: networks that configure themselves and fix themselves. How it works is the subject of another blog post.

Dave Butler

Vice President, Business Development