Nov 3, 2016

Apstra Defines the “Self-Operating Network™”

There has been a lot of well justified press and discussion lately about the rapidly advancing concepts and delivery of self-driving cars. This week, somewhat prophetically, a delivery was made by the first self-driving truck — and it, of course, was a beer delivery.

The analysts and pundits are all pretty excited about the possibilities self-driving cars bring to market: fuel economy, shorter delivery times for goods, less traffic, more time back for commuters, and, not to mention, you may never again have to drive your son or daughter to that 6am Saturday swim practice! The argument made most often is safety: reducing the crashes and fatalities caused by human error, distracted driving, and the like. By some estimates, this could save over 20,000 lives annually and return up to $400B to the economy. Not a bad deal at all.

I was discussing this concept at the Open Networking User Group (ONUG) in New York this week with Doug Gourlay, and we kept coming back to the idea of taking the operational control of a network to higher levels of autonomy by applying the concepts being learned in the automotive industry to the networking world. We came up with the concept of the ‘self-operating network,’ really a pragmatic evolution in SDN thinking and focus.

At Apstra we have been focusing on delivering a Self-Operating Network. A Self-Operating Network is a distributed connectivity system that is capable of sensing its environment and dynamically configuring itself to maintain connectivity and ongoing business operations without human input. We envision a world where networks really can configure themselves, fix themselves, and defend themselves.

A self-driving car detects its surroundings using radar, GPS, and synthetic vision. Similarly, an autonomous network can apply multiple telemetry systems and capabilities to determine nodal availability, capacity, usage, path availability, congestion, packet loss, misconfiguration, and even cybersecurity compromise or malicious activity. For automobiles, this concept of using multiple co-witnessed systems to establish closed-loop telemetry is critical to operating in adverse weather conditions.

The same will be required for networks, not just when they are stable but also when they are changing and under duress.
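To make the co-witnessing idea concrete, here is a minimal Python sketch, assuming each telemetry source independently reports a state for the same link. The source names (`lldp`, `interface_counters`, `snmp`) and the two-witness quorum are illustrative assumptions, not any product's actual logic: a state is accepted only when enough independent sources agree on it, and anything weaker or contradictory is treated as unknown rather than trusted.

```python
from collections import Counter


def corroborated_state(observations: dict[str, str], quorum: int = 2) -> str:
    """Return the state reported by at least `quorum` independent sources.

    `observations` maps a (hypothetical) telemetry source name to the state it
    reports for one link. If no state reaches the quorum, or two different
    states both reach it, the evidence is inconclusive and "unknown" is returned.
    """
    counts = Counter(observations.values())
    confirmed = [state for state, votes in counts.items() if votes >= quorum]
    return confirmed[0] if len(confirmed) == 1 else "unknown"


# Two independent witnesses agree the link is down; SNMP alone disagrees.
print(corroborated_state({"lldp": "down", "interface_counters": "down", "snmp": "up"}))
```

The design choice is deliberately conservative: a single source is never enough to act on, which is the property that matters when a network (or a car in fog) is operating under degraded conditions.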

The automotive industry’s evolution towards self-driving cars happened in phases, through a process of continuous improvement. Initially, the user experience was entirely manual. Then came the automatic transmission. Then came basic cruise control: it maintained a set speed, but because it measured only speed to close the loop, it could not detect that traffic was slowing. Then came laser-assisted cruise control, which detects the distance to objects in front of the vehicle and adjusts speed accordingly. Then came lane tracking, which senses the lane the vehicle is driving in and adjusts steering so that the vehicle stays within it. Along the way, anti-lock brakes automated the process of applying the brakes in an emergency, while traction control ensured wheels didn’t spin out of control under acceleration. Ultimately, the industry is delivering self-driving cars, the culmination of all those innovations.

In 2013, the US National Highway Traffic Safety Administration (NHTSA) released a framework for categorizing and communicating the level of autonomy that a particular vehicle achieves. It serves both as a guideline for focused evolution and as a potential regulatory framework, giving adjacent industries such as insurance and law enforcement guidance on how to deal with this disruptive technology.

As in the automotive industry, the evolution towards a self-operating network will happen in phases, through a process of continuous improvement. A comparable framework can be established to clarify what a Self-Operating Network can really do:


Phase 0 – Manual

Network operations are truly manual: box by box, CLI driven. Changes are also made manually, by logging into every box and issuing CLI commands. Telemetry is non-existent. Debugging is done manually with basic tools such as traceroute and ping; operators must log in to various devices and gather and correlate information by hand. This phase has limited capacity for multi-vendor networks. Most networks in the 1980s and 90s were operated in this fashion.

Phase 1 – Basic

Basic telemetry is introduced, often using the traditional Simple Network Management Protocol (SNMP) and Syslog. Configuration generation is still done box by box and, in some cases, is semi-automated using basic tools such as Perl and Expect scripts.

Phase 2 – Open Loop

Open-loop configuration management is introduced, using self-built tools that automate the generation of configuration at scale, without operational validation. Separate open-loop telemetry systems gather raw telemetry and present it to the end user in various forms. In some cases, a central repository is introduced for configuration and telemetry collection, and changes are managed through this central repository. Some implementations provide a well-designed structure for central control that offers some basic modeling capabilities, which help support multi-vendor configuration management.
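As an illustration of what "open loop" means here, the following is a minimal Python sketch of template-driven configuration generation. The inventory fields, hostnames, and the CLI-like config syntax are hypothetical and vendor-neutral by assumption; the point is that the function renders configurations at scale but never checks whether they were applied or had the intended effect.

```python
from string import Template

# Hypothetical inventory; field names and addressing are illustrative only.
INVENTORY = [
    {"hostname": "leaf1", "loopback": "10.0.0.1", "asn": "65001"},
    {"hostname": "leaf2", "loopback": "10.0.0.2", "asn": "65002"},
]

# Hypothetical, vendor-neutral config syntax used purely for illustration.
CONFIG_TEMPLATE = Template(
    "hostname $hostname\n"
    "interface loopback0\n"
    "  ip address $loopback/32\n"
    "router bgp $asn\n"
)


def generate_configs(inventory):
    """Render one configuration per device.

    This is the open-loop part: nothing here (or downstream) verifies that
    the rendered configuration was applied or that the network behaves as
    intended afterwards.
    """
    return {dev["hostname"]: CONFIG_TEMPLATE.substitute(dev) for dev in inventory}


for name, config in generate_configs(INVENTORY).items():
    print(f"--- {name} ---\n{config}")
```

Any telemetry in this phase lives in a separate system, so a mistake in the template or the inventory is only discovered when someone happens to notice it.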

Phase 3 – Intent-Driven

The critical component of Phase 3 systems is intent-driven closed-loop automation, in which intent represents configurations in a form that can be correlated with telemetry, and the system’s desired state is continuously compared to its actual state to ensure that intent is being satisfied. All the components of a closed-loop system are implemented and integrated: an intent/policy layer, resource management, device lifecycle management, high-resolution collection of pertinent telemetry, real-time continuous analytics of telemetry to flag anomalies in closed-loop fashion, and the ability to identify the root cause and blast radius of anomalies. The intent layer is the foundational component of such a closed-loop system and is what distinguishes it at its core from a Phase 2 open-loop system. The intent layer also enables a natural vendor-agnostic approach. Intent-driven systems are programmable and extensible, so they can be customized to various customer environments. State is stored in a distributed data store and can be streamed to third-party systems.
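The heart of the closed loop, comparing desired state with actual state, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Apstra's implementation: intent is modeled as a hypothetical mapping from network elements to their expected state, and telemetry as a mapping from the same elements to their observed state. Any mismatch, including an element that reports nothing at all, is flagged as an anomaly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    element: str
    expected: str
    actual: str


def check_intent(intent: dict[str, str], telemetry: dict[str, str]) -> list[Anomaly]:
    """Compare desired state (intent) with actual state (telemetry).

    Element names and state values are hypothetical. Elements that appear in
    the intent but are absent from telemetry are reported with actual="missing".
    """
    anomalies = []
    for element, expected in intent.items():
        actual = telemetry.get(element, "missing")
        if actual != expected:
            anomalies.append(Anomaly(element, expected, actual))
    return anomalies


intent = {"leaf1:bgp->spine1": "established", "leaf1:eth1": "up", "leaf2:eth1": "up"}
telemetry = {"leaf1:bgp->spine1": "idle", "leaf1:eth1": "up"}
for anomaly in check_intent(intent, telemetry):
    print(anomaly)
```

Because intent is expressed in the same vocabulary as the telemetry, the comparison is a straightforward diff; in a continuously running system this check would be re-evaluated on every telemetry update rather than called once.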

Phase 4 – Self-Operating

Phase 4 systems include self-diagnosing and self-remediating capabilities: they figure out what is wrong and either fix the problem automatically or ask the operator to apply a specific fix. With these capabilities, Phase 4 networks become self-operating.
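The decision described above, fix automatically when the cause is understood and otherwise escalate with a specific request, might be sketched as follows. Everything here is hypothetical: the playbook contents, the `bounce_bgp_session` helper, and the action strings are placeholders rather than real device commands or any product's API, and real diagnosis would draw on root-cause analysis rather than a single lookup.

```python
from typing import Callable


def bounce_bgp_session(element: str) -> str:
    # Hypothetical placeholder: a real system would act on the device here.
    return f"reset BGP session {element}"


# Hypothetical playbook mapping a diagnosed (expected, actual) mismatch to a fix.
PLAYBOOK: dict[tuple[str, str], Callable[[str], str]] = {
    ("established", "idle"): bounce_bgp_session,
}


def remediate(element: str, expected: str, actual: str) -> str:
    """Apply a known fix if one exists; otherwise escalate to the operator
    with a specific description of what is wrong."""
    action = PLAYBOOK.get((expected, actual))
    if action is None:
        return (f"ESCALATE: {element} expected {expected!r}, observed {actual!r}; "
                "operator fix needed")
    return action(element)


print(remediate("leaf1->spine1", "established", "idle"))
print(remediate("leaf2:eth1", "up", "down"))
```

Keeping the playbook explicit means the system only acts on its own when a fix is known to be safe, and everything else becomes a precise, actionable request to a human.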

Today, most enterprise networks are at “Basic” (Phase 1), while most advanced carriers and webscale/cloud providers are moving aggressively towards an “Open Loop” (Phase 2) deployment. Some small and medium businesses are still at “Manual” (Phase 0). What we are delivering at Apstra today is an Intent-Driven system that enables full Phase 3 functionality for specific use cases (Leaf-Spine Networks and Container Networking), and our roadmap is accelerating our customers towards “Intent-Driven” (Phase 3) functionality across a large number of data center use cases. Ultimately, our goal is to deliver Phase 4 Self-Operating Networks.

What can this mean for businesses in the network context? As in the automobile industry, we can envision a world with far fewer outages, less downtime, and more responsive infrastructure that automatically responds to sensed changes in the upper-layer application environments.

It’s a journey, and not an overnight destination, but we look forward to sharing our journey with you.

Mansour Karam

President, Founder