Dec 16, 2016

On Network Blindness

[Image: pile_of_legos.jpg]

Apstra recently had the privilege of participating in the Networking Field Day (NFD) #13 event. This is a unique event in our industry, and one that we here at Apstra had been very much looking forward to. It’s only at NFD that network hardware and software companies have an opportunity to sit across the table from real-world network engineers and talk shop. This time around we told a story about network-blindness and how the Apstra Operating System (AOS®) uniquely addresses this problem.

What is the Network?

Apstra makes network automation software. AOS is a distributed OS that is laid over the devices of your network. You interact with AOS, and based on those interactions it drives the devices under its control. When we started building this network automation software, we had to ask ourselves a very fundamental question: What exactly are we automating? What is the network?

Is the network the devices that are chosen to build it? The interconnections? Protocols? Configurations? Encapsulations? All of these things, it turns out, are just building blocks. In the photo above, we see a pile of legos. Imagine someone decided to automate lego building. Without the instructions that came with the kit, what is the end goal of that automation? What would it build? What would it validate the arrangement of the blocks against to know if they were arranged correctly? Without this context, the automation wouldn’t accomplish very much.

In the medical world, there is a neurological disorder called “face-blindness.” A person suffering from this disorder can see all the parts of a face, but their mind cannot compose a face from these pieces. For these people faces do not exist, and therefore they cannot recognize people easily.

Similarly in network automation, without understanding what the intended purpose of the network is, we suffer from “network-blindness.” Without the instructions, we’d only be guessing what the building blocks of the network are collectively intended to do — just as with the legos. The instructions represent the expected state of those building blocks. This is what the network is supposed to be. The network is this expected state. It is the network we intend to build.

Intent-Driven Networking

When we start the process of building a network we gather requirements and ultimately decide on a network topology design. We take into consideration the number and speed of switch ports needed for interconnecting the various network devices. We also consider the number and speed of the ports on the edge of the network, for instance to connect to servers. We have a desired ratio of bandwidth between the edge of the network and the core.
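To make the bandwidth ratio concrete, here is a small arithmetic sketch. The function name and the port counts are made up for illustration and are not taken from AOS; it simply divides the edge-facing bandwidth of one switch by its core-facing bandwidth.

```python
from fractions import Fraction


def oversubscription_ratio(server_ports, server_speed_gbps,
                           uplink_ports, uplink_speed_gbps):
    """Edge-facing bandwidth divided by core-facing bandwidth for one switch."""
    if uplink_ports <= 0 or uplink_speed_gbps <= 0:
        raise ValueError("a switch needs at least one uplink with nonzero speed")
    downstream = server_ports * server_speed_gbps
    upstream = uplink_ports * uplink_speed_gbps
    return Fraction(downstream, upstream)


# Hypothetical edge switch: 48 x 10G server ports and 6 x 40G uplinks.
ratio = oversubscription_ratio(48, 10, 6, 40)
print(f"{ratio.numerator}:{ratio.denominator}")  # prints 2:1
```

A design requirement such as "no more than 3:1 between edge and core" then becomes a simple check against this number.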

From all of this, we might come up with a network that looks something like this:


At this point, we just have a generic network topology populated with generic representations of devices. We aren’t thinking about the implementation details of the design yet. The generic devices, in turn, would have some representation like this:



In most traditional network shops, the network designer would continue from here choosing specific switch models and even going as far as specifying very exact configuration guidelines. The designer, or architect, owns nearly all the choices that go into constructing the network.

This has some serious, even suffocating, effects. First, it becomes very difficult to simply change switch models in a specific portion of our network. For instance, what if you wanted one vendor at the edge of your network, but another in the core? The network designer needs to get involved again: vendors must be evaluated, negotiations must occur, procurement must be sorted out, documentation must be created, and training for all levels of the network team must happen. That’s no small effort, and it disrupts your day-to-day activities.

A second effect is felt when trying to automate the network. With the network architect owning everything from topology to configuration, automation tends to be restricted to that specific topology, those specific switch models, and those specific configurations. If any of those things change, then the logic of the automation must also be changed. It also means that off-the-shelf automation tools must be customized to fit the choices of the network designer, which invariably ends with disappointment for everyone involved.

There are other side effects to consider. Combined with the second effect above, they lead to an important observation: even when you choose a single vendor, coupling switch choice and configuration with network design has serious, negative side effects. It is a major pain point not only for individual network teams, but also for the evolution of network operations in general.

The generic topology and devices above represent the network that we intend to build. The Apstra Operating System was designed to capture this intent, by having the user specify the constraints or requirements of the network and then building a representation of the network much like the one above. From this intent, AOS builds a detailed model of what the expected state of the network should be across all those devices, even including the state of the endpoints on the edge.

Decoupling Switch Choice from Network Design

Having this model of the expected state allows us to avoid the suffocating effects of coupling switch choice and configuration with network design. In AOS, the network designer actually creates a generic topology populated with generic devices and the interconnections between them. This becomes a reusable template that can be chosen at the time a specific portion of the network must be built.

The generic devices that a network designer describes to AOS, and that are later used in designing various topologies, can be mapped to any number of specific switch models. This evaluation and mapping happens independently of the design of the network and of the implementation of any specific network. It is something the network designer does in parallel with, and separately from, the design of network topologies.

AOS can enable the network designer to do this while avoiding the “least common denominator” problem of common device abstractions. The generic devices a designer describes to AOS are just a description of the ports on a device, including their speed and their roles in the network. For instance, in the generic device above, we have “uplink” and “server” as roles. No assumptions are made about what should be configured on the devices, other than the configuration of BGP sessions, LLDP, and a handful of other things. If there are specific features a network designer would like to use, these can be added in a number of ways by the user.
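As a rough illustration of the idea, and not AOS's actual data model, a generic device can be thought of as nothing more than groups of ports with a role and a speed. The class names, the `can_fulfil` helper, and the vendor and model strings below are all hypothetical. The sketch checks whether a concrete switch model has enough ports of each speed to stand in for a generic device.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PortGroup:
    role: str          # e.g. "uplink" or "server"
    count: int
    speed_gbps: int


@dataclass
class GenericDevice:
    name: str
    port_groups: List[PortGroup] = field(default_factory=list)


@dataclass
class SwitchModel:
    vendor: str
    model: str
    ports_by_speed: Dict[int, int] = field(default_factory=dict)  # speed -> count


def can_fulfil(model, device):
    """True if the model has enough ports of every speed the device needs.

    Simplification: ignores breakout cables, port ordering, and features.
    """
    needed = {}
    for group in device.port_groups:
        needed[group.speed_gbps] = needed.get(group.speed_gbps, 0) + group.count
    return all(model.ports_by_speed.get(speed, 0) >= count
               for speed, count in needed.items())


# Hypothetical generic device and two made-up switch models.
leaf = GenericDevice("leaf", [PortGroup("server", 48, 10),
                              PortGroup("uplink", 4, 40)])
model_a = SwitchModel("VendorA", "A-48x10-6x40", {10: 48, 40: 6})
model_b = SwitchModel("VendorB", "B-32x40", {40: 32})

print(can_fulfil(model_a, leaf))  # True
print(can_fulfil(model_b, leaf))  # False: no 10G ports
```

Because the check only looks at ports, speeds, and roles, the same generic device can be matched against models from any vendor without changing the topology that uses it.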

Building an Actual Network

At the time a specific network needs to be built, the network engineer will choose an appropriate topology created by the network designer. They can also choose among the various switch models associated with the generic devices that constitute that topology. These choices form a blueprint for an actual network. From this blueprint, cabling specs and configurations are automatically generated and can be evaluated before deploying the network.
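To give a feel for how a cabling spec can fall out of a design, here is a minimal sketch. It assumes a simple two-tier topology in which every edge ("leaf") switch connects to every core ("spine") switch; the function name and the device and port naming scheme are invented for this example and do not reflect how AOS generates its output.

```python
def cabling_plan(spines, leaves, links_per_pair=1):
    """Return (leaf, leaf_port, spine, spine_port) tuples for a two-tier mesh."""
    plan = []
    for leaf in range(1, leaves + 1):
        leaf_port = 1
        for spine in range(1, spines + 1):
            for k in range(links_per_pair):
                # Each leaf gets its own contiguous block of ports on every spine.
                spine_port = (leaf - 1) * links_per_pair + k + 1
                plan.append((f"leaf{leaf}", f"uplink{leaf_port}",
                             f"spine{spine}", f"port{spine_port}"))
                leaf_port += 1
    return plan


for cable in cabling_plan(spines=2, leaves=3):
    print(cable)
```

Since the plan is derived from the design rather than written by hand, it can be reviewed before anything is racked, and later reused as the reference that the live network is checked against.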

After racking, cabling, and powering the actual network devices, the engineer will use AOS to bring the network up in stages, with AOS automatically validating the outcome along the way.

First, through Apstra’s open source universal ZTP server, the right version of each device’s network operating system is deployed to the appropriate devices. A sophisticated agent is also deployed that allows the devices to join the Apstra Operating System. These devices will then announce themselves to the AOS servers. They are automatically placed in a quarantined state until the engineer acknowledges them, at which point they are officially joined to AOS as an available resource. Now they can be assigned to the blueprint.
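The onboarding steps above amount to a small state machine. The sketch below is only an illustration of that idea: the state names, the `ManagedDevice` class, and its methods are assumptions made for this example, not the AOS API.

```python
from enum import Enum


class DeviceState(Enum):
    QUARANTINED = "quarantined"   # device has announced itself, not yet trusted
    AVAILABLE = "available"       # acknowledged by the engineer
    ASSIGNED = "assigned"         # bound to a blueprint


# Allowed transitions, following the order described in the text.
TRANSITIONS = {
    DeviceState.QUARANTINED: {DeviceState.AVAILABLE},
    DeviceState.AVAILABLE: {DeviceState.ASSIGNED},
    DeviceState.ASSIGNED: set(),
}


class ManagedDevice:
    def __init__(self, serial):
        self.serial = serial
        self.state = DeviceState.QUARANTINED
        self.blueprint = None

    def _move_to(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"{self.serial}: cannot go from {self.state.value} "
                f"to {new_state.value}")
        self.state = new_state

    def acknowledge(self):
        self._move_to(DeviceState.AVAILABLE)

    def assign(self, blueprint):
        self._move_to(DeviceState.ASSIGNED)
        self.blueprint = blueprint


device = ManagedDevice("SN-0001")     # hypothetical serial number
device.acknowledge()
device.assign("example-blueprint")    # hypothetical blueprint name
print(device.state.value)             # prints assigned
```

The point of the quarantine step is visible in the transition table: a device that has merely announced itself cannot be assigned to a blueprint until a person has acknowledged it.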

When they are assigned, AOS then enables LLDP on the expected inter-switch links. When LLDP has converged, AOS then notifies the engineer of any interface or cabling issues. How can a network automation platform “know” that a link between two switches is wrong? This goes back to the model of the expected state of the network that AOS generates from the intent of the network designer. This intent is expressed in the creation of the topology design template and the generic devices used to populate that topology.
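A minimal sketch of that comparison, assuming the expected links come from the design and the observed links come from LLDP neighbor data. The function names and the data layout are invented for illustration; AOS's internals are not described in this post.

```python
def normalize(link):
    """Order the two (device, port) endpoints so A-B and B-A compare equal."""
    end_a, end_b = link
    return tuple(sorted((end_a, end_b)))


def compare_links(expected, observed):
    """Report links that are missing and links that should not exist."""
    exp = {normalize(link) for link in expected}
    obs = {normalize(link) for link in observed}
    return {"missing": sorted(exp - obs), "unexpected": sorted(obs - exp)}


# Expected state, as derived from the design.
expected = [
    (("leaf1", "uplink1"), ("spine1", "port1")),
    (("leaf1", "uplink2"), ("spine2", "port1")),
]
# Observed state: the two uplink cables on leaf1 have been swapped.
observed = [
    (("spine2", "port1"), ("leaf1", "uplink1")),
    (("leaf1", "uplink2"), ("spine1", "port1")),
]

report = compare_links(expected, observed)
for kind, links in report.items():
    for link in links:
        print(kind, link)
```

Without the expected set there is nothing to subtract from: the observed links on their own are just facts about neighbors, which is the network-blindness problem in miniature.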

After the topology is verified, the remainder of the configuration is deployed, including any added customizations. AOS now checks for expected routing adjacencies and for expected routes in every device.

The network is now ready for business.

Situational Awareness

All of this validation of the expected state of the network, from interface statuses to LLDP neighbors to routing adjacencies and route tables, happens on a continuous basis, for as long as the network is in operation and under the control of AOS. AOS also continuously validates that the configuration is as expected, including any customizations created through AOS by the user. If a network engineer or operations person logs into a device and changes the configuration, AOS will alert on this deviation.
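Configuration drift detection can be pictured as a diff between the configuration that was generated and the one currently running on the device. The sketch below uses Python's standard `difflib` module; the `config_drift` function and the sample configuration lines are hypothetical and are not how AOS is implemented.

```python
import difflib
from typing import List


def config_drift(expected: str, running: str) -> List[str]:
    """Return unified-diff lines; an empty list means no drift."""
    diff = difflib.unified_diff(
        expected.splitlines(),
        running.splitlines(),
        fromfile="expected",
        tofile="running",
        lineterm="",
    )
    return list(diff)


# Made-up configuration snippets for illustration only.
expected_cfg = "hostname leaf1\nrouter bgp 65001\n"
running_cfg = "hostname leaf1\nrouter bgp 65002\n"

drift = config_drift(expected_cfg, running_cfg)
if drift:
    print("ALERT: configuration deviates from expected state")
    print("\n".join(drift))
```

Run on a schedule, or whenever a device reports a change, a check like this turns a manual CLI edit into an alert instead of a surprise.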

[Image: network_engineer_thinking.jpg]

Without AOS, a network engineer might spend a lot of time discovering if, and understanding how, the network is not operating as expected. Frequently a problem in the network is only described to an engineer in vague terms, often with respect to an affected application. “Call quality is choppy,” “these services are intermittently dropping out,” etc.

From this, the engineer must navigate a great sea of interconnected data across network devices and in systems and repositories of information surrounding the network. It is common to waste time going down multiple rabbit holes trying to understand what is happening. The amount of time between being notified of a possible issue and understanding it is called mean-time-to-insight, or MTTI.

AOS is for Network Engineers, by Network Engineers

Through a closed-loop telemetry feedback design, AOS greatly reduces MTTI by creating situational awareness in the network. When AOS detects that the state of the network is not what is expected, for instance if a cable swap occurs or a BGP session is bouncing, the specifics around this undesired state and its blast radius are shown clearly to the engineer. This means fewer rabbit holes and shorter MTTI. The engineer understands what is going on much sooner and can start planning corrective action.

With AOS, and without touching the CLI, the engineer can swap out any switch with any other switch model compatible with the design of the network, as expressed by the network designer.

All of the things discussed in this post come together to make this possible. This works regardless of the shape of the network: however many network devices are on the edge, however many are in the core, however many are in a rack, and however they are connected with each other and with the hosts attached to them. Even if you are swapping out one switch for a switch made by another vendor, this works. And all of this starts with solving the network-blindness problem.

Even if you intend to use one vendor, this capability is an important indication of whether a network automation solution will be just another expensive obstacle to work around, or whether it will really enable and empower you.

You, the Network Engineer.

AOS is a platform built by network engineers, for network engineers. For network engineers at all levels from the “Oh crap! I’m the network person now!” engineer to the “Let me tell you why OSPF is really a distance vector protocol” engineer.

AOS is available and ready for production networks now, but there is still a journey ahead of us. Join us on this journey and become a more effective network engineer while also helping to reshape the future of networking!