May 2, 2017

Composition @scale with AOS 1.2

Apstra just released AOS® 1.2, which is a major step towards delivering on the vision of an intent-based Self-Operating Network™.

The key features introduced in AOS 1.2 allow network operators and developers to leverage and hone their programming skills. Whether you’re a network operator learning some programming skills, or a full-fledged developer, AOS 1.2 was built for you; and you’re my primary audience for this blog.

So why is this release so important?

About Intent

To answer that, let’s add some substance to the definition of intent. At the highest level, intent is a declarative specification of the desired outcome (say, a network connectivity service). It defines “what”, not “how”. Fine. While correct, this definition is too high level to trigger a bar fight. Or even a fight at the standards committee meeting or at a meetup (this is Silicon Valley after all).

How about the next level of detail: intent is the single source of truth (regarding the intended state of your infrastructure) that one can programmatically reason about. I will add some detail behind these two phrases that will filter out some noise (and potentially trigger a bar fight). If you are wondering why one should even care about intent-based networking systems and their benefits, this Gartner report provides a nice summary.

Intent As The Single Source Of Truth

Why is the single source of truth important? Why is there a tendency to have multiple sources of truth in the first place? This is because, in its essence, intent is a dynamic community of objects representing business rules, users, apps, policies, inventory, constraints, capabilities, design elements and it is highly variable in nature. Without the single source of truth, you will be spending most of your time in accidental complexity developing a coordination layer that synchronizes a growing number of sources of truth that come with different formats and semantics.

So the first question is, how do you represent this single source of truth? A good representation is one that is not overly complicated, can be easily extended, and is easy to reason about. It turns out that a variant of graph-based representation satisfies these requirements and we will cover it in more detail in subsequent sections.
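To make the idea of a graph-based representation concrete, here is a minimal sketch in Python. It is not the AOS data model: the `IntentGraph` class, the node types, and the relationship names are all hypothetical, chosen only to show why a graph is simple, extensible, and easy to query.

```python
from dataclasses import dataclass, field


@dataclass
class IntentGraph:
    """Toy property graph (hypothetical, not the AOS schema)."""
    nodes: dict = field(default_factory=dict)  # node id -> {"type": ..., properties}
    edges: list = field(default_factory=list)  # (source id, relationship, target id)

    def add_node(self, node_id, node_type, **props):
        self.nodes[node_id] = {"type": node_type, **props}

    def add_edge(self, source, rel_type, target):
        self.edges.append((source, rel_type, target))

    def neighbors(self, node_id, rel_type):
        """Ids of nodes reachable from node_id over one rel_type edge."""
        return [t for s, r, t in self.edges if s == node_id and r == rel_type]


graph = IntentGraph()
graph.add_node("leaf1", "switch", role="leaf")
graph.add_node("spine1", "switch", role="spine")
graph.add_node("web", "virtual_network", vlan_id=10)
graph.add_edge("leaf1", "linked_to", "spine1")
graph.add_edge("leaf1", "hosts", "web")

# Extending intent is additive: a new node type and a new relationship,
# with no change to what is already stored.
graph.add_node("no_overlap", "policy", rule="unique_vlan_ids")
graph.add_edge("no_overlap", "applies_to", "web")

print(graph.neighbors("leaf1", "hosts"))
```

The point of the sketch is the last few lines: a new kind of business rule becomes a new node and a new edge, rather than a new table or a new file format to synchronize.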

One more note about single source of truth. Today, compliance based on config verification essentially treats configurations as the source of truth. While they certainly are a source of truth, and can be treated as declarative, albeit low-level, intent, the biggest problem is that they are very difficult (read: impossible) to reason about when you have a complex system. A bug in the device’s code may expose vulnerabilities even in the presence of a perfectly compliant configuration. A more attractive approach is where compliance is done on the intent and not on the configurations. Indeed, intent-driven telemetry would catch these vulnerabilities as anomalies. For example, assume you had an ACL configured to block traffic. There may be a bug in the code that causes traffic to pass despite this ACL entry. With AOS intent-driven telemetry, an active probe can readily determine this condition, triggering an anomaly.
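The ACL example can be sketched as a comparison between an expectation derived from intent and the result of an active probe. This is an illustration only: `Expectation`, `check`, and `buggy_device_probe` are hypothetical names, the addresses are made up, and the probe is a stub standing in for real traffic generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    """What intent says should be observable (hypothetical shape)."""
    source: str
    destination: str
    port: int
    should_pass: bool


def check(expectation, probe):
    """Run the probe; return an anomaly dict if it contradicts intent, else None."""
    passed = probe(expectation.source, expectation.destination, expectation.port)
    if passed == expectation.should_pass:
        return None
    return {
        "expectation": expectation,
        "expected_pass": expectation.should_pass,
        "observed_pass": passed,
    }


def buggy_device_probe(source, destination, port):
    # Stub probe: simulates a device that forwards the traffic even though
    # the ACL blocking it is present in a perfectly compliant configuration.
    return True


blocked = Expectation("10.0.1.5", "10.0.2.9", 22, should_pass=False)
anomaly = check(blocked, buggy_device_probe)
print(anomaly)
```

A configuration audit would pass in this scenario, because the ACL entry is there. The anomaly appears only because the check is made against the intended behavior rather than against the configuration text.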

However, taking advantage of the opportunity created by this technology requires a change in behavior.

Reasoning About Intent

Now about that “reasoning about the intent”. The first question is “why does one need to reason about the intent?” In general you need to reason about it in the presence of some change. And this change can come from an operator (business rule change) or from the managed infrastructure (operational status change). Reasoning about intent programmatically enables automation of all aspects of the service lifecycle such as design, build (including resource allocation), configuration rendering, expectation generation, test execution, anomaly detection, troubleshooting, change request validation and refutation.

Here are some reasoning examples that fall under the broader categories of semantic validation, change management, service rendering, and operational validation:

  • Does the device I want to deploy in my reference design have all the capabilities I expect from it to play its role successfully?
  • Have all the resources (such as IP addresses, ASNs, VLANs, VNIs, MLAG IDs) been allocated coherently, according to the design? And coherently here may mean “avoid overlaps” in one case, or “leverage reuse” in another.
  • Do I have enough resources to deploy a new virtual network?
  • Is adding/removing a device violating some policy (such as the desired oversubscription)?
  • Can I deploy virtual network endpoints without violating a requirement related to isolation?
  • Can I collect topology information so that I can drain/undrain a device properly in order to minimize the impact of the maintenance operation?
  • What are the expectations for a given device that should be met in order to declare that it plays its role successfully? For example, what should its LLDP neighbors be, which BGP sessions should be up, what routing table entries should be present, etc.?
  • For a specific device, how do I extract relevant information from the intent about its neighbors and policies so that I can render its configuration properly?
  • Do expectations and collected statuses match? Otherwise, generate an anomaly that contains the metadata to provide the context required to reason about what happened.
  • Is there a need to automatically trigger additional telemetry, driven by some anomaly, thus enabling automated drill-down to aid troubleshooting? For example, you can write a plugin to trigger a ping to a device from its neighbors in response to a lost heartbeat.
  • Can I accumulate/aggregate statuses/counters from some elements matching certain criteria in order to reason about the status of the aggregate? For example, you can ask: “what percentage of time was the overall traffic on interfaces matching certain criteria above 80%?”
  • If I ascertain that my system is in bad shape can I roll-back to a version of intent that I liked, marked as such in my “bookmarks” folder?
  • Can I collect proofs that a certain set of expectations were met in order to support a failure claim refutation (and get the DevOps engineer out of jail)?
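Two of the questions above are simple enough to sketch directly: the “avoid overlaps” reading of coherent resource allocation, and the percentage-of-time question about an aggregate. The function names and sample data below are hypothetical, and the second function assumes equally spaced samples so that a fraction of samples stands in for a fraction of time.

```python
from collections import Counter


def overlapping(allocations):
    """Return resource values assigned to more than one owner, sorted.

    allocations maps an owner (e.g. a virtual network name) to its value.
    """
    counts = Counter(allocations.values())
    return sorted(value for value, n in counts.items() if n > 1)


def percent_time_above(samples, threshold):
    """Percentage of equally spaced samples strictly above threshold."""
    if not samples:
        return 0.0
    above = sum(1 for sample in samples if sample > threshold)
    return 100.0 * above / len(samples)


# Hypothetical VLAN allocation: "web" and "cache" collide on VLAN 10.
vlan_ids = {"web": 10, "db": 20, "cache": 10}
print(overlapping(vlan_ids))  # [10]

# Hypothetical aggregate utilization of the matching interfaces over time.
utilization = [0.55, 0.85, 0.90, 0.70]
print(percent_time_above(utilization, 0.80))  # 50.0
```

Where the design calls for “leverage reuse” instead, the same allocation data would be checked with the opposite rule, which is why such checks belong in reasoning logic rather than in the data itself.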

Programmatic Reasoning About Intent

Now, we all know “everything can be done in software” and so can the reasoning logic described above. At one extreme, one can write a script that consolidates all the sources of truth scattered across various files and databases and then reasons about the result. And in the absence of change this may be an adequate solution. But “absence of change” is wishful thinking. A bamboo hut on the beach is a beautiful and romantic strategy until the storm hits or you are asked to build the second floor for your in-laws. Or as Mike Tyson said “everyone has a strategy until they get punched in the face.” If you fall down on the ground when hit with the change, don’t call it programmatic reasoning. The recent AWS outage took place because there was no programmatic reasoning about the change that was executed.

What we are talking about here is how to enable reasoning about intent in a maintainable, testable fashion in the presence of change (and punches in the face). Enabling reasoning in a programmable fashion has four components:

  1. You need to be able to extend intent schema without the need for complex refactoring or schema normalisation. This is because the intent schema will change. That is the only constant about it.
  2. You need the capability to easily decompose intent into subsets of elements of interest, each set parsed by its own piece of reasoning logic, along the lines discussed earlier. Let’s call these pieces of reasoning logic reasoning plug-ins. This decomposition is the key to dealing with scaling issues. You don’t want every piece of logic reacting to every change in intent.
  3. The reasoning plug-in needs to be notified when a change in its area of interest takes place. This notification should also include the nature of change (addition, update, deletion) and the relevant metadata. This asynchronous, reactive capability (as opposed to polling) is another key to addressing scaling issues as intent gets more complicated.
  4. When reacting to a change and implementing the reasoning logic, the plug-in should have a well-defined API to traverse intent, and be resilient to intent schema change.
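The four components above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the AOS API: `IntentStore`, `Change`, `subscribe`, `put`, and `delete` are hypothetical names, and callbacks run synchronously here for simplicity, whereas a real platform would deliver notifications asynchronously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    kind: str      # "added", "updated" or "deleted"
    node_id: str
    node: dict     # properties after the change (before it, for deletions)


class IntentStore:
    """Toy in-memory intent store that notifies plug-ins of matching changes."""

    def __init__(self):
        self._nodes = {}
        self._subscriptions = []  # (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        """Register a plug-in callback for nodes where predicate(node) is true."""
        self._subscriptions.append((predicate, callback))

    def put(self, node_id, **props):
        kind = "updated" if node_id in self._nodes else "added"
        self._nodes[node_id] = props
        self._notify(Change(kind, node_id, props))

    def delete(self, node_id):
        props = self._nodes.pop(node_id)
        self._notify(Change("deleted", node_id, props))

    def _notify(self, change):
        # Only plug-ins whose area of interest matches are called.
        for predicate, callback in self._subscriptions:
            if predicate(change.node):
                callback(change)


store = IntentStore()
seen = []

# A plug-in that only cares about virtual networks. Using .get() keeps the
# predicate tolerant of nodes that lack a "type" field.
store.subscribe(lambda node: node.get("type") == "virtual_network", seen.append)

store.put("leaf1", type="switch", role="leaf")        # ignored by the plug-in
store.put("web", type="virtual_network", vlan_id=10)  # delivered as "added"
store.put("web", type="virtual_network", vlan_id=20)  # delivered as "updated"
store.delete("web")                                   # delivered as "deleted"

print([(change.kind, change.node_id) for change in seen])
```

The switch change never reaches the plug-in, which is the decomposition described in item 2, and each delivered change carries its kind and metadata, as described in item 3.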

AOS 1.2 supports these sophisticated mechanisms out of the box. You, as a DevOps engineer, have all these features at your fingertips.

In summary, with AOS 1.2 the intent definition language (allowing you to define that single source of truth) AND reasoning about intent are built into the platform.

Why does “built into the platform” matter? Because it means less code (and fewer bugs) to write, review, and maintain, and fewer tests to write, review, and maintain. In short, more agility and availability.

Composition @scale

With this, we address what we believe is a fundamental problem of dealing with composition at scale. (And by “scale” we mean complexities introduced by the size of the infrastructure as well as the size and complexity of new business rules and infrastructure capabilities.) To this point, in his blog Peter Levine argues that “Aggregation is the New Virtualization”:

“Instead of divvying the resources of individual servers, large numbers of servers are aggregated into a single warehouse-scale (though still virtual!) “computer” to run highly distributed applications.” 

Dealing with composition at scale is the enabler for this aggregation.

With 1.2, you do not have to worry about the data persistence layer when a requirement to express a new business rule comes your way; the only thing you need to do is define the schema. APIs become instantly available. And you are able to develop new reasoning plug-ins using a repeatable pattern, resulting in maintainable, testable code. The developer persona essentially enables the operator persona to speak “declaratively”, through intent specification, while maintaining full control of the “imperative” part.

You are also able to evolve your reference design as you learn more about it. When you learn about a new vulnerability or something you missed, you are in trouble, but luckily with AOS, only for a short while. With AOS, you can learn how to avoid it and detect it. You can then program this rule by adding new expectations. And this is normal and expected as you will not be able to anticipate all the problems and corner cases. You have the opportunity to insert your expertise. You can even leverage machine learning to define and insert new expectations. AOS doesn’t care where the input for defining expectations comes from. Five years from now new technologies will invariably show up, new capabilities will be there, and your old best reference design will become obsolete. That is OK. You can define a new one using the AOS platform that you are already familiar with. Your business rules or system capabilities may change, but the nature of the composition problem and your proficiency in programming the changes into AOS do not.

AOS 1.2 Is Built For You

So who are you, who is the “developer persona” in reality? You are Apstra engineers (or future engineers; we are hiring!) developing AOS-supported reference designs. You are community contributors collaborating on community-developed reference designs. You are DevOps engineers in your organization developing a secret sauce for your unique requirements. You can focus on inserting expertise into your reference designs. What you don’t need to worry about is how to develop a distributed application that achieves high availability and fault tolerance through some means of coordination and/or consensus, which is notoriously hard to implement correctly and efficiently.

We invite you to try AOS 1.2 and would love to get your feedback. We hope you join our community of developers and look forward to your contributions.


Sasha Ratkovic

CTO, Founder