Jan 17, 2017

Telemetry Lessons from Waze — The Self-Operating Network™

The other day I was taking an Uber home from a dinner in San Francisco. As I got into the car, the driver pulled up the Uber Partner application on one phone to start my ride, and then entered my destination in Waze on a second phone mounted to the dashboard. This piqued my curiosity: the Uber app automatically receives my destination and enters it into a Google-based mapping system for the driver, so why the co-witnessing?

Fortunately the driver was quite chatty and spent the next fifteen minutes explaining in detail how many passengers expect, and often even demand, that he use Waze. They feel that if the driver is using Waze then they are definitely going to get the fastest route to their destination, and they treat Waze as a trusted authority: “If Waze says take Franklin Street, you better not turn right on Van Ness!”

See, Waze introduced another level of telemetry to GPS routing: social telemetry. They created a closed-loop feedback algorithm in which drivers take different routes, the results are measured, and those measurements are used to constantly refine subsequent route selection. They even allow individuals to communicate alternative routes via the messaging functions. This saved me countless times when I was driving back from our offices in Menlo Park, rushing to catch dinner with my family in San Francisco. Its astute suggestions to use 280 instead of 101, or to take the 6th Street exit and then make a left on Ahern Way rather than continuing straight to the 9th Street exit, often enabled me to bypass long traffic jams and helped me catch my daughters before they went to bed!
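To make the closed-loop idea concrete, here is a minimal sketch in Python. It is not Waze's actual algorithm, which is not described here. The class name, the route names, and the use of an exponential moving average are all assumptions chosen for illustration: each reported trip time nudges the estimate for that route, and the next recommendation is simply the route with the lowest current estimate.

```python
class RouteEstimator:
    """Hypothetical closed-loop estimator: measure trips, refine estimates, re-pick."""

    def __init__(self, initial_minutes, alpha=0.5):
        # initial_minutes: dict of route name -> starting travel-time estimate
        # alpha: weight given to each new observation (an assumed tuning knob)
        self.estimates = dict(initial_minutes)
        self.alpha = alpha

    def best_route(self):
        # Recommend the route with the lowest current estimate.
        return min(self.estimates, key=self.estimates.get)

    def report(self, route, observed_minutes):
        # Feedback step: blend the measured trip time into the estimate.
        old = self.estimates[route]
        self.estimates[route] = (1 - self.alpha) * old + self.alpha * observed_minutes


estimator = RouteEstimator({"101": 40.0, "280": 45.0})
first_choice = estimator.best_route()      # "101" has the lower starting estimate
estimator.report("101", 60.0)              # a driver on 101 hits a traffic jam
second_choice = estimator.best_route()     # the estimate for 101 has risen above 280
print(first_choice, second_choice)
```

The point of the sketch is the loop itself: the recommendation changes only because a measurement of reality was fed back into the model.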

But what about in the networking world? What capabilities do we have in place for improving the accuracy of network routing and path selection? Today, the main method we use is link failure detection and routing protocol updates. We trust each node to track the state of each individual link connected to it and use methods like Bidirectional Forwarding Detection to determine if the link is working in both directions. When the link fails we then rely upon the routing protocol to communicate to all relevant nodes that the destinations associated with that link are now unreachable.

We have seen solid improvements over the last ten to fifteen years in methods to quickly and accurately detect link failure, as well as in routing scale and convergence times, but we have not added any groundbreaking new methods for link and path failure detection.

Waze introduced a new metric, a new source of accurate telemetry that improves the user’s experience so much that it has become a source with more authority about path selection than a London Black Cab driver who aced The Knowledge. It’s time for a new set of metrics and telemetry to be introduced that will augment today’s link failure detection model.

At Apstra, we view this closed-loop telemetry gathering as a core part of our system and a requirement in building a Self-Operating Network™. This is enabled through a much more active probing and testing model as well as the inclusion of intent and blueprint modeling into the testing criteria. Let me give you a simple example…

On the left you see a pretty standard, well-designed Leaf/Spine Network with a 4-way spine. The blade server is dual-homed with 4 links, two to each of two leaf switches in an active-active model. Again, pretty standard stuff nowadays in well-designed networks (although it is amazing how many midsize enterprise customers are still using Spanning Tree and the like as a topology construction protocol, artificially inducing STP loops in an attempt to build resiliency into the network design).

On the right you see an obviously visually malformed network. The cabling team clearly mis-cabled the network, connecting the blade server to one leaf switch with all four links. Visually we can quickly identify that our INTENT of this network is not being realized because of a relatively simple error. The challenge I would proffer though is this: which management tool that you use today would proactively identify this issue? How would you find it? Are there any alerts or alarms generated by the network equipment itself that would indicate that you have a looming issue on your hands the first time you try to upgrade the software/firmware on that leaf switch?

This is where a combination of intent and telemetry comes in. This is a simple example of the challenges that Apstra Operating System (AOS®) solves and the capabilities it brings to the network. AOS moves past the simple red-light/green-light world and last-generation link-failure methods to a model that uses an intelligent closed-loop architecture, meshing intent with the active monitoring of reality. Similar to how Uber drivers are relying on a second source of truth in Waze to deliver a better experience for their clients and get that coveted 5-star rating, we believe that a second source of truth in validating and co-witnessing the execution of intent in network implementation will be required to deliver on the vision of a Self-Operating Network.
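The mis-cabling example above can be sketched as a comparison between declared intent and observed reality. This is only an illustration under stated assumptions, not how AOS is implemented: the function name, the data shapes, and the device names (blade1, leaf1, leaf2) are hypothetical, and the observed neighbor list is assumed to come from some discovery mechanism such as per-link neighbor data collected from the devices.

```python
from collections import Counter


def validate_cabling(intended, observed):
    """Compare intended links per leaf against observed neighbors.

    intended: dict of server -> {leaf name: expected number of links}
    observed: dict of server -> list of leaf names, one entry per live link
    Returns a list of human-readable anomalies (empty when intent is met).
    """
    anomalies = []
    for server, want in intended.items():
        have = Counter(observed.get(server, []))
        # Check every leaf that is either expected or actually seen.
        for leaf in sorted(set(want) | set(have)):
            expected = want.get(leaf, 0)
            found = have[leaf]
            if found != expected:
                anomalies.append(
                    f"{server}: expected {expected} link(s) to {leaf}, found {found}"
                )
    return anomalies


# Intent: the blade server has two links to each of two leaf switches.
intent = {"blade1": {"leaf1": 2, "leaf2": 2}}

# Reality on the right-hand network: all four links land on one leaf.
miscabled = {"blade1": ["leaf1", "leaf1", "leaf1", "leaf1"]}

for anomaly in validate_cabling(intent, miscabled):
    print(anomaly)
```

Every link in the mis-cabled case is up, so a red-light/green-light view reports nothing wrong. The anomaly only appears once the observed state is checked against what the design intended.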

Mansour Karam

President, Founder