Dec 1, 2016

SDN: What It Should Have Been

For the last three months, I have been having a pretty good time checking off bucket list items — I’ve learned to kiteboard in Cape Hatteras, biked the Normandy and Brittany coastlines while visiting WWII battlefield sites, ridden Icelandic horses across a volcanic plain, and most recently trekked across Nepal to Everest Basecamp. I also got pretty good at Battlefield 1, Titanfall 2, and Destiny, and am still working through Dishonored 2, which is a good timesink — all in all, a pretty productive quarter.

While consistently burning my kite into the water at a high rate of speed and plodding step after step up over 18,000 feet, I had a lot of time for reflection without a lot of distractions. One of the main questions I kept coming back to was ‘Where did we go wrong with Software Defined Networking?’

It is not like SDN was a total failure, but certainly the reality has not lived up to the hype and expectations that many of the pundits and marketers predicted and projected. I feel my personal, and oftentimes employer/W2-influenced, viewpoint was a bit more moderate than the strongest advocates’ — this is probably best evidenced in my rather heavily commented posts on NetworkWorld, where I enjoyed some spirited debates with Art Fewell and a few others. I simply did not feel that we were hitting the mark with OpenFlow, and that most of the conversation about Network Virtualization was really about Transport Virtualization and YATF — ‘Yet Another Tagging Format’. Let me break down a few thoughts on some of the key technology trends we are seeing, for posterity’s sake…

Network Virtualization

To me, network virtualization as most vendors have described it is not really network virtualization, because it does not virtualize the network infrastructure. In traditional compute virtualization, the hypervisor layer abstracts the hardware from the OS, enabling multiple OSes to run on top of a single x86 CPU complex. This abstraction enables workload portability, vendor portability, new ways of effecting backup, and so on. So when I think of ‘what is network virtualization’, I envision a model where, similar to compute virtualization, we abstract the underlying hardware from the software and applications/configuration, and not just relabel a VPN or encapsulation method.

When we define network virtualization as ‘the ability to run multiple networks on top of one physical network/fabric’, it provides no differentiation between VLANs (available since 1993), MPLS VPNs, and L3-enabled encapsulation models such as VXLAN.

Transport Virtualization

What many vendors attempt to call Network Virtualization is, to me, really Transport Virtualization: the abstraction of the addressing from the end-point by using an encapsulation method. Nowadays this has generally converged on VXLAN rather than the other variants VMware and Microsoft promoted for a while. It enables me to represent a single network, even a traditional L3 routed one, as multiple networks with overlapping IP addresses, and to emulate an L2 environment over the L3 one. Easiest analogy: ATM LANE Emulated LANs… (sorry, I know that hurt, but the architectural parallels with the LECS and LES/BUS are pretty obvious.)
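As a concrete illustration, here is a minimal Python sketch of the 8-byte VXLAN header defined in RFC 7348. The 24-bit VNI is the whole trick: it is what lets overlapping tenant addresses coexist over a single L3 underlay. The constants follow the RFC; nothing here is tied to any vendor implementation.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header (RFC 7348).

    The outer UDP datagram (port 4789) carries this header plus the
    original Ethernet frame, which is how an L2 segment is emulated
    over an L3 routed network.
    """
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit value")
    flags = 0x08 << 24          # 'I' bit set: the VNI field is valid
    return struct.pack("!II", flags, vni << 8)

# Two tenants with overlapping IPs stay isolated because each frame
# is tagged with a different 24-bit VNI on the wire.
hdr = vxlan_header(5000)
assert len(hdr) == 8
```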

OpenFlow

OpenFlow is an interesting technology still searching for the meaning of its life and the problem it is going to solve: Field of Dreams — build it and they will come. There was a point in time when it was regarded as a massive game-changer and veritable second coming. The reality, though, is that it is a very low-level programming semantic, and not many people in corporate IT organizations have the computer science background, the free time and budget, or a realistic return-on-investment thesis to justify writing their own routing protocols and controlling their infrastructure flow-by-flow in a way that is markedly differentiated from what can be accomplished with a traditional routing protocol and some fancy, albeit sometimes painful, configuration via PBR.
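To make the ‘low-level semantic’ point concrete, here is roughly the level of detail an operator has to specify for every single flow. This is an illustrative Python structure, not any real controller’s API; actual deployments speak OpenFlow 1.x wire messages through a controller or library.

```python
# A simplified, hypothetical picture of one OpenFlow flow entry:
# exact match fields plus an action list, specified flow by flow.
# Multiply this by every flow you care about, and the ROI question
# in the text above becomes obvious.
flow_entry = {
    "priority": 100,
    "match": {
        "in_port": 3,
        "eth_type": 0x0800,        # IPv4
        "ipv4_src": "10.0.1.0/24",
        "ipv4_dst": "10.0.2.17",
        "tcp_dst": 443,
    },
    "actions": [
        {"type": "set_field", "eth_dst": "00:11:22:33:44:55"},
        {"type": "output", "port": 7},
    ],
    "idle_timeout": 30,            # entry expires if the flow goes quiet
}
```

With a routing protocol, none of this per-flow state exists; the protocol computes forwarding behavior for you, which is the trade-off the paragraph above is describing.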

OpenFlow became a technology that was interesting to professors, network researchers, and a few appliance hardware manufacturers but didn’t solve problems for IT professionals.

Net-net: it’s a useful construct for hardware developers to potentially accelerate their deployment of new systems that do packet forwarding/manipulation, especially in the security space — but I do not see the widespread ‘this changes everything’ outcome that was prophesied.

NFV: Network Functions Virtualization

Almost the exact opposite of OpenFlow in its genesis/origin story: NFV started with a real use case and then developed the provisioning and deployment models that enabled the desired outcome. I think in many cases this will be part of the future of edge routing systems. It is easy to envision a merchant-silicon router with a very cost-effective price-per-port, large tables, flexible buffering, and a low power footprint, operating alongside an x86-based service shelf with NPU/FPGA-accelerated data paths that deploys containers on a per-tenant basis, the combination of containers being determined by the services each tenant has subscribed to.

The concept of per-tenant containerization of network services enables not only a flexible allocation of capabilities on a per-tenant basis but also asymmetric upgrades, where Tenant-1 can be running v4 of an HA proxy while Tenant-2 has upgraded to v5 of the same component. I think the company that builds the service shelf, exposes some decent libraries for the NPU/FPGA acceleration functions, and, most importantly, controls service deployment and component lifecycle/dependency trees in an operator-friendly way will have something compelling here.
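A hypothetical sketch of that per-tenant model: a service catalog resolved into concrete container images, tenant by tenant. Every service name, version, and image tag here is invented purely for illustration.

```python
# Illustrative only: a catalog of network services, each available in
# one or more versions, mapped to container images.
CATALOG = {
    "firewall": {"v2": "fw:2.4"},
    "nat":      {"v1": "nat:1.9"},
    "ha_proxy": {"v4": "haproxy:1.5", "v5": "haproxy:1.7"},
}

# Each tenant subscribes to a set of services at a pinned version,
# which is what makes asymmetric upgrades possible.
TENANTS = {
    "tenant-1": {"firewall": "v2", "ha_proxy": "v4"},
    "tenant-2": {"ha_proxy": "v5", "nat": "v1"},
}

def containers_for(tenant: str) -> list[str]:
    """Resolve a tenant's subscriptions to the container images to deploy."""
    return [CATALOG[svc][ver] for svc, ver in TENANTS[tenant].items()]

# Tenant-1 stays on v4 of the HA proxy while Tenant-2 runs v5.
print(containers_for("tenant-1"))   # ['fw:2.4', 'haproxy:1.5']
print(containers_for("tenant-2"))   # ['haproxy:1.7', 'nat:1.9']
```

The interesting product problem is not this lookup, of course, but the lifecycle and dependency management around it, which is the point made above.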

Containers Containers Containers!

All this talk of NFV using containers brings me to another point of interest/worry — containers themselves. Up until 2005, network switches were primarily designed for bare-metal workloads. The advent of x86 virtualization at any reasonable scale broke the architectures of most switches we built prior to then. Going from one IP/MAC per host to 16-20, and then live-migrating workloads from one physical host to another, broke every fundamental principle of network design we had espoused from 1995-2005. It was a swift kick in the nether regions for network architects and caused significant upgrade cycles and the adoption of challenging operating models that are still being sifted and sorted.

But now we have containers… if you thought 20 MAC/IP pairs per host were a lot, wait for 250-500 workloads per host, with workload lifetimes as short as 2-3 seconds, and see how the network reacts. Current generations of switching silicon from the major vendors are capable of supporting this density from a hardware perspective — but what I really worry about is whether today’s manual, human-driven network architecture and operations will properly implement address summarization, and whether the selected routing protocols themselves can handle both the scale of addressing and the rate of change this induces in the underlying infrastructure.
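A quick back-of-the-envelope calculation shows why the rate of change is the scary part. Every number below is an assumption chosen for illustration, not a measurement.

```python
# Back-of-the-envelope churn estimate for one rack (all inputs are
# illustrative assumptions).
hosts_per_rack     = 40
workloads_per_host = 500     # upper end of the 250-500 range above
avg_lifetime_s     = 2.5     # short-lived containers

workloads   = hosts_per_rack * workloads_per_host   # concurrent MAC/IP pairs
churn_per_s = workloads / avg_lifetime_s            # steady-state turnover

print(f"{workloads:,} MAC/IP pairs, ~{churn_per_s:,.0f} add/withdraw events per second")
# 20,000 MAC/IP pairs, ~8,000 add/withdraw events per second
```

Compare that to the bare-metal era of one MAC/IP per host that essentially never moved: it is the churn rate, far more than the table size, that stresses the control plane.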


Containers also exacerbate another area I have spent too much time noodling on — policy. For the record, I am not a policy wonk, and while I have some practical experience in network/cybersecurity, I did not and still do not think of myself as ‘a security guy’. I like the space and care about it, but there are others who are far more ‘in it’ than I am. Fundamentally, I find that the value of a tool that creates, enforces, and/or monitors policies is proportional to how broadly the policy can be deployed.

I have seen companies run containers one per VM so they had a point where they could apply policy — this seems ludicrous until you ask why they are going with containerized architectures and their answer is, “because then we can hire the hottest developers out of college.” I have also seen companies avoid containers altogether because they could not yet apply the level of governance and control to that environment that they feel they need in order to deploy into production.

There is an industry gap here that needs to be filled; we will have to see which companies and technologies emerge to provide consistent policy across bare metal servers, network security appliances, virtual machines, containers, and cloud-deployed workloads and services (both IaaS and SaaS). This will be a fun space to watch, though a lot of confusion and jargon will be thrown in along the way.

LAN vs DC vs WAN

So one thing we learned with SDN is that location matters. Many folks rotated heavily onto the thesis that SDN was most applicable in the Data Center and would be adopted there first. I think the reality is that economics factor heavily into the decision of whether or not to apply a technology to a specific set of challenges. Certainly there were many interesting applications being discussed that added value to the operations of a data center, but many of the problems SDN focused on could too often be solved with the judicious application of more bandwidth. As much as we love the concept of steering and controlling flows and applications across the network, giving them automatic load-balancing across divergent paths (and every other neat application), the simple reality is that Nx10Gb, 40Gb, and 100Gb solves most if not all of these traffic optimization issues without anywhere near the complexity these fancy forwarding constructs impose. I have come to believe that in most cases simplicity will win.

What is needed in the Data Center is not fancy per-flow gee-whiz science fiction, but acceptance that a scalable and reliable data center is an expression of simplicity. My data center diagram should never look like a Jackson Pollock masterpiece — it should be boring, repetitive, and stable. The primary business value delivered by a data center network is the availability of applications and data; optimizations should be aligned with continuing to improve this metric.

Sitting here at the AWS re:Invent show, it is clear the cloud is the next frontier for the enterprise. There is sufficient gravity in the installed base of applications that would require refactoring to be cloud-deployable that wholesale enterprise adoption of the cloud is still a decade-long journey and not an overnight deployment, but when 32,000 people show up at the Amazon show you need to pay attention — there is a reason they are all here, twice the attendees we get at VMworld or CiscoLive. The atmosphere feels more like an Apple Macworld from 2011 than the deep geekfest that it is.

What is amazing is the economy of scale that a large cloud provider or modern ‘cloudy’ network operator can achieve versus the status quo we are used to. Every large cloud operator and modern operator I have worked with has built a custom, proprietary system for infrastructure management that, when designed and implemented well, does not even allow a network operator to enter CLI commands on the switches and routers in their plant. As these systems evolve and mature, they abstract away the vagaries of vendor implementations and normalize configurations and capabilities, so that vendor selection becomes a business decision rather than a technical one. These systems force normalization and the use of repeatable patterns in infrastructure deployment, and can then search and index across hundreds of thousands of data sources for anomalies. Not only are these systems unavailable in the enterprise, they are not commercially available in any capacity. They simply do not exist unless you have built your own.

By contrast, I shared a stage this past summer with Bill Smith, the President of Technical Operations at AT&T, where he received more than a few heated comments from a savvy buy-side analyst giving him grief for focusing on compressing his vendors’ margins while he has almost 100,000 employees in operations jobs that are often automated in more modern network infrastructures. It is fair to assume that AT&T does not have an infrastructure management system like Amazon, Google, Facebook, or Microsoft Azure does. (In all fairness, AT&T has a lot more legacy infrastructure to manage, and disparate network types and architectures as well, but it also seems to take a perverse joy in operating 25-100 disparate network operating system versions in production.)


In one of the neatest meetings I ever had with Andy Bechtolsheim, I spent most of an hour at the whiteboard showing off an idea where, in the CLI of a switch, we could describe a network outcome with three lines of configuration:

The type of network it was, for instance: Layer-3 Leaf/Spine named DC1
The type of node it was, for instance: Compute Leaf
The connectivity pattern it should adhere to: 4x10Gb Uplinks to 4x Spine Switches

This information would then be inserted as a set of TLVs into LLDP advertisements, and thus for each network we could evaluate whether the CLI-driven configuration resulted in an operating node and network that achieved the intent of the network. In short, it was a simple model for describing intent, evaluating whether intent was delivered, and quickly identifying inconsistencies in multi-nodal or multi-planar configuration.
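A rough sketch of how those three lines of intent might ride inside LLDP as TLVs, with a trivial check of observed state against intent. The TLV type numbers and value strings are invented for the sketch; a real implementation would use the organizationally specific TLV (type 127) with a vendor OUI.

```python
import struct

def tlv(t: int, value: bytes) -> bytes:
    """Pack an LLDP-style TLV: 7-bit type, 9-bit length, then the value."""
    header = (t << 9) | len(value)
    return struct.pack("!H", header) + value

# Hypothetical encoding of the three lines of intent from the whiteboard
# (type numbers 40-42 are made up for illustration).
intent_tlvs = (
    tlv(40, b"L3-leaf-spine:DC1")    # the type of network, and its name
    + tlv(41, b"compute-leaf")       # the type of node
    + tlv(42, b"4x10G->4xspine")     # the connectivity pattern
)

def validate(observed_spine_uplinks: int, expected: int = 4) -> bool:
    """Did the configured node actually achieve the advertised intent?"""
    return observed_spine_uplinks == expected

# A node hearing these advertisements can flag a miscabled uplink.
assert validate(4) and not validate(3)
```

The point of the whiteboard idea was exactly this asymmetry: the configuration says what you typed, while the advertised intent lets neighbors check what you meant.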

While it was an interesting conversation and a stunning ‘whiteboard/PowerPoint prototype’, it sadly never saw the light of a production day, because most infrastructures at that point were multi-vendor, and frankly, at a certain point in any business, customer-driven priorities always prevail over novel science projects, even reasonably well-thought-through ones.

What this made me realize, though, is that the same models we have successfully applied to software development, where test plans are created orthogonally to code and code is rapidly iterated until it passes automated testing, could and should be applied to infrastructure — especially if we are going to truly think of ‘Infrastructure as Code’.

TL;DR: I continue to believe there is a need here: we should not just use the configuration of the device itself as the single source of truth, but instead define the intent of what we want, and then measure and monitor the configuration AND the cabling AND the routing AND the real-time telemetry to determine whether the aggregate of these factors is achieving our desired intent.
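A minimal sketch of that closed loop, assuming illustrative field names throughout: intent is the source of truth, and configuration, cabling, routing, and telemetry are each checked against it independently.

```python
# Sketch only: intent as the source of truth, with several independent
# observations that must all agree before the network is called healthy.
# Field names and thresholds are illustrative assumptions.
INTENT = {
    "uplinks": 4,          # cabling intent
    "bgp_sessions": 4,     # routing intent
    "max_link_util": 0.80, # telemetry threshold
}

def evaluate(observed: dict) -> list[str]:
    """Return every way the observed state deviates from intent."""
    deviations = []
    if observed["cabled_uplinks"] != INTENT["uplinks"]:
        deviations.append("cabling does not match intent")
    if observed["bgp_established"] != INTENT["bgp_sessions"]:
        deviations.append("routing does not match intent")
    if observed["peak_util"] > INTENT["max_link_util"]:
        deviations.append("telemetry shows utilization above threshold")
    return deviations

state = {"cabled_uplinks": 4, "bgp_established": 3, "peak_util": 0.55}
print(evaluate(state))   # ['routing does not match intent']
```

Note that the device configuration never appears as the arbiter here; it is just one more observation to reconcile against the declared intent.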

Skills Gap

Lastly, I kept struggling with the concept of an emerging and ever-widening skills gap in networking. At the top end, we have network engineers who understand scale and programmatic device control. These engineers are more network programmers or infrastructure developers than network engineers in the traditional sense. In many cases they never touch a piece of equipment physically or even via SSH; they interact through programs they write in Python, Go, and other high-level languages.

On the other end we have operators who are bound to their keyboards, know the intricacies of each vendor’s CLI like the back of their hand, and often have multiple vendor-sponsored acronyms after their names on their business cards. (I should know; I have been one of these guys for twenty-odd years.) On this end of the spectrum, an engineer’s knowledge of a vendor’s CLI is the primary driver of vendor selection for their employer.

The Network Developers are out-earning the Network Engineers by 2-4x depending on the employer, and the tools the Network Developers use to achieve scale are not commercially available. The Network Developer often manages 100x the network scale of the Network Engineer, so the salary disparity is clearly understandable.

So why am I writing this and why am I writing it on Apstra’s blog?

I first connected with Apstra in September, saw an overview of what they were building, and it hit me on several levels. It solves real business challenges in IT for network engineers, infrastructure managers, and CIOs. It enables multi-vendor infrastructures to be deployed and operated without skills dependencies on individuals within IT — it expands rather than contracts the labor pool for key network roles. I like that it brings the concepts of intent, abstraction, and closed-loop telemetry to the network, and that it can be the best commercially available path to scaling infrastructure management and delivering a Self-Operating Network.

The executive and founding team at Apstra is well known to me, and they asked me to join them as a Strategic Advisor, helping with product/market fit, sales scaling, marketing messaging, and key customer identification. This is, simply, an offer I could not refuse: a chance to be part of building something I believe in, in a space I care about, with people who are passionate about bringing change to an industry that is ripe for it.

In short, we spent years chasing a dream with SDN, one that in its first phase was not realized. This team is building what SDN should have been.

Douglas Gourlay