May 28, 2019

Underlay Fabric Health Check for VMware vCenter

When operating VMware vSphere®, which comprises VMware ESXi™ and typically VMware vCenter®, certain requirements are placed on the physical underlay fabric. The most notable are:

  1. VLAN configurations for the ESXi facing ports on the top-of-the-rack (a.k.a. ToR) leafs
  2. Port-channel/LAG/LACP configurations
  3. MTU configuration

These fabric configurations may have been provisioned manually or with basic automation (scripting or programming) to the extent possible. However, the same requirements are also affected by many Day 2 operations, such as bringing applications, clusters, and racks up or down, or integrating infrastructure from a recent acquisition. Some or all of these Day 2 operations may also have been scripted. What is rarely done is to capture these requirements in a single source of truth and validate them continuously in a closed loop. Yet this is crucial for reflecting the true health of the infrastructure and for reducing the number and cost of service outages.

The challenge in performing this validation lies in the roles and responsibilities of the teams and software systems involved. In large enterprises, different teams typically own the virtualization stack and the physical underlay fabric. Similarly, different products and tools manage these two worlds individually. Yet these teams and systems must work in a coordinated fashion for a successful end-to-end deployment with optimal performance.

Apstra Operating System (AOS®) tackles these challenges with a unique intent-based approach. AOS makes the fabric highly programmable, not only for provisioning but also for monitoring and ongoing operations. The programmability is delivered through intent-based RESTful APIs that operate on the network as a whole. For example, a common VLAN can be provisioned on all the racks with a single REST API call. These APIs can be consumed by higher-level provisioning or operational workflows to enable holistic automation.

Using the AOS Virtual Infrastructure Manager feature, a VMware vCenter server can be registered as a ‘Virtual Infra Manager’ in AOS. Registered virtual infra managers are added to the specific AOS blueprint(s) representing the fabric that interconnects the hypervisors. AOS reads the relevant parts of the virtual inventory (i.e., hypervisors, virtual machines, virtual switches) and stores them as part of the overall operational blueprint. AOS correlates individual hypervisors to server nodes in the blueprint, thereby extending visibility into the virtual workloads running on the servers.

This unified representation of the underlay and virtual infrastructure elements in the Apstra graph datastore opens doors for compelling new use cases that have been out of reach until now. For ease of troubleshooting, the discovered hypervisors, virtual machines and virtual networking elements are visualized in the context of the physical fabric topology. Virtual machines are searchable by their names, IP addresses, etc. For automated monitoring of VMware vCenter requirements, AOS includes predefined Intent Based Analytics probes that visualize, monitor and alert on the aforementioned cross-cutting requirements:

  1. The VLAN mismatch probe organizes all VLAN-related requirements by ToR port and reports an anomaly if the fabric is missing VLAN(s), has extra VLAN(s), has a misconfigured traffic type (tagged vs. untagged), or carries double-tagged traffic
  2. The LAG mismatch probe reports misconfigurations of LAG-related settings between the two systems. For example, a port-channel is configured on the ToR but no LAG is configured on the vCenter side, or vice versa
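
The comparisons these two probes describe can be sketched in a few lines of Python. This is only an illustration of the logic, not the AOS implementation or its API: the `PortVlans` model and the `vlan_anomalies` and `lag_anomaly` functions are hypothetical names invented here, and the sketch assumes the VLAN and LAG state for each ToR port has already been collected from both the fabric and vCenter. Double-tagged traffic is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortVlans:
    """VLANs seen on one side of a ToR-to-hypervisor link (hypothetical model)."""
    tagged: frozenset = frozenset()
    untagged: frozenset = frozenset()


def vlan_anomalies(fabric: PortVlans, virtual: PortVlans) -> dict:
    """Compare what the fabric provides on a port with what the hypervisor needs."""
    fabric_all = fabric.tagged | fabric.untagged
    virtual_all = virtual.tagged | virtual.untagged
    # VLANs present on both sides but tagged on one and untagged on the other.
    wrong_type = {
        vlan for vlan in fabric_all & virtual_all
        if (vlan in fabric.tagged) != (vlan in virtual.tagged)
    }
    return {
        "missing_in_fabric": sorted(virtual_all - fabric_all),
        "extra_in_fabric": sorted(fabric_all - virtual_all),
        "tagging_mismatch": sorted(wrong_type),
    }


def lag_anomaly(tor_has_port_channel: bool, vcenter_has_lag: bool):
    """Return a label when only one side of the link is configured for LAG."""
    if tor_has_port_channel == vcenter_has_lag:
        return None
    if tor_has_port_channel:
        return "lag_missing_on_vcenter"
    return "port_channel_missing_on_tor"


if __name__ == "__main__":
    fabric = PortVlans(tagged=frozenset({10, 20, 30}), untagged=frozenset({1}))
    virtual = PortVlans(tagged=frozenset({1, 10, 20}), untagged=frozenset({40}))
    print(vlan_anomalies(fabric, virtual))
    print(lag_anomaly(tor_has_port_channel=True, vcenter_has_lag=False))
```

In this example the fabric is missing VLAN 40, carries an extra VLAN 30, and treats VLAN 1 as untagged while the virtual side tags it.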

One way to think of this is that AOS automatically extracts fabric intent from the higher-level, application-centric intent that virtualization administrators express in VMware vCenter. AOS takes this a step further and offers on-demand remediation of the anomalies it identifies! The remediation is done by automatically changing the fabric to meet the needs of the virtualization stack. These features have been shipping for the past few GA releases of AOS.

The same approach lays the foundation for many more interesting and insightful analytics such as:

  1. Detecting MTU misconfiguration
  2. Validating presence of correct default gateways (a.k.a. SVIs) on the correct ToRs
  3. Validating different types of management traffic are segmented appropriately
  4. Ensuring redundancy and effective link utilization at different levels
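
As an illustration of the first item, an MTU check could follow the same compare-both-sides pattern. The sketch below is a guess at what such a check might look like, not a description of any AOS probe. It assumes a hypothetical input of two dictionaries keyed by (ToR, port), and it assumes the rule that the fabric port MTU should be at least as large as the MTU of the virtual switch behind it.

```python
def mtu_anomalies(fabric_mtu: dict, virtual_mtu: dict) -> list:
    """List links where the fabric MTU is unknown or smaller than the vSwitch MTU.

    Both arguments map a hypothetical (tor, port) key to an MTU in bytes.
    Each result is a (link, fabric_mtu_or_None, virtual_mtu) tuple.
    """
    anomalies = []
    for link, v_mtu in sorted(virtual_mtu.items()):
        f_mtu = fabric_mtu.get(link)
        # Flag links the fabric does not report, or whose MTU is too small.
        if f_mtu is None or f_mtu < v_mtu:
            anomalies.append((link, f_mtu, v_mtu))
    return anomalies


if __name__ == "__main__":
    fabric = {("leaf1", "eth1"): 9216, ("leaf1", "eth2"): 1500}
    virtual = {
        ("leaf1", "eth1"): 9000,
        ("leaf1", "eth2"): 9000,
        ("leaf2", "eth1"): 1500,
    }
    for anomaly in mtu_anomalies(fabric, virtual):
        print(anomaly)
```

Here the link on leaf1/eth2 is flagged because the fabric side is set to 1500 while the virtual switch expects 9000, and leaf2/eth1 is flagged because no fabric MTU is known for it.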

In conclusion, AOS can effectively meet (and automate) the fabric requirements of VMware vCenter for both provisioning and Day 2 operations. In addition, with intent-based, closed-loop, real-time validation, the fabric is continuously monitored against VMware vCenter intent. This kind of independent validation at the boundaries of complex infrastructure products is critical for effective operations and agility.