Root Cause Analysis
Location: Atlanta, Georgia, USA
This Document Provides a Root Cause Analysis of a Partial Outage Affecting Some VMware Cloud Customers
Date: Wednesday, April 1, 2020
Atlanta (ATL1) data center
At 00:10 EST on Wednesday, April 1, 2020 our Atlanta data center experienced a partial network outage specifically affecting the ability of some customers to access management functions of their vCenter appliances. This outage lasted for approximately 55 minutes and was due to a faulty high availability (HA) cable preventing failover during planned steps in a server platform upgrade. Engineers manually invoked the failover and full service was restored by 01:05 EST.
In preparation for an upcoming upgrade to advanced new fabric interconnects, configuration updates were pushed to one of the current FIs. Normally, failover would occur without interruption to network traffic; however, a failure in one of the physical HA cables between FIs prevented the failover, which simultaneously prevented authentication of the VPN connection used to remotely remediate the situation, thus requiring onsite intervention. The non-failover caused intermittent network traffic loss to multiple Atlanta vCenter appliances affecting the ability of some customers to perform management functions. As a note, this did not affect running hosts but only access to VCSA.
MacStadium immediately dispatched Engineering to perform the manual failover procedure onsite and network traffic flow immediately restarted. HA cables were then replaced, configuration checked and confirmed. Internal and external alerting quickly confirmed access, and spot checks were run system-wide. Although the issue was isolated to the FI, full cluster checks were conducted.
Going forward, VPN authentication will be performed prior to the use of virtual desktop infrastructure, precluding the possibility of a similar incident occurring in the future. Additionally, the integrity of physical HA links will be inspected at regular intervals. MacStadium is also investigating the implementation of advanced self-healing measures.
MacStadium takes issues resulting in a loss or degradation of service very seriously and we strive to continually understand the root cause so we may improve our level of service for our customers.