Partial Outage Affecting Some Cloud Customers
Incident Report for MacStadium
Postmortem

Partial Outage of vCenter Access

Root Cause Analysis

Report Date: April 2, 2020
Location: Atlanta, Georgia, USA
This Document Provides a Root Cause Analysis of a Partial Outage Affecting Some VMware Cloud Customers

Incident Date: Wednesday, April 1, 2020

Prepared by

  • Paul Benati, SVP Operations
  • David Hunter, VP Engineering
  • Preston Lasebikan, Systems Architect & Team Lead - Systems Engineering
  • Michael Rhoades, Manager, Service Center

Location

Atlanta (ATL1) data center

Event Description

At 00:10 EDT on Wednesday, April 1, 2020, our Atlanta data center experienced a partial network outage that affected the ability of some customers to access the management functions of their vCenter appliances. The outage lasted approximately 55 minutes and was caused by a faulty high availability (HA) cable that prevented failover during planned steps in a server platform upgrade. Engineers manually invoked the failover, and full service was restored by 01:05 EDT.

Chronology of Events / Timeline

  • 2020/04/01 at 00:10 EDT - Engineering remotely pushed planned configuration updates to the fabric interconnects (FIs) in preparation for an upgrade to a new, more advanced server platform.
  • 2020/04/01 at 00:14 EDT - Monitoring alerts indicated intermittent traffic loss to multiple Atlanta vCenter Server Appliance (VCSA) hosts, causing delays or an inability for some customers to reach their VCSAs externally.
  • 2020/04/01 at 00:15 EDT - Engineering quickly determined that its VPN authentication had been interrupted by the changes and dispatched personnel to intervene onsite.
  • 2020/04/01 at 01:02 EDT - Onsite manual intervention successfully initiated FI failover, and network traffic was immediately restored. Further inspection revealed that one of the HA heartbeat cables had failed; the cables were subsequently replaced.
  • 2020/04/01 at 01:15 EDT - Engineering performed failover tests to validate the replacement cables and the updated configuration.

Summary Findings on Technical Causes

In preparation for an upcoming upgrade to new fabric interconnects, configuration updates were pushed to one of the current FIs. Normally, failover to the peer FI would occur without interruption to network traffic; however, a failure in one of the physical HA cables between the FIs prevented the failover. The same disruption prevented authentication of the VPN connection that would normally be used to remediate the situation remotely, so onsite intervention was required. The failed failover caused intermittent loss of network traffic to multiple Atlanta vCenter appliances, affecting the ability of some customers to perform management functions. Running hosts were not affected; only access to the VCSA management interface was impacted.
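
For illustration only, here is a minimal sketch of the kind of pre-change guard that can catch this condition: before pushing configuration to an FI, confirm that the cluster explicitly reports HA readiness, since a degraded heartbeat link means the peer cannot take over. The sample output, the "HA READY" marker, and the function names are assumptions based on common fabric interconnect CLI conventions, not details confirmed in this report or taken from MacStadium's tooling.

```python
# Hypothetical pre-change guard; not MacStadium's actual tooling.
# It inspects captured "show cluster extended-state" text from a fabric
# interconnect and refuses to proceed unless HA is explicitly reported ready.

# Illustrative output only; real output varies by platform and firmware.
SAMPLE_STATE = """\
A: UP, PRIMARY
B: UP, SUBORDINATE
HA NOT READY
Heartbeat not established with peer fabric interconnect
"""


def assert_ha_ready(state_output: str) -> None:
    """Raise if the cluster does not explicitly report HA READY."""
    if "HA READY" not in state_output:
        raise RuntimeError(
            "FI cluster is not HA READY; aborting configuration push "
            "until the heartbeat links are verified."
        )


if __name__ == "__main__":
    try:
        assert_ha_ready(SAMPLE_STATE)
        print("HA READY confirmed; safe to proceed with the planned change.")
    except RuntimeError as err:
        print(f"ALERT: {err}")
```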

Root Causes

  1. A failed HA cable prevented automatic failover of the fabric interconnect.
  2. Subsequent loss of VPN authentication prevented remote manual initiation of FI failover.

Reaction, Resolution, and Recommendations

MacStadium immediately dispatched Engineering to perform the manual failover procedure onsite, and network traffic was restored as soon as the failover completed. The HA cables were then replaced, and the configuration was checked and confirmed. Internal and external monitoring quickly confirmed that access had returned, and spot checks were run system-wide. Although the issue was isolated to the FI, full cluster checks were also conducted.

Going forward, VPN authentication will be performed prior to the use of virtual desktop infrastructure, greatly reducing the likelihood of a similar incident occurring in the future. Additionally, the integrity of the physical HA links will be inspected at regular intervals, as sketched below. MacStadium is also investigating the implementation of advanced self-healing measures.
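
As one possible shape for the periodic HA-link inspection mentioned above, the sketch below checks a fabric interconnect over SSH and, when run on a schedule, flags any state other than HA READY. This is a minimal sketch, not MacStadium's monitoring stack: the hostname, credentials, CLI session flow, and alerting hook are placeholder assumptions, and the exact commands and output format vary by platform and firmware.

```python
# Minimal HA-state check sketch (hypothetical; placeholder host, account,
# and key). Assumes the paramiko SSH library and a fabric interconnect
# whose local management shell exposes "show cluster extended-state".
import time

import paramiko

FI_HOST = "fi-a.example.internal"   # placeholder FI management address
USERNAME = "monitor"                # placeholder read-only account
KEY_FILE = "/etc/monitor/id_rsa"    # placeholder SSH private key


def fetch_cluster_state() -> str:
    """Open an interactive session and return the cluster-state output."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(FI_HOST, username=USERNAME, key_filename=KEY_FILE)
    try:
        shell = client.invoke_shell()
        # Drop into the local management shell, then query extended HA
        # state, which reports heartbeat problems in addition to HA READY.
        shell.send("connect local-mgmt\n")
        shell.send("show cluster extended-state\n")
        time.sleep(5)  # crude wait for the FI to emit its output
        return shell.recv(65535).decode(errors="replace")
    finally:
        client.close()


if __name__ == "__main__":
    state = fetch_cluster_state()
    if "HA READY" not in state:
        # Placeholder alerting hook: page the on-call engineer here.
        print("ALERT: fabric interconnect cluster is not reporting HA READY")
        print(state)
```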

MacStadium takes any issue resulting in a loss or degradation of service very seriously, and we strive to understand the root cause of each incident so that we can continually improve our level of service for our customers.

Sponsor Acceptance

Approved by the Project Executive

Date: 2020-04-02
Paul Benati
SVP Operations

Revision History

Postmortem
Posted Apr 02, 2020 - 13:48 UTC

Resolved
This incident has been resolved.
Posted Apr 01, 2020 - 05:50 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 01, 2020 - 05:21 UTC
Update
We are continuing to work on a fix for this issue.
Posted Apr 01, 2020 - 05:19 UTC
Identified
Our onsite Engineers have identified the issue and service is now restored. We will be closely monitoring the environment to ensure all services are fully functioning.
Posted Apr 01, 2020 - 05:02 UTC
Investigating
We are currently investigating an issue where a subset of VMware cloud customers is experiencing failures when attempting to access their vCenter for management purposes. All VMs appear to be running and general production is unaffected.
Posted Apr 01, 2020 - 04:10 UTC
This incident affected: Atlanta Data Center.