Atlanta Data Center - Extended maintenance
Incident Report for MacStadium
Postmortem
Postmortem owner Luke Kavanagh, Manager - Operations
Incident The Atlanta data center experienced a network outage
Priority P1
Affected services All Atlanta-based customer host traffic affected

Executive summary

During a scheduled maintenance window the Atlanta Data centers experienced a network outage lasting approximately two hours (02:10am EST - 04:05am EST). During a network device code update, the border leaf switch configurations were inadvertently modified by the update which prevented layer-3 customer traffic from flowing. Once the code lines were identified & updated, service was restored.

Postmortem report

Fault

A Scheduled Maintenance network device code update performed an overwrite of existing network configuration which created a scenario where the Atlanta data center was uncontactable to customer traffic.

Impact

All customer layer three traffic was unable to flow inbound/outbound from the Atlanta data center

Detection

The network engineer performing the scheduled maintenance detected the network outage while performing the maintenance.

Response

Once the MacStadium Incident Management Team were alerted, they initiated the incident response to conduct a thorough investigation.

Recovery

With the help of our network vendor partners, the offending network code was identified & updated which allowed traffic to resume normally.

Timeline

June 20th, 2021

00:01 AM EST - Scheduled maintenance was started by the MacStadium Network Engineering Team.

02:05 AM EST - It was noted by a network engineer that customer traffic had been halted.

02:10 AM EST - The MacStadium Engineering Team performs a rollback of all changes made prior to the halting of traffic, but this unfortunately does not resolve the issue.

02:14 AM EST - The Incident Manager was notified of the issue.

02:24 AM EST - The MacStadium Incident Manager declares an outage & opens an outage bridge for the Incident Team to investigate.

03:07 AM EST - The MacStadium Incident Team contact the network vendor for additional support.

03:08 AM EST- The MacStadium Status page is updated to reflect the outage status.

03:12 AM EST - The network vendor partner joins the outage bridge to help identify the issue.

03:50 AM EST - After numerous troubleshooting steps, the offending code is identified by the vendor.

03:53 AM EST - One of the border leaf switches is updated with new code & this starts to allow traffic to resume.

04:03 AM EST - Service is fully restored, the Incident Team continue to work with the network vendor to identify the root cause.

04:09 AM EST - The MacStadium status page is updated to reflect the restoration of service.

04:44 AM EST - The MacStadium Incident Manager declares the outage is now over & service has been fully restored. The outage bridge is closed.

Root Cause Analysis

Problem:

During a scheduled maintenance window, a network device code update halted all inbound/outbound traffic from the Atlanta data center.

Investigation:

Upon a thorough investigation by the MacStadium incidence team along with our network vendor team it was identified that the network code update inadvertently modified our layer three network configurations which prevented customer traffic from flowing.

Mitigation steps:

Improper layer3 configuration was deployed/stretched in a multisite configuration which caused layer3 to be disabled within the evpn/ vxlan fabric after the network switches rebooted regardless of the version of code, thus this was deemed to be human error. The configuration has since been corrected, validated and tested with the network vendor to ensure network traffic continuity in the event of any border leaf network switches rebooting under any circumstances.

Follow-up tasks

MacStadium understands our role as a service provider. We can and do apply all relevant lessons learned from past events as we strive to create reliable remote hands capabilities to our Customers. Our goal is to provide unparalleled service and delivery to our customers' Mac infrastructure needs. We take that responsibility seriously and are constantly working to improve customer experience.

Posted Jun 22, 2021 - 23:57 UTC

Resolved
The Engineering Teams have identified and corrected a configuration issue. We will continue to monitor the network closely.
Posted Jun 20, 2021 - 11:02 UTC
Monitoring
Services remain fully operational with no ongoing concerns. We will continue to monitor our systems closely.
Posted Jun 20, 2021 - 08:50 UTC
Identified
The Engineering Teams have identified and corrected a configuration issue, and normal service has been restored to the Atlanta Data Center at this time. We are continuing our monitoring of the situation to ensure ongoing stability.
Posted Jun 20, 2021 - 07:55 UTC
Investigating
Scheduled maintenance for today (Reference ID: CHANGE - 97) must be extended beyond the original maintenance window scheduled to end at 7:00 AM UTC. The Atlanta Data Center is currently losing packets at this time, and the MacStadium Engineering Team has engaged with the vendor Engineering Support Team.
Posted Jun 20, 2021 - 07:01 UTC
This incident affected: Atlanta Data Center.