Atlanta Network Outage
Incident Report for MacStadium
Postmortem

Atlanta Network Outage

Date: Thursday March 19, 2020

Prepared by

  • Paul Benati, SVP Operations
  • David Hunter, VP Engineering
  • Khoa Tran, Network Architect & Team Lead - Network Engineering
  • Michael Rhoades, Manager, Service Center

Location

Atlanta (ATL1) data center

Event Description

At 15:05 EST on Thursday, March 19, 2020 our Atlanta data center experienced a network outage. This outage lasted for 41 minutes and was due to a border leaf switch's configuration file inadvertently being modified. The switch's configuration was restored from backup and network service was restored at 15:46 EST.

Chronology of Events / Timeline

  • 2020/03/19 at 15:05 EST - Engineering remotely executed a configuration script inadvertently against one of Atlanta's border leaf switches instead of the desired top of rack switch resulting in a network outage
  • 2020/03/19 at 15:05 EST - VPN and VDI connections down for Atlanta
  • 2020/03/19 at 15:08 EST - Onsite Engineering at our Atlanta DC (ATL1) was engaged to provide access for remote engineers to troubleshoot the outage
  • 2020/03/19 at 15:35 EST - Access was provided to remote engineers
  • 2020/03/19 at 15:35 EST - Remote engineers begin to investigate the outage
  • 2020/03/19 at 15:40 EST - Engineering determines one of the border leaf switch's configuration had been altered
  • 2020/03/19 at 15:46 EST - Engineering restored the original configuration from backup to the affected border leaf switch thereby restoring service

Summary Findings on Technical Causes

The configuration of one of the Atlanta border leaf switches was inadvertently modified. This modification necessarily made the two HA-paired border leaf switches out of sync with each other. Due to the nature of the Cisco HA capability, this partially altered configuration resulted in disabling external connectivity.

Root Causes

  1. The configuration of one of the Atlanta border leaf switches was inadvertently modified

Root cause 1

A script was remotely executed that inadvertently affected one of the two high availability paired border leaf switches . The script partially altered one switch's configuration resulting in disabled external connectivity at 15:05 EST. Network service was restored when the original configuration file was re-introduced from backup.

Reaction, Resolution, and Recommendations

MacStadium has disabled remote execution of certain shell scripting on primary switches. Significant changes can now be made only when directly connected to the switch, precluding the possibility of a similar incident in the future.

MacStadium takes issues resulting in a loss or degradation of service very seriously and we strive to continually understand the root cause so we may improve our level of service for our customers.

Posted Mar 23, 2020 - 13:17 UTC

Resolved
We continued monitoring overnight and did not observe any issues. This incident has been resolved. Thank you again for your patience and understanding.
Posted Mar 20, 2020 - 11:57 UTC
Monitoring
As of approximately 19:50 UTC a fix was implemented and we are monitoring the results. Again, we apologize for the inconvenience and are working to identify the root cause of our Atlanta border leaf systems failure.
Posted Mar 19, 2020 - 20:52 UTC
Identified
Our Engineers have identified the problem and are now working to resolve the incident. Initial reports point to one of our border leafs failing. More details to be provided shortly.
Posted Mar 19, 2020 - 19:50 UTC
Investigating
We are currently experiencing a network outage affecting our Atlanta data center. Engineers are working to identify the cause and remediate as soon as possible. The outage is not affecting any other services (e.g., servers, etc.). We apologize for the inconvenience and will post an ETA as soon as possible.
Posted Mar 19, 2020 - 19:30 UTC
This incident affected: Atlanta Core 1 and Atlanta Core 2.