Postmortem owner | Luke Kavanagh, Manager - Operations |
---|---|
Incident | The Atlanta data center experienced a network outage |
Priority | P1 |
Affected services | All Atlanta-based customer host traffic affected |
During a scheduled maintenance window the Atlanta Data centers experienced a network outage lasting approximately two hours (02:10am EST - 04:05am EST). During a network device code update, the border leaf switch configurations were inadvertently modified by the update which prevented layer-3 customer traffic from flowing. Once the code lines were identified & updated, service was restored.
A Scheduled Maintenance network device code update performed an overwrite of existing network configuration which created a scenario where the Atlanta data center was uncontactable to customer traffic.
All customer layer three traffic was unable to flow inbound/outbound from the Atlanta data center
The network engineer performing the scheduled maintenance detected the network outage while performing the maintenance.
Once the MacStadium Incident Management Team were alerted, they initiated the incident response to conduct a thorough investigation.
With the help of our network vendor partners, the offending network code was identified & updated which allowed traffic to resume normally.
June 20th, 2021
00:01 AM EST - Scheduled maintenance was started by the MacStadium Network Engineering Team.
02:05 AM EST - It was noted by a network engineer that customer traffic had been halted.
02:10 AM EST - The MacStadium Engineering Team performs a rollback of all changes made prior to the halting of traffic, but this unfortunately does not resolve the issue.
02:14 AM EST - The Incident Manager was notified of the issue.
02:24 AM EST - The MacStadium Incident Manager declares an outage & opens an outage bridge for the Incident Team to investigate.
03:07 AM EST - The MacStadium Incident Team contact the network vendor for additional support.
03:08 AM EST- The MacStadium Status page is updated to reflect the outage status.
03:12 AM EST - The network vendor partner joins the outage bridge to help identify the issue.
03:50 AM EST - After numerous troubleshooting steps, the offending code is identified by the vendor.
03:53 AM EST - One of the border leaf switches is updated with new code & this starts to allow traffic to resume.
04:03 AM EST - Service is fully restored, the Incident Team continue to work with the network vendor to identify the root cause.
04:09 AM EST - The MacStadium status page is updated to reflect the restoration of service.
04:44 AM EST - The MacStadium Incident Manager declares the outage is now over & service has been fully restored. The outage bridge is closed.
Problem:
During a scheduled maintenance window, a network device code update halted all inbound/outbound traffic from the Atlanta data center.
Investigation:
Upon a thorough investigation by the MacStadium incidence team along with our network vendor team it was identified that the network code update inadvertently modified our layer three network configurations which prevented customer traffic from flowing.
Mitigation steps:
Improper layer3 configuration was deployed/stretched in a multisite configuration which caused layer3 to be disabled within the evpn/ vxlan fabric after the network switches rebooted regardless of the version of code, thus this was deemed to be human error. The configuration has since been corrected, validated and tested with the network vendor to ensure network traffic continuity in the event of any border leaf network switches rebooting under any circumstances.
MacStadium understands our role as a service provider. We can and do apply all relevant lessons learned from past events as we strive to create reliable remote hands capabilities to our Customers. Our goal is to provide unparalleled service and delivery to our customers' Mac infrastructure needs. We take that responsibility seriously and are constantly working to improve customer experience.