Write-up
Bold Checkout not loading / loading slowly
Incident Root Cause Analysis
Incident: Bold Checkout Outage / Degradation

Date and time: Tuesday, April 25, 10:50 AM CT - 1:05 PM CT (135 min)

Summary: During a routine maintenance event on cloud-hosted infrastructure, customers using Bold Checkout experienced slow or intermittently available service. Once the issue was detected, the maintenance was reverted to the original configuration, but this did not fully resolve the issue. Further investigation was performed and the issue was remediated internally.

Impact: During the incident, customers experienced errors or a failure to load Bold Checkout. Additionally, some customers may have entered Bold Checkout successfully but experienced slow or intermittent response times when placing an order. These issues led to a lower-than-normal volume of orders.

Root Cause: Initial investigation discovered an unexpected change in internal routing behavior, which caused a large portion of network traffic within the Checkout network to be routed incorrectly to internal downstream resources. Upon rollback, auto-scaling technology was enabled to allow services to scale back up. However, the services did not recover correctly, which prevented continuity of service and isolated computing resources within the network.

The post-incident investigation into this routing anomaly uncovered a legacy configuration that was unique to this area of Bold’s environment and was not compatible with the updated configurations applied during the maintenance event. The reversion was completed, but it in turn caused anomalous behavior in the auto-scaling technology, which further disrupted services as it tried to return Checkout to normal operation.

Detection: This issue was detected by Bold employees in real time while the routine maintenance was being performed. We also received multiple alerts from our automated alerting and monitoring systems.

Resolution: Upon discovering the breakdown in network flow from Checkout to internal resources, we halted and reverted all maintenance work being performed. In this case, the reversion failed to fully restore service. Further investigation found that a portion of the traffic was still not being routed correctly, and it was later found that auto-scaling of our services was behaving incorrectly due to the reversion. Once this was discovered, a configuration change was made to correct the services, and Checkout resumed normal operation.