CDN | Singapore incident details
Incident Report for Gcore
Postmortem

Issue:
HTTP 5xx errors in multiple DCs

Timeline (UTC):
07.12.2021 12:50 UTC multiple datacenters received a very large number of established connections per server. Edge servers reached 100% CPU utilization in multiple locations: Frankfurt, Amsterdam, Hong Kong, Singapore, Tokyo, Osaka, Dhaka, Milan, Paris, Los Angeles, and Stockholm. The on-call engineer received multiple alerts about high request time, a high 5xx rate, and an overfilled SYN backlog.
07.12.2021 12:56 UTC excess traffic was rebalanced to other DCs
07.12.2021 13:05 UTC affected CDN edge nodes did not return to a normal state after the excess traffic was removed
07.12.2021 13:06 UTC engineers started restarting the HTTP server process on each affected CDN server
07.12.2021 13:40 UTC the HTTP server had been restarted on all affected edge servers
07.12.2021 13:57 UTC a connection limit was applied to the CDN resources that had received a very large number of incoming connections

Root-cause:

  1. The HTTP server used 100% CPU on edge nodes
  2. The HTTP server reached the TCP limit on outgoing connections to the same IP address/port
  3. Some CDN resources had no limit on upstream connections
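
Point 2 follows from how TCP identifies a connection: by the tuple of source IP, source port, destination IP, and destination port. When the source IP and the destination IP/port are fixed, only the source port can vary, so the ephemeral port range bounds the number of concurrent outgoing connections. The Python sketch below works through that arithmetic. It assumes the Linux default port range of 32768-60999; the report does not state the actual setting on the edge nodes, and the function names are illustrative only.

```python
# Assumption: the Linux default for net.ipv4.ip_local_port_range.
# The real value on the affected edge nodes is not given in the report.
DEFAULT_PORT_RANGE = (32768, 60999)


def parse_port_range(text):
    """Parse the contents of /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = text.split()
    return int(low), int(high)


def max_outgoing_connections(port_range=DEFAULT_PORT_RANGE, source_ips=1):
    """Upper bound on concurrent TCP connections to one destination IP:port.

    Each connection needs a distinct (source IP, source port) pair, so the
    bound is the number of ephemeral ports times the number of source IPs.
    """
    low, high = port_range
    if low > high:
        raise ValueError("port range is empty")
    return (high - low + 1) * source_ips


if __name__ == "__main__":
    print(max_outgoing_connections())
```

This is only an upper bound: sockets lingering in TIME_WAIT and locally reserved ports reduce the number of source ports that are actually available.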

Impact:
Users experienced increased latency and were unable to receive content from the affected CDN locations (Frankfurt, Amsterdam, Hong Kong, Singapore, Tokyo, Osaka, Dhaka, Milan, Paris, Los Angeles, Stockholm).

Action points:

  • Limit the number of nginx upstream connections on all CDN resources
  • Reimplement internal communication between nodes within a location to reduce the number of TCP connections
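
The first action point amounts to a per-resource cap on concurrent upstream connections. The report does not say which nginx mechanism or which limit value is used, so the Python sketch below only illustrates the idea. All names in it (`UpstreamLimiter`, `max_connections`, the resource identifiers) are hypothetical.

```python
import threading
from contextlib import contextmanager


class UpstreamLimitExceeded(Exception):
    """Raised when a resource already holds its maximum upstream connections."""


class UpstreamLimiter:
    """Caps concurrent upstream connections per CDN resource (illustrative)."""

    def __init__(self, max_connections):
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        self._max = max_connections
        self._lock = threading.Lock()
        self._active = {}  # resource id -> number of open connections

    def try_acquire(self, resource):
        """Reserve one connection slot; return False if the cap is reached."""
        with self._lock:
            count = self._active.get(resource, 0)
            if count >= self._max:
                return False
            self._active[resource] = count + 1
            return True

    def release(self, resource):
        """Give back one connection slot for the resource."""
        with self._lock:
            count = self._active.get(resource, 0)
            if count <= 1:
                self._active.pop(resource, None)
            else:
                self._active[resource] = count - 1

    @contextmanager
    def connection(self, resource):
        """Hold a slot for the duration of a with-block, or fail fast."""
        if not self.try_acquire(resource):
            raise UpstreamLimitExceeded(resource)
        try:
            yield
        finally:
            self.release(resource)
```

Rejecting the request immediately once the cap is reached, instead of queueing it, keeps one busy resource from exhausting the connections and CPU that other resources on the same edge node share.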
Posted Dec 20, 2021 - 17:18 UTC

Resolved
The issue has been resolved and we are monitoring performance closely. We will provide an RCA report in the following incident: https://statuspage.gcorelabs.com/incidents/yrzfcxg8tfzn
We are very sorry for the inconvenience caused by the incident!
Thank you for bearing with us!
Posted Dec 08, 2021 - 13:37 UTC
Monitoring
Our engineering team has implemented a fix to resolve the issue.
We are still monitoring the situation and will post an update as soon as the issue is fully resolved.
We apologize for any inconvenience this may have caused and appreciate your patience.
Posted Dec 07, 2021 - 13:48 UTC
Investigating
We are currently experiencing a partial outage that is affecting performance.
During this time, service may be partially unavailable for users.
We apologize for any inconvenience this may have caused and will share an update once we have more information.
Posted Dec 07, 2021 - 13:02 UTC
This incident affected: CDN (Singapore).