Incident Report for Gcore
Postmortem

We would like to apologize for any issues this degradation has caused to you or your customers. As you can see below, we have performed a detailed analysis of the incident and are already working on action items to provide you with the most reliable service in the industry.

Issue

Our clients may have experienced HTTP 500 errors when requesting non-cached content for approximately 24 minutes.

Impact

~3.3% traffic drop during Nov 30 13:57:00-14:21:26 UTC due to issues with serving non-cached content.

Timeline

Nov 29 12:45:39 2021 UTC: new LUA code deployed on one host, no issues spotted

  • the host had 3 errors in its logs: attempt to call method 'gcore_new' (a nil value) at 12:48:51, 12:50:05, and 12:50:08
  • an engineer checked the logs on the host but did not notice these 3 errors

Nov 29 14:03:03 2021 UTC: new LUA code deployed to the larger canary environment ab-test (13 hosts), no issues detected

  • no manual log analysis was performed
  • no significant increase in error count on the global dashboard was noticed by the engineer

Nov 30 13:56:16 2021 UTC: new LUA code deployed on all CDN edge nodes

  • almost instantly we detected an enormous increase in HTTP 500 responses

Nov 30 14:11:00 2021 UTC: the changes were rolled back

Nov 30 14:26:30 2021 UTC: the count of HTTP 500 responses fell back to baseline

Root cause

1. CDN edge nodes running cache servers started responding with HTTP 500 for uncached content because they were unable to fetch content from the origin

  • the cache server utilises special LUA code to resolve the origin source
  • but in some cases this LUA code could not be called due to a race condition

the race condition was likely to arise during deployment of a new code version

  • we were conducting a LUA code deployment aimed at decreasing the number of DNS requests during cache server initialisation
  • the new version of the LUA code was updated on disk while the previous version had already been loaded into the workers; this is usually not a problem, because
  • all required code is cached by the worker after it serves its first request (lazy initialization)
  • code on disk stays compatible with the code already loaded by the worker(s)

but some cache server workers had not yet loaded this LUA code, because

  • they had received a signal to flush cached code due to a worker reload seconds before (reload is a routine automated activity and part of the CDN resource reconfiguration process, which can be triggered at any time)
  • the workers had received zero requests since the reload (so there was no need to load the code from disk)

mind the timeframe between the last reload and the first request to serve: during this window the code on disk was changed

  • when a worker got its first request to serve, it tried to load part of the code from disk and failed
  • no code compatibility had been preserved
  • the cache server responded with an HTTP 500 error (a minimal sketch of this failure mode follows below)
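
The sketch below is a minimal, hypothetical reproduction of this failure mode in plain Lua, matching the "gcore_new (a nil value)" errors from the timeline. Apart from gcore_new itself, the module and method names are assumptions made for illustration only: code that is already loaded keeps calling a method on a module it only require()s lazily, and the version of that module that actually arrives from disk no longer defines it.

    -- Hypothetical sketch; module names and the replacement method are illustrative.
    -- The resolver code already loaded in a worker still expects the origin
    -- module to expose a gcore_new() constructor:
    local function resolve_origin(module_name)
        local origin = require(module_name)  -- lazy load on the first request
        return origin:gcore_new()            -- breaks if the module on disk changed
    end

    -- Simulate the incompatible module version that was on disk after the
    -- deployment: gcore_new() is no longer defined.
    package.preload["origin_module"] = function()
        return { create = function(self) return {} end }
    end

    local ok, err = pcall(resolve_origin, "origin_module")
    print(ok, err)
    -- false   ...: attempt to call method 'gcore_new' (a nil value)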

2. We did not identify the issue during the canary release because the engineer considered the error rate insignificant when compared against the global error rate graph

  • no dedicated dashboard for canary deployments is in place where it would be easy to spot such an issue or trigger an alert

3. Lack of backward compatibility for internal code

  • lack of LUA code development/deployment guidance or fully reliable automation

4. During the pre-prod and then canary releases, no LUA code warnings were spotted

  • a drawback of the current observability tooling

Action points

We will implement new measures and quality gates, as well as improve our current solutions.

get rid of the race condition

  • add a require statement to the init_by_lua block, which eliminates this particular case since no worker would initialize the code lazily (see the sketch after this list)
  • manage LUA code from one place/service (updated code should arrive in one go, using atomic operations)
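
A rough illustration of the first point, assuming an OpenResty-style setup; the module name gcore.origin_resolver is made up for this example:

    # nginx.conf fragment (illustrative module name)
    http {
        init_by_lua_block {
            -- require() runs once in the master process when nginx starts or
            -- reloads; worker processes fork with the module already present
            -- in package.loaded, so the first request after a reload never
            -- triggers a lazy require from disk
            require("gcore.origin_resolver")
        }
    }

For the second bullet, one common approach (mentioned here only as an example, not a committed design) is to deliver the new code tree into a versioned directory and switch a single symlink, so a worker can never read a half-updated mix of old and new files.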

improve observability, canary releases, and automation

  • create a dedicated canary environment dashboard
  • implement more gradual and safer deployment strategies (10% of the fleet > 30% > 50% > 70%) in the current deployment tooling
  • notify the deployment engineer in Slack about the deployment state and provide them with a link to the canary environment dashboard
  • catch/trigger on internal LUA alerts (implement non-blocking LUA VM error monitoring/alerting; see the sketch after this list)
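
One possible shape for the last bullet, shown only as a sketch (the shared dictionary, counter key, and module names are assumptions for illustration, not our actual implementation): the resolver call is wrapped in pcall so a LUA error is logged and counted instead of surfacing only as an opaque HTTP 500, and alerting can then watch the counter or the error log.

    -- Illustrative sketch; assumes nginx.conf declares:  lua_shared_dict lua_errors 1m;
    -- and that this runs in a request-phase handler such as content_by_lua_block.
    local errors = ngx.shared.lua_errors

    local ok, res = pcall(function()
        local origin = require("gcore.origin_resolver")  -- illustrative module name
        return origin:gcore_new()
    end)

    if not ok then
        -- log and count the failure so an alert can fire on the counter,
        -- without blocking or crashing the request-handling path
        ngx.log(ngx.ERR, "origin resolver failed: ", res)
        local n = errors:get("resolver_failures") or 0
        errors:set("resolver_failures", n + 1)
        return ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
    end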

enhance documentation

  • add a readme for the new deployment tooling and link to this postmortem from it
  • review the onboarding procedure for new engineers
Posted Dec 03, 2021 - 13:43 UTC

Resolved
This incident has been resolved. Currently we have no evidence of an ongoing impact. Considering the incident's severity, we are going to provide you with an RCA report in the following days. We are very sorry for the inconvenience caused by the incident, and we want to reassure you that we will make every effort to prevent such incidents in the future.
Thank you for bearing with us!
Posted Nov 30, 2021 - 15:47 UTC
Monitoring
Our engineering team has implemented a fix to resolve the issue.
We are still monitoring the situation and will post an update as soon as the issue is fully resolved.
We apologize for any inconvenience this may have caused and appreciate your patience.
Posted Nov 30, 2021 - 14:23 UTC
Investigating
We are currently experiencing a partial performance outage.
During this time, service may be partially unavailable for users.
We apologize for any inconvenience this may have caused and will share an update once we have more information.
Posted Nov 30, 2021 - 14:14 UTC
This incident affected: CDN.