We would like to apologize for any issues this degradation has caused you or your customers. As you can see below, we performed a detailed analysis of the incident and are already working on action items to provide you with the most reliable service in the industry.
Issue
Our clients may have experienced HTTP 500 errors when requesting non-cached content for approximately 24 minutes.
Impact
~3.3% traffic drop between Nov 30 13:57:00 and 14:21:26 UTC due to issues with serving non-cached content.
Timeline
Nov 29 12:45:39 2021 UTC: new Lua code deployed on one host, no issues spotted
- the host logged 3 errors:
attempt to call method 'gcore_new' (a nil value)
at 12:48:51, 12:50:05, 12:50:08
- an engineer checked the logs on the host but did not notice these 3 errors
Nov 29 14:03:03 2021 UTC: new Lua code deployed on a larger canary environment, ab-test (13 hosts), no issues detected
- no manual logs analysis was performed
- no significant increase in error count on the global dashboard was noticed by the engineer
Nov 30 13:56:16 2021 UTC: new Lua code deployed on all CDN edge nodes
- almost instantly we detected an enormous increase in HTTP 500 responses
Nov 30 14:11:00 2021 UTC: the changes were rolled back
Nov 30 14:26:30 2021 UTC: count of HTTP 500 responses fell to baseline numbers
Root cause
1. CDN edge nodes running cache servers started responding with HTTP 500 for uncached content due to an inability to get content from the origin
- the cache server utilises special Lua code to resolve the origin source
- in some cases this Lua code could not be called due to a race condition
the race condition was likely to arise during deployment of a new code version
- we were deploying Lua code aimed at decreasing the number of DNS requests during cache server initialisation
- the new version of the Lua code was updated on disk while the previous version had already been loaded into the workers; this is usually not a problem because
- all required code is cached by a worker after it gets its first request (lazy initialization)
- the code on disk is compatible with the code already loaded by the worker(s)
however, some cache server workers had not loaded this Lua code because
- they had received a signal to flush the cached code due to a worker reload seconds before (reload is a routine automated activity and part of the CDN resource reconfiguration process, which can be triggered at any time)
- the workers had received zero requests to process since the reload (so there was no need to load the code from disk)
note the timeframe between the last reload and the first request served: during this window the code on disk was changed
- when a worker got its first request to serve, it tried to load part of the code from disk and failed
- backward compatibility of the code was not preserved
- the cache server responded with an HTTP 500 error
2. We did not identify the issue during the canary release because the engineer considered the error rate insignificant compared with the global error rate graph
- no dedicated dashboard for canary deployments is in place where it would be easy to spot the issue or trigger an alert
3. Lack of backward compatibility for internal code
- lack of Lua code development/deployment guidance or robust automation
4. No Lua code warnings were spotted during the pre-prod and canary releases
- a drawback of the current observability tooling
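The mixed-version failure mode above can be sketched in Lua. This is an illustration only: the module name and call site are assumptions, not taken from the actual codebase; only the error text is from our logs.

```lua
-- Illustrative sketch, assumed names: "resolver" stands in for the real module.
-- A worker whose module cache was flushed by a reload loads the code lazily,
-- on its first request:
local resolver = require("resolver")
-- If the file on disk changed between the flush and this first request, the
-- freshly loaded module no longer matches the code already compiled into the
-- worker, and an incompatible call raises exactly what we saw in the logs:
--   attempt to call method 'gcore_new' (a nil value)
local origin = resolver:gcore_new(host)
```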
Action points
We will implement new measures and quality gates, as well as improve current solutions.
eliminate the race condition
- add a require statement to the init_by_lua block, which eliminates this particular case since no workers would initialize in a lazy manner
- manage Lua code from one place/service (updated code should arrive in one go, using atomic operations)
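The first action item could look like the following nginx configuration fragment (a sketch assuming OpenResty; "resolver" is a placeholder module name, not the real one):

```nginx
# Sketch, not the actual configuration.
http {
    # init_by_lua_block runs once in the master process at startup/reload,
    # so every worker forks with the module already loaded. There is no lazy
    # require() on the first request, and therefore no window in which the
    # code on disk can change between a cache flush and the first request.
    init_by_lua_block {
        require("resolver")
    }
}
```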
observability, canary improvements, and automation
- create dedicated canary environment dashboard
- implement more gradual and safer deployment strategies (10% of the fleet > 30% > 50% > 70%) in current deployment tooling
- notify the deploying engineer in Slack about the deployment state and provide them with a link to the canary environment dashboard
- catch/trigger on internal Lua alerts (implement non-blocking Lua VM error monitoring/alerting)
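Non-blocking error capture could be sketched as wrapping the resolution call in pcall and logging the failure instead of letting it surface only as an HTTP 500; all names here are placeholders, and a real implementation would feed a metrics/alerting pipeline rather than just the error log.

```lua
-- Sketch with placeholder names, assuming an OpenResty request handler.
local ok, err = pcall(function()
    return resolver:resolve_origin(host)
end)
if not ok then
    -- Surface the Lua VM error to the error log without masking it; an
    -- alert can then trigger on the rate of these log entries.
    ngx.log(ngx.ERR, "lua resolver error: ", err)
end
```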
enhance documentation
- add a README for the new deployment tooling, as well as a link to this postmortem
- review the new engineer onboarding procedure