We would like to apologize for any issues this degradation has caused you or your customers. As you can see below, we performed a detailed analysis of the incident and are already working on action items to provide you with the most reliable service in the industry.
Issue
Our clients may have experienced HTTP 500 errors when requesting non-cached content for approximately 24 minutes.
Impact
~3.3% traffic drop between Nov 30 13:57:00 and 14:21:26 UTC due to issues with serving non-cached content.
Timeline
Nov 29 12:45:39 2021 UTC: new Lua code deployed on one host, no issues spotted
- the host logged 3 errors:
attempt to call method 'gcore_new' (a nil value)
at 12:48:51, 12:50:05, 12:50:08
- an engineer checked the logs on the host but did not notice these 3 errors
Nov 29 14:03:03 2021 UTC: new Lua code deployed on a larger canary environment, ab-test (13 hosts), no issues detected
- no manual logs analysis was performed
- no significant increase in error count on the global dashboard was noticed by the engineer
Nov 30 13:56:16 2021 UTC: new Lua code deployed on all CDN edge nodes
- almost instantly we detected an enormous increase in HTTP 500 responses
Nov 30 14:11:00 2021 UTC: the changes were rolled back
Nov 30 14:26:30 2021 UTC: count of HTTP 500 responses fell to baseline numbers
Root cause
1. CDN edge nodes running cache servers started responding with HTTP 500 for uncached content due to an inability to get content from the origin
- the cache server utilises special Lua code to resolve the origin source
- in some cases this Lua code could not be called due to a race condition
the race condition was likely to arise during deployment of a new code version
- we were deploying Lua code aimed at decreasing the number of DNS requests during cache server initialisation
- the new version of the Lua code was updated on disk while the previous version had already been loaded into the workers; this is usually not a problem because
- all required code is cached by a worker after it gets its first request (lazy initialization)
- the code on disk is compatible with the code already loaded by the worker(s)
however, some cache server workers had not loaded this Lua code because
- they had received a signal to flush the cached code due to a worker reload seconds before (reload is a routine automated activity and part of the CDN resource reconfiguration process, which can be triggered at any time)
- the workers had received zero requests to process since the reload (so there was no need to load the code from disk)
note the timeframe between the last reload and the first request served: during this window the code on disk was changed
- when a worker got its first request to serve, it tried to load part of the code from disk and failed
- backward compatibility of the code was not preserved
- the cache server responded with an HTTP 500 error
2. We did not identify the issue during the canary release because the engineer considered the error rate insignificant compared with the global error rate graph
- no dedicated dashboard for canary deployments is in place where it would be easy to spot the issue or trigger an alert
3. Lack of backward compatibility for internal code
- lack of Lua code development/deployment guidance or robust automation
4. No Lua code warnings were spotted during the pre-prod and canary releases
- a drawback of the current observability tooling
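The mixed-version failure mode above can be sketched in Lua. This is an illustration only: the module name and call site are assumptions, not taken from the actual codebase; only the error text is from our logs.

```lua
-- Illustrative sketch, assumed names: "resolver" stands in for the real module.
-- A worker whose module cache was flushed by a reload loads the code lazily,
-- on its first request:
local resolver = require("resolver")
-- If the file on disk changed between the flush and this first request, the
-- freshly loaded module no longer matches the code already compiled into the
-- worker, and an incompatible call raises exactly what we saw in the logs:
--   attempt to call method 'gcore_new' (a nil value)
local origin = resolver:gcore_new(host)
```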
Action points
We will implement new measures and quality gates, as well as improve current solutions.
eliminate the race condition
- add a require statement to the init_by_lua block, which eliminates this particular case since no workers would initialize in a lazy manner
- manage Lua code from one place/service (updated code should arrive in one go, using atomic operations)
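The first action item could look like the following nginx configuration fragment (a sketch assuming OpenResty; "resolver" is a placeholder module name, not the real one):

```nginx
# Sketch, not the actual configuration.
http {
    # init_by_lua_block runs once in the master process at startup/reload,
    # so every worker forks with the module already loaded. There is no lazy
    # require() on the first request, and therefore no window in which the
    # code on disk can change between a cache flush and the first request.
    init_by_lua_block {
        require("resolver")
    }
}
```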
observability, canary improvements, and automation
- create dedicated canary environment dashboard
- implement more gradual and safer deployment strategies (10% of the fleet > 30% > 50% > 70%) in current deployment tooling
- notify the deploying engineer in Slack about the deployment state and provide them with a link to the canary environment dashboard
- catch/trigger on internal Lua alerts (implement non-blocking Lua VM error monitoring/alerting)
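Non-blocking error capture could be sketched as wrapping the resolution call in pcall and logging the failure instead of letting it surface only as an HTTP 500; all names here are placeholders, and a real implementation would feed a metrics/alerting pipeline rather than just the error log.

```lua
-- Sketch with placeholder names, assuming an OpenResty request handler.
local ok, err = pcall(function()
    return resolver:resolve_origin(host)
end)
if not ok then
    -- Surface the Lua VM error to the error log without masking it; an
    -- alert can then trigger on the rate of these log entries.
    ngx.log(ngx.ERR, "lua resolver error: ", err)
end
```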
enhance documentation
- add a README for the new deployment tooling, as well as a link to this postmortem
- review the new engineer onboarding procedure