Cloud | Luxembourg-2 | Block Storage SSD Low-Latency - Block Volume incident details
Incident Report for Gcore
Postmortem

RCA – Luxembourg-2 (ED-16), Storage Issue (SSD Low-Latency / Linstor)

Summary

On 26-03-2024, a failed region test alert signaled a sequence of critical issues that led to operational disruptions within the cloud environment, primarily involving Linstor storage anomalies. The official incident timeline commenced at 17:40 UTC with the initial alert and concluded at 21:51 UTC on 27-03-2024, following comprehensive recovery efforts that encompassed node deactivations, manual file system repairs, and cluster synchronization procedures.

Timeline

· [26.03.2024 17:40 UTC]: The Cloud Operations Team received a notification of a failed region test alert.
· [26.03.2024 17:40 – 18:03 UTC]: The Cloud Operations Team investigated the issue, identified the root cause as related to Linstor storage, and began implementing a workaround. The workaround involved disabling two storage nodes, because the main issue was disk skew at the LVM level, observed on disks added after startup.
· [26.03.2024 18:40 – 19:18 UTC]: The workaround was applied, and internal tests were rerun successfully.
· [26.03.2024 20:31 UTC]: The Cloud Operations Team again received a notification of a failed region test alert.
· [26.03.2024 20:31 – 20:41 UTC]: The Cloud Operations Team confirmed the issue with Linstor storage and began investigating the root cause and potential solutions.
· [26.03.2024 20:41 – 21:52 UTC]: A significant redundancy load was discovered following the initiation of volume auto-evacuation/replication, with approximately 2,000 resources migrating between nodes and adversely affecting the DRBD cluster.
· [26.03.2024 23:46 UTC]: The replication process was ongoing, at a lower speed than expected (see the monitoring sketch after this timeline).
· [27.03.2024 02:02 UTC]: A major portion of the replication was still underway. It was decided to wait for the full replication process to complete, estimated to conclude by 05:30 UTC, exceeding the initial 3-hour expectation.
· [27.03.2024 05:30 – 06:21 UTC]: Following the completion of replication, the Cloud Team removed all failed volume-creation tasks and stuck volumes.
· [27.03.2024 06:21 UTC]: SSD Low-Latency management was reopened for volume creation.
· [27.03.2024 06:52 UTC]: Reports were received of mass failures in instance booting due to file system failures.
· [27.03.2024 06:52 – 14:48 UTC]: A significant increase in failed tasks and volumes was observed, impacting storage performance and overall management. This was compounded by DRBD resources getting stuck in the D-state because the replication process overwhelmed the available network capacity.
· [27.03.2024 14:48 – 21:51 UTC]: To address the unpredictable storage performance and errors, the Cloud Operations Team decided to sequentially reboot each storage node with full cluster synchronization, with an overall maintenance duration of 7 hours. Maintenance included node-by-node reboots, synchronization, and the clearing of failed tasks and stuck volumes.
· [27.03.2024 21:51 UTC]: The cluster was fully recovered and operational.
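For illustration, the kind of long replication wait described above is typically tracked by polling DRBD status on the storage nodes. The following minimal Python sketch counts resources that still report a syncing or inconsistent state in `drbdadm status` output; it is not the tooling used during this incident, and the coarse text matching is an assumption about the output format (a production tool would parse `drbdsetup status --json` instead).

```python
#!/usr/bin/env python3
"""Rough DRBD resync progress poller (illustrative sketch only)."""
import subprocess
import time

# States that indicate a resource has not finished synchronizing yet.
PENDING_MARKERS = ("Inconsistent", "SyncTarget", "SyncSource")

def pending_resources() -> list[str]:
    # Run `drbdadm status` and collect resource names that still show
    # a syncing/inconsistent marker anywhere in their status block.
    out = subprocess.run(
        ["drbdadm", "status"], check=True, capture_output=True, text=True
    ).stdout
    pending, current = [], None
    for line in out.splitlines():
        if line and not line[0].isspace():      # resource header line
            current = line.split()[0]
        if current and any(m in line for m in PENDING_MARKERS):
            if current not in pending:
                pending.append(current)
    return pending

if __name__ == "__main__":
    while True:
        still_syncing = pending_resources()
        print(f"{len(still_syncing)} resources still syncing")
        if not still_syncing:
            break
        time.sleep(60)  # poll once a minute until the resync completes
```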

Root Cause

Data Skew Due to Disk Volume Variance: The initial problem emerged from data skew caused by varying disk volume sizes on the first two nodes within the region. This variance led to failures during volume creation attempts, with errors reported at the Logical Volume Manager (LVM) level.
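As an illustration of how such skew can be detected before it causes allocation failures, below is a minimal Python sketch that compares physical volume sizes within a volume group by parsing `pvs --reportformat json` output on each node. The volume-group name `vg_linstor` and the 10% tolerance are assumptions for the example, not values from the incident.

```python
#!/usr/bin/env python3
"""Flag physical volumes whose size deviates from the rest of a VG (sketch)."""
import json
import subprocess

VG_NAME = "vg_linstor"   # assumption: local VG backing the storage pool
TOLERANCE = 0.10         # flag PVs more than 10% away from the median size

def pv_sizes(vg: str) -> dict[str, int]:
    # Query LVM for PV sizes in bytes, as machine-readable JSON.
    out = subprocess.run(
        ["pvs", "--reportformat", "json", "--units", "b", "--nosuffix",
         "-o", "pv_name,vg_name,pv_size"],
        check=True, capture_output=True, text=True,
    ).stdout
    rows = json.loads(out)["report"][0]["pv"]
    return {row["pv_name"]: int(row["pv_size"])
            for row in rows if row["vg_name"] == vg}

def main() -> None:
    sizes = pv_sizes(VG_NAME)
    if not sizes:
        print(f"no PVs found in VG {VG_NAME}")
        return
    median = sorted(sizes.values())[len(sizes) // 2]
    for name, size in sorted(sizes.items()):
        skew = abs(size - median) / median
        flag = "  <-- size skew" if skew > TOLERANCE else ""
        print(f"{name}: {size} B ({skew:.1%} from median){flag}")

if __name__ == "__main__":
    main()
```

Running this on every storage node and comparing the results between nodes gives an early warning of the kind of disk-size variance described above.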

Node Deactivation for Cluster Modification: To address the issue, two nodes were deactivated within the Linstor system, with the intention of removing one node from the cluster and re-adding it after a complete data clearance. During this process, all disks whose replicas were concurrently present on nodes 1 and 2 were forced to operate with a single copy of data.

Automatic Linstor Evacuation: An automatic evacuation process within Linstor was triggered on the second node, in line with the default setting that activates it after one hour.
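For context, this kind of automatic behavior can be suspended around planned maintenance. The sketch below calls the `linstor` CLI from Python to toggle the controller-level auto-evict setting; the property name `DrbdOptions/AutoEvictAllowEviction` reflects LINSTOR's auto-evict mechanism as we understand it, but the exact property names and defaults should be verified against the LINSTOR version in use. This is not the exact procedure followed during the incident.

```python
#!/usr/bin/env python3
"""Sketch: suspend LINSTOR auto-eviction around planned node maintenance.

Assumption: the controller-level property below governs the auto-evict
behavior (which defaults to roughly one hour after a node goes offline);
verify the property name against your LINSTOR release before use.
"""
import subprocess

def linstor(*args: str) -> None:
    # Thin wrapper around the linstor CLI; raises if the command fails.
    subprocess.run(["linstor", *args], check=True)

def disable_auto_evict() -> None:
    # Prevent the controller from evicting (and then auto-evacuating)
    # nodes that we deliberately take offline for maintenance.
    linstor("controller", "set-property",
            "DrbdOptions/AutoEvictAllowEviction", "false")

def enable_auto_evict() -> None:
    # Restore the default behavior once maintenance is finished.
    linstor("controller", "set-property",
            "DrbdOptions/AutoEvictAllowEviction", "true")

if __name__ == "__main__":
    disable_auto_evict()
    print("Auto-eviction disabled; run enable_auto_evict() after maintenance.")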

Data Transfer Storm: The evacuation process led to a data transfer storm that consumed all available network capacity.

Widespread Impact of the Data Storm: The storm severely impacted operations, including persistent requests to the Distributed Replicated Block Device (DRBD) layer. Approximately 2,000 synchronizing resources began to stall, as did the replication process itself.
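Stalls of this kind typically surface as processes or kernel threads stuck in the uninterruptible D state. The sketch below is a generic Linux diagnostic that scans /proc for such processes; it is offered only as an illustration, not as the specific procedure used during this incident.

```python
#!/usr/bin/env python3
"""List processes in uninterruptible sleep (D state) — generic diagnostic sketch."""
import os

def d_state_processes() -> list[tuple[int, str]]:
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as fh:
                data = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        # comm is wrapped in parentheses and may itself contain spaces,
        # so locate the closing parenthesis before reading the state field.
        lpar, rpar = data.index("("), data.rindex(")")
        comm = data[lpar + 1:rpar]
        state = data[rpar + 1:].split()[0]
        if state == "D":
            stuck.append((int(pid), comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(f"PID {pid}: {comm} is in D state")
```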

Stabilization: Stabilization occurred once data replication had completed and all DRBD resources had returned to an up-to-date state.

Action points

• Standardization of Node Configuration Across Regions: Align the configuration of older nodes to a unified cluster geometry with uniform disk sizes in all regions, as identified in ED and ANX. This action aims to mitigate data skew issues resulting from disk volume variance. The estimated time of accomplishment (ETA) is Q2 2024.
• Linstor Node Maintenance Instruction Update: Revise and update the Linstor node maintenance instructions to include disabling the auto-evacuation and auto-recovery features during maintenance activities, as well as procedures for the controlled restoration of cluster redundancy (see the sketch following this list). This measure is intended to prevent the inadvertent triggering of data transfer storms and their associated impacts. The ETA for this update is 14 April 2024.
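As a complement to the instruction update, the sketch below shows one way a node-by-node maintenance loop can be gated on full resynchronization, in the spirit of the sequential reboot described in the timeline. The node names are placeholders, the reboot step is a stub, and the reliance on coarse `drbdadm status` text matching (run on a storage node that hosts the relevant replicas) is an assumption for illustration only.

```python
#!/usr/bin/env python3
"""Sketch: gate node-by-node storage maintenance on a clean DRBD cluster."""
import subprocess
import time

STORAGE_NODES = ["storage-node-1", "storage-node-2", "storage-node-3"]  # placeholders
UNHEALTHY = ("Inconsistent", "SyncTarget", "SyncSource", "Outdated")

def cluster_is_clean() -> bool:
    # Consider the cluster clean when no resource reports a syncing or
    # degraded state in `drbdadm status` (coarse text check).
    out = subprocess.run(["drbdadm", "status"],
                         check=True, capture_output=True, text=True).stdout
    return not any(marker in out for marker in UNHEALTHY)

def wait_for_clean_cluster(poll_seconds: int = 60) -> None:
    while not cluster_is_clean():
        time.sleep(poll_seconds)

def maintain_node(node: str) -> None:
    # Placeholder: the actual drain/reboot/re-add steps would go here.
    print(f"maintenance on {node} would run here")

if __name__ == "__main__":
    for node in STORAGE_NODES:
        wait_for_clean_cluster()   # never touch a node while resync is pending
        maintain_node(node)
    wait_for_clean_cluster()
    print("all nodes maintained with full synchronization between steps")
```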

These action points are designed to address the root causes identified in the RCA by enhancing cluster uniformity, refining maintenance protocols to prevent future incidents, and exploring file system alternatives for greater stability.

Finally, we want to apologize for the impact this event caused for you. We know how critical these services are to your customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.

Posted Apr 18, 2024 - 21:51 UTC

Resolved
We'd like to inform you that the issue has been resolved, and we are closely monitoring performance to ensure there are no further disruptions. We will provide a Root Cause Analysis (RCA) report in the coming days to help you understand what caused the incident and the steps we have taken to prevent it from happening again in the future.

We apologize for any inconvenience this may have caused you, and want to thank you for your patience and understanding throughout this process.
Posted Mar 27, 2024 - 07:57 UTC
Monitoring
We are pleased to inform you that our Engineering team has implemented a fix to resolve the issue. We have checked most volumes and their attachments to virtual machines. Users should currently be able to create and manage low-latency volumes as usual. However, we are still closely monitoring the situation to ensure stable performance.

We will provide you with an update as soon as we have confirmed that the issue has been completely resolved.
Posted Mar 27, 2024 - 06:33 UTC
Update
The team is still working to restore accessibility. We apologize for any inconvenience this may have caused you and appreciate your patience while we work on the fix.
Posted Mar 27, 2024 - 02:56 UTC
Update
The team is working to restore accessibility. We apologize for any inconvenience this may have caused you and appreciate your patience while we work on the fix.
Posted Mar 27, 2024 - 00:39 UTC
Update
Replication is still in progress. We apologize for any inconvenience this may have caused you and appreciate your patience while we work on the fix.
Posted Mar 26, 2024 - 23:45 UTC
Update
Replication is in progress; ETA: 1.5 hours. We apologize for any inconvenience this may have caused you and appreciate your patience while we work on the fix.
Posted Mar 26, 2024 - 22:36 UTC
Update
The replication ETA to restore current volumes is 3 hours. We apologize for any inconvenience this may have caused you and appreciate your patience while we work on the fix.
Posted Mar 26, 2024 - 21:56 UTC
Identified
We want to assure you that our Engineering team is aware of the issue you are experiencing, related to storage replication after a hardware failure, and is actively working to resolve it. We apologize for any inconvenience this may have caused you and appreciate your patience while we work on the fix.
Posted Mar 26, 2024 - 21:52 UTC
Investigating
We are currently experiencing a partial outage in our service's performance, which may result in partial unavailability for users. We apologize for any inconvenience this may cause and appreciate your patience and understanding during this time.

We will provide you with an update as soon as we have more information on the progress of the resolution. Thank you for your understanding and cooperation.
Posted Mar 26, 2024 - 20:34 UTC
This incident affected: Cloud | Luxembourg-2 (Block Storage - Block Volume).