Cloud | Luxembourg-2 | Block Storage SSD Low-Latency - Block Volume incident details
Incident Report for Gcore
Postmortem

RCA – Luxembourg-2 (ED-16), Storage Issue (SSD Low-Latency / Linstor)

Summary

On 26-03-2024, a sequence of critical issues was detected through a failed region test alert, leading to operational disruptions within the cloud environment, primarily involving Linstor storage anomalies. The official incident timeline commenced at 17:40 UTC with the initial alert and concluded at 21:51 UTC on 27-03-2024, following comprehensive recovery efforts that included node deactivations, manual file system repairs, and cluster synchronization procedures.

Timeline

· [26.03.2024 17:40 UTC]: The Cloud Operations Team received a notification of a failed region test alert.

· [26.03.2024 17:40 – 18:03 UTC]: The Cloud Operations Team investigated the issue, identified the root cause as related to Linstor storage, and began implementing a workaround. The workaround involved disabling two storage nodes because of disk skew at the LVM level, observed on disks added after startup.

· [26.03.2024 18:40 – 19:18 UTC]: The workaround was applied, and internal tests were rerun successfully.

· [26.03.2024 20:31 UTC]: The Cloud Operations Team again received a notification of a failed region test alert.

· [26.03.2024 20:31 – 20:41 UTC]: The Cloud Operations Team confirmed the issue with Linstor storage and began investigating the root cause and potential solutions.

· [26.03.2024 20:41 – 21:52 UTC]: A significant redundancy load was discovered following the initiation of volume auto-evacuation/replication, with approximately 2000 resources migrating between nodes and adversely affecting the DRBD cluster.

· [26.03.2024 23:46 UTC]: The replication process was ongoing at a lower speed than expected.

· [27.03.2024 02:02 UTC]: A major portion of the replication was still underway. It was decided to wait for the full replication process to complete, estimated to conclude by 05:30 UTC, exceeding the initial 3-hour expectation.

· [27.03.2024 05:30 – 06:21 UTC]: Following the completion of replication, the Cloud Team removed all failed volume creation tasks and stuck volumes.

· [27.03.2024 06:21 UTC]: SSD Low-Latency management was reopened for volume creation.

· [27.03.2024 06:52 UTC]: Reports were received of mass failures in instance booting due to file system failures.

· [27.03.2024 06:52 – 14:48 UTC]: A significant increase in failed tasks and volumes was observed, impacting storage performance and overall management. This was compounded by DRBD resources getting stuck in the D-state as the replication process overwhelmed the available network capacity.

· [27.03.2024 14:48 – 21:51 UTC]: To address the unpredictable storage performance and errors, the Cloud Operations Team decided to sequentially reboot each storage node with full cluster synchronization, for an overall maintenance duration of about 7 hours. Maintenance included a node-by-node reboot, synchronization, and clearing of failed tasks and stuck volumes.

· [27.03.2024 21:51 UTC]: The cluster was fully recovered and operational.

Root Cause

Data Skew Due to Disk Volume Variance: The initial problem emerged from data skew caused by varying disk sizes on the first two nodes in the region. This variance led to failures during disk creation attempts, with errors reported at the Logical Volume Manager (LVM) level.

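For illustration only, the minimal Python sketch below shows how one might compare LVM volume-group sizes across the storage nodes to detect the kind of size skew described above. The node hostnames, the volume-group name, and the use of SSH are assumptions made for the example; only standard `vgs` reporting options are used.

```python
#!/usr/bin/env python3
"""Sketch: compare LVM volume-group sizes across storage nodes to spot skew."""
import subprocess

NODES = ["storage-node-1", "storage-node-2"]   # hypothetical hostnames
VG_NAME = "linstor_vg"                          # placeholder volume-group name

def vg_size_bytes(node: str) -> int:
    # `vgs --noheadings --units b -o vg_size <vg>` prints e.g. " 960193626112B"
    out = subprocess.check_output(
        ["ssh", node, "vgs", "--noheadings", "--units", "b", "-o", "vg_size", VG_NAME],
        text=True,
    )
    return int(out.strip().rstrip("B"))

sizes = {node: vg_size_bytes(node) for node in NODES}
smallest, largest = min(sizes.values()), max(sizes.values())
skew_pct = (largest - smallest) / largest * 100
print(f"VG sizes (bytes): {sizes}")
if skew_pct > 5:                                # arbitrary 5% threshold
    print(f"WARNING: {skew_pct:.1f}% size skew between nodes -- uneven geometry "
          "can lead to allocation failures at the LVM level")
```
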
Node Deactivation for Cluster Modification: To address the issue, two nodes were deactivated in Linstor with the intention of removing one node from the cluster and re-adding it after a complete data clearance. During this process, all volumes whose replicas resided on both nodes 1 and 2 were left operating with a single copy of their data.

Automatic Linstor Evacuation: An automatic evacuation process within Linstor was triggered on the second node, in line with the default setting that activates evacuation after one hour.

Data Transfer Storm: The evacuation process led to a data transfer storm, consuming all available network capacity.

Widespread Impact of the Data Storm: The storm severely impacted operations, including persistent requests to the Distributed Replicated Block Device (DRBD) layer. Approximately 2000 synchronizing resources began to stall, as did the replication process itself.

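As an illustration of how such stalled resources can be surfaced, the short Python sketch below parses `drbdadm status` on a storage node and counts resources whose local or peer disk state is not UpToDate. It assumes the DRBD 9 status output format and is a diagnostic sketch, not part of the recovery procedure used during this incident.

```python
#!/usr/bin/env python3
"""Sketch: summarise DRBD resource health on a node from `drbdadm status`."""
import re
import subprocess
from collections import Counter

# Assumes the DRBD 9 status format: one block per resource, separated by
# blank lines, with "disk:<state>" / "peer-disk:<state>" key:value pairs.
out = subprocess.check_output(["drbdadm", "status"], text=True)

unhealthy = []
for block in out.strip().split("\n\n"):
    if not block.strip():
        continue
    name = block.split()[0]
    states = re.findall(r"\b(?:peer-)?disk:(\S+)", block)
    if any(state != "UpToDate" for state in states):
        unhealthy.append((name, Counter(states)))

print(f"{len(unhealthy)} resource(s) not fully UpToDate")
for name, states in unhealthy[:20]:             # show a sample only
    print(f"  {name}: {dict(states)}")
```
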
Stabilization: Stabilization occurred once data replication finished and all DRBD resources returned to an up-to-date state.

Action points

• Standardization of Node Configuration Across Regions: Align the configuration of older nodes to a unified cluster geometry with uniform disk sizes in all regions, as identified in ED and ANX. This action aims to mitigate data skew issues resulting from disk volume variance. The estimated time of accomplishment (ETA) for this action is the second quarter (Q2) of 2024.

• Linstor Node Maintenance Instruction Update: Revise and update the Linstor node maintenance instructions to include the deactivation of auto-evacuation and auto-recovery features during maintenance activities, as well as procedures for the controlled restoration of cluster redundancy. This measure is intended to prevent the inadvertent triggering of data transfer storms and their associated impacts (a hedged sketch of such a pre-maintenance step follows this list). The ETA for this update is 14 April 2024.

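As a hedged illustration of the maintenance-instruction update described above, the Python sketch below shows how auto-eviction might be disabled for a single node before planned work. The node name is a placeholder, and the LINSTOR property name (DrbdOptions/AutoEvictAllowEviction) is an assumption that should be verified against the deployed LINSTOR release; this is a sketch, not the procedure used in production.

```python
#!/usr/bin/env python3
"""Sketch: exclude a LINSTOR node from auto-eviction before planned maintenance."""
import subprocess
import sys

NODE = sys.argv[1] if len(sys.argv) > 1 else "storage-node-2"   # placeholder name

def linstor(*args: str) -> None:
    # Echo and run a LINSTOR client command; verify property names against
    # the deployed LINSTOR release before relying on this.
    print("+", "linstor", *args)
    subprocess.run(["linstor", *args], check=True)

# Keep the node from being auto-evicted while it is serviced, so taking it
# offline does not trigger a cluster-wide evacuation/replication storm.
linstor("node", "set-property", NODE, "DrbdOptions/AutoEvictAllowEviction", "False")

# After maintenance, restore the default behaviour, e.g.:
# linstor("node", "set-property", NODE, "DrbdOptions/AutoEvictAllowEviction", "True")
```
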
These action points are designed to address the root causes identified in the RCA by enhancing cluster uniformity, refining maintenance protocols to prevent future incidents, and exploring file system alternatives for greater stability.

Finally, we want to apologize for the impact this event caused you. We know how critical these services are to your customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.

Posted Apr 01, 2024 - 17:20 UTC

Resolved
We'd like to inform you that the issue has been resolved, and we are closely monitoring the performance to ensure there are no further disruptions. We will provide a Root Cause Analysis (RCA) report in the coming days to help you understand what caused the incident and the steps we have taken to prevent it from happening again in the future.

We apologize for any inconvenience this may have caused you, and want to thank you for your patience and understanding throughout this process.
Posted Mar 27, 2024 - 23:14 UTC
Monitoring
We are pleased to inform you that our Engineering team has implemented a fix to resolve the issue. However, we are still closely monitoring the situation to ensure stable performance.

We will provide you with an update as soon as we have confirmed that the issue has been completely resolved.
Posted Mar 27, 2024 - 22:15 UTC
Update
We would like to inform you that our Engineering team is actively working on resolving the issue you are experiencing. We understand that this issue may be causing inconvenience, and we apologize for any disruption to your experience. Please know that we are doing everything we can to fix it as soon as possible.
Posted Mar 27, 2024 - 18:56 UTC
Identified
We are proceeding with a sequential reboot of the storage nodes. The cluster has many dead, stuck, or unmanaged volumes and tasks, and bringing it back into operation is only possible by rebooting each node in sequence.
At this time, we will turn off volume management and sequentially reboot each node. Existing disks with a replica will remain available for read/write during the work; volume creation and management will be unavailable for 1 hour. We understand that this issue may be causing inconvenience, and we apologize for any disruption to your experience. Please know that we are doing everything we can to fix it as soon as possible.
Posted Mar 27, 2024 - 16:58 UTC
Investigating
We are seeing issues with existing disks after restoring the SSD Low-Latency storage. Our team is working to fully stabilize the storage platform. We apologize for any inconvenience this may cause and appreciate your patience and understanding during this time.

We will provide you with an update as soon as we have more information on the progress of the resolution. Thank you for your understanding and cooperation.
Posted Mar 27, 2024 - 14:37 UTC
This incident affected: Cloud | Luxembourg-2 (Block Storage - Block Volume).