OK, official info has been sent to the remaining affected customers; please check your e-mails.
Here is a timeline of the fuckup:
July 27, 2025 – Afternoon (GMT+3):
Multiple Seagate ST18000NM019J drives (firmware KM02) across two nodes suddenly powered down due to a firmware-related failure. Drives began reporting critical SMART alerts (Data channel impending failure), causing the RAID-6/60 array to become unavailable.
Result:
Addon storage volumes became inaccessible, and VPS services depending on those volumes were disrupted. Some NVMe-based systems also experienced write issues due to OS-level I/O buffering.
July 28, 2025 – Morning:
Our team accessed the datacenter, identified the fault, and began recovery efforts. All NVMe-only VPS services were successfully migrated to healthy nodes.
July 28–29, 2025:
RAID array access was restored in degraded mode, enabling partial access to addon volumes at limited transfer speeds.
🧪 Root Cause
Firmware fault affecting multiple ST18000NM019J (KM02) drives simultaneously
RAID controller entered fault mode due to concurrent SMART failures
No physical disk damage, no reallocated sectors or ECC errors — this was purely firmware-triggered
🛡️ Mitigation Going Forward
We are conducting a full infrastructure audit to identify any remaining ST18000NM019J drives with KM02 firmware
Affected drives will be proactively replaced or updated, where supported
RAID monitoring thresholds and firmware validation processes are being tightened to catch these failures earlier
This was an unprecedented firmware-level failure that bypassed typical RAID fault tolerance. We appreciate your understanding as we finalize recovery efforts for impacted systems.
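For anyone wanting to run the same kind of audit on their own fleet, here is a minimal sketch. It assumes smartmontools is installed and SAS drives at the usual /dev/sdX paths; the "Product:" and "Revision:" field names are taken from the smartctl output further down in this post.

```shell
#!/bin/sh
# Sketch: flag any ST18000NM019J drive still running firmware KM02.
# Assumes smartmontools; device glob is an example.

is_affected() {
  # $1 = full "smartctl -i" output for one drive
  model=$(printf '%s\n' "$1" | awk -F': *' '/^Product:/  {print $2}')
  fw=$(printf '%s\n' "$1" | awk -F': *' '/^Revision:/ {print $2}')
  [ "$model" = "ST18000NM019J" ] && [ "$fw" = "KM02" ]
}

for dev in /dev/sd[a-z]; do
  [ -e "$dev" ] || continue
  info=$(smartctl -i "$dev" 2>/dev/null) || continue
  if is_affected "$info"; then
    echo "AFFECTED: $dev"
  fi
done
```

Swapping the model/firmware strings makes the same loop usable for any other drive advisory.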
Here is the SMART output from one of the drives; maybe it can help others check theirs if they run the same model. All 6 drives reported exactly the same error, have the same power-on hours (~266 days), and were brand new.
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST18000NM019J
Revision: KM02
Compliance: SPC-5
User Capacity: 18,000,207,937,536 bytes [18.0 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500d8a51a07
Serial number: ZR57B8800000G20806CV
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Mon Jul 28 17:36:48 2025 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: Data channel impending failure general hard drive failure [asc=5d, ascq=30]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 31 C
Drive Trip Temperature: 60 C
Accumulated power on time, hours:minutes 6367:42
Manufactured in week 01 of year 2022
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 34
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 291
Elements in grown defect list: 1
Vendor (Seagate Cache) information
Blocks sent to initiator = 3828
Blocks received from initiator = 1650689
Blocks read from cache and sent to initiator = 9094
Number of read and write commands whose size <= segment size = 29
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 6367.70
number of minutes until next internal SMART test = 53
Seagate FARM log supported [try: -l farm]
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.016           0
write:         0        0         0         0          0          6.889           0
Non-medium error count: 0
Pending defect count: 0 Pending Defects
The SMART Health Status error above is what triggered the detach of the drives from the RAID array.
Here is a screenshot from the log of one of the Dell servers (R740) showing 2 drives leaving the "chat" at precisely the same time. DST was not set on the server, which is why the time shows only 12:00.
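If you just want a quick per-drive check rather than a full audit, something like this works. It is a sketch only: it matches the exact asc=5d code from the health-status line shown above, and the device path in the usage comment is an example.

```shell
#!/bin/sh
# Quick check for the "impending failure" health string seen above
# (asc=5d is the additional sense code the affected drives reported).

check_health() {
  # $1 = output of "smartctl -H /dev/sdX" for one SAS drive
  case "$1" in
    *'asc=5d'*) echo "IMPENDING FAILURE: check/replace this drive" ;;
    *)          echo "OK" ;;
  esac
}

# Typical use (example device path):
#   check_health "$(smartctl -H /dev/sda)"
```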
Host-C - VPS & Storage VPS Services – Reliable, Scalable and Fast - AS211462
"If there is no struggle there is no progress"