OK, official info has been sent to the remaining affected customers; please check your e-mails.
Here is a timeline of the fuckup:
July 27, 2025 – Afternoon (GMT+3):
Multiple Seagate ST18000NM019J drives (firmware KM02) across two nodes suddenly powered down due to a firmware-related failure. Drives began reporting critical SMART alerts (Data channel impending failure), causing the RAID-6/60 array to become unavailable.
Result:
Addon storage volumes became inaccessible, and VPS services depending on those volumes were disrupted. Some NVMe-based systems also experienced write issues due to OS-level I/O buffering.
July 28, 2025 – Morning:
Our team accessed the datacenter, identified the fault, and began recovery efforts. All NVMe-only VPS services were successfully migrated to healthy nodes.
July 28–29, 2025:
RAID array access was restored in degraded mode, enabling partial access to addon volumes at limited transfer speeds.
🧪 Root Cause
Firmware fault affecting multiple ST18000NM019J (KM02) drives simultaneously
RAID controller entered fault mode due to concurrent SMART failures
No physical disk damage, no reallocated sectors or ECC errors — this was purely firmware-triggered
🛡️ Mitigation Going Forward
We are conducting a full infrastructure audit to identify any remaining ST18000NM019J drives with KM02 firmware
Affected drives will be proactively replaced or updated, where supported
RAID monitoring thresholds and firmware validation processes are being tightened to catch these failures earlier
This was an unprecedented firmware-level failure that bypassed typical RAID fault tolerance. We appreciate your understanding as we finalize recovery efforts for impacted systems.
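For anyone wanting to run the same kind of audit on their own fleet, here is a minimal sketch. It assumes smartmontools is installed and SAS drives at the usual /dev/sdX paths; the "Product:" and "Revision:" field names are taken from the smartctl output further down in this post.

```shell
#!/bin/sh
# Sketch: flag any ST18000NM019J drive still running firmware KM02.
# Assumes smartmontools; device glob is an example.

is_affected() {
  # $1 = full "smartctl -i" output for one drive
  model=$(printf '%s\n' "$1" | awk -F': *' '/^Product:/  {print $2}')
  fw=$(printf '%s\n' "$1" | awk -F': *' '/^Revision:/ {print $2}')
  [ "$model" = "ST18000NM019J" ] && [ "$fw" = "KM02" ]
}

for dev in /dev/sd[a-z]; do
  [ -e "$dev" ] || continue
  info=$(smartctl -i "$dev" 2>/dev/null) || continue
  if is_affected "$info"; then
    echo "AFFECTED: $dev"
  fi
done
```

Swapping the model/firmware strings makes the same loop usable for any other drive advisory.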
Here is the SMART output from one of the drives; maybe it can help others check theirs if they run the same model. All 6 drives reported exactly the same error, have the same power-on hours (~266 days), and were brand new.
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST18000NM019J
Revision: KM02
Compliance: SPC-5
User Capacity: 18,000,207,937,536 bytes [18.0 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500d8a51a07
Serial number: ZR57B8800000G20806CV
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Mon Jul 28 17:36:48 2025 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: Data channel impending failure general hard drive failure [asc=5d, ascq=30]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 31 C
Drive Trip Temperature: 60 C
Accumulated power on time, hours:minutes 6367:42
Manufactured in week 01 of year 2022
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 34
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 291
Elements in grown defect list: 1
Vendor (Seagate Cache) information
Blocks sent to initiator = 3828
Blocks received from initiator = 1650689
Blocks read from cache and sent to initiator = 9094
Number of read and write commands whose size <= segment size = 29
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 6367.70
number of minutes until next internal SMART test = 53
Seagate FARM log supported [try: -l farm]
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.016           0
write:         0        0         0         0          0          6.889           0
Non-medium error count: 0
Pending defect count: 0 Pending Defects
The SMART Health Status error above is what triggered the detach of the drives from the RAID array.
Here is a screenshot from the log of one of the Dell servers (R740) showing 2 drives leaving the "chat" at precisely the same time. DST was not set on the server, which is why the time shows only 12:00.
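If you just want a quick per-drive check rather than a full audit, something like this works. It is a sketch only: it matches the exact asc=5d code from the health-status line shown above, and the device path in the usage comment is an example.

```shell
#!/bin/sh
# Quick check for the "impending failure" health string seen above
# (asc=5d is the additional sense code the affected drives reported).

check_health() {
  # $1 = output of "smartctl -H /dev/sdX" for one SAS drive
  case "$1" in
    *'asc=5d'*) echo "IMPENDING FAILURE: check/replace this drive" ;;
    *)          echo "OK" ;;
  esac
}

# Typical use (example device path):
#   check_health "$(smartctl -H /dev/sda)"
```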
Host-C - VPS & Storage VPS Services – Reliable, Scalable and Fast - AS211462
"If there is no struggle there is no progress"