HOST-C, Chat, Updates, Stuff


Comments

  • host_c Hosting Provider

    OK, official info has been sent to the remaining affected customers; please check your e-mail.

    Here is a timeline of the fuckup:

    July 27, 2025 – Afternoon (GMT+3):

    Multiple Seagate ST18000NM019J drives (firmware KM02) across two nodes suddenly powered down due to a firmware-related failure. Drives began reporting critical SMART alerts (Data channel impending failure), causing the RAID-6/60 array to become unavailable.

    Result:
    Addon storage volumes became inaccessible, and VPS services depending on those volumes were disrupted. Some NVMe-based systems also experienced write issues due to OS-level I/O buffering.

    July 28, 2025 – Morning:
    Our team accessed the datacenter, identified the fault, and began recovery efforts. All NVMe-only VPS services were successfully migrated to healthy nodes.

    July 28–29, 2025:
    RAID array access was restored in degraded mode, enabling partial access to addon volumes at limited transfer speeds.

    🧪 Root Cause

    Firmware fault affecting multiple ST18000NM019J (KM02) drives simultaneously

    RAID controller entered fault mode due to concurrent SMART failures

    No physical disk damage, no reallocated sectors or ECC errors — this was purely firmware-triggered

    🛡️ Mitigation Going Forward

    We are conducting a full infrastructure audit to identify any remaining ST18000NM019J drives with KM02 firmware

    Affected drives will be proactively replaced or updated, where supported

    RAID monitoring thresholds and firmware validation processes are being tightened to catch these failures earlier
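
    For anyone who wants to script a similar audit, here is a rough sketch in Python, not our actual tooling. It assumes you have already captured the text that `smartctl -i` prints for each drive, and it only checks the `Product` and `Revision` fields against the model/firmware pair named above. The names `AFFECTED`, `parse_identity` and `is_affected` are made up for illustration.

```python
# Hypothetical audit helper: flags drives whose smartctl identity
# section matches the model/firmware pair named in this post.
AFFECTED = {("ST18000NM019J", "KM02")}


def parse_identity(smartctl_info: str) -> dict:
    """Collect 'Key: value' lines from smartctl -i style text."""
    fields = {}
    for line in smartctl_info.splitlines():
        # Split on the first colon only, so values such as
        # timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_affected(smartctl_info: str) -> bool:
    """True if the Product/Revision pair is in the AFFECTED set."""
    fields = parse_identity(smartctl_info)
    return (fields.get("Product"), fields.get("Revision")) in AFFECTED
```

    Feed it the saved output of `smartctl -i /dev/sdX` for each disk. Drives behind a RAID controller may need an extra device-type option to be reachable at all, which this sketch does not cover.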

    This was an unprecedented firmware-level failure that bypassed typical RAID fault tolerance. We appreciate your understanding as we finalize recovery efforts for impacted systems.

    Here is the output from one of the drives; maybe it can help others check theirs if they use the same model. All 6 reported exactly the same error, had the same powered-on hours (~266 days), and were brand new.

    === START OF INFORMATION SECTION ===
    Vendor:               SEAGATE
    Product:              ST18000NM019J
    Revision:             KM02
    Compliance:           SPC-5
    User Capacity:        18,000,207,937,536 bytes [18.0 TB]
    Logical block size:   4096 bytes
    LU is fully provisioned
    Rotation Rate:        7200 rpm
    Form Factor:          3.5 inches
    Logical Unit id:      0x5000c500d8a51a07
    Serial number:        ZR57B8800000G20806CV
    Device type:          disk
    Transport protocol:   SAS (SPL-4)
    Local Time is:        Mon Jul 28 17:36:48 2025 UTC
    SMART support is:     Available - device has SMART capability.
    SMART support is:     Enabled
    Temperature Warning:  Enabled
    
    === START OF READ SMART DATA SECTION ===
    

    SMART Health Status: Data channel impending failure general hard drive failure [asc=5d, ascq=30]

    Grown defects during certification <not available>
    Total blocks reassigned during format <not available>
    Total new blocks reassigned <not available>
    Power on minutes since format <not available>
    Current Drive Temperature:     31 C
    Drive Trip Temperature:        60 C
    
    Accumulated power on time, hours:minutes 6367:42
    Manufactured in week 01 of year 2022
    Specified cycle count over device lifetime:  50000
    Accumulated start-stop cycles:  34
    Specified load-unload count over device lifetime:  600000
    Accumulated load-unload cycles:  291
    Elements in grown defect list: 1
    
    Vendor (Seagate Cache) information
      Blocks sent to initiator = 3828
      Blocks received from initiator = 1650689
      Blocks read from cache and sent to initiator = 9094
      Number of read and write commands whose size <= segment size = 29
      Number of read and write commands whose size > segment size = 0
    
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 6367.70
      number of minutes until next internal SMART test = 53
    
    Seagate FARM log supported [try: -l farm]
    
    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:          0        0         0         0          0          0.016           0
    write:         0        0         0         0          0          6.889           0
    
    Non-medium error count:        0
    
    Pending defect count:0 Pending Defects
    

    The SMART Health Status error above (asc=5d, ascq=30) triggered the detach of the drives from the RAID array.

    Here is a screenshot from the log of one of the Dell servers (R740) showing 2 drives leaving the "chat" at precisely the same time. DST was not set on the server, which is why the time shows only 12:00.
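
    If you want to watch for the same condition on your own disks, the sketch below pulls the health status, the asc/ascq pair and the power-on time out of output shaped like the dump above. It is an illustration only: the function names are hypothetical, and the regular expressions assume the exact line layout shown in this post.

```python
import re

# Matches e.g. "SMART Health Status: OK" or
# "SMART Health Status: <text> [asc=5d, ascq=30]".
HEALTH_RE = re.compile(
    r"SMART Health Status:\s*(?P<status>.*?)"
    r"(?:\s*\[asc=(?P<asc>[0-9a-fA-F]+),\s*ascq=(?P<ascq>[0-9a-fA-F]+)\])?"
    r"\s*$",
    re.MULTILINE,
)
POWER_ON_RE = re.compile(
    r"Accumulated power on time, hours:minutes\s+(\d+):(\d+)"
)


def health_status(text):
    """Return (status, asc, ascq), or None if no health line is found."""
    m = HEALTH_RE.search(text)
    if m is None:
        return None
    # asc/ascq are printed as hex; they are absent on a healthy drive.
    asc = int(m["asc"], 16) if m["asc"] else None
    ascq = int(m["ascq"], 16) if m["ascq"] else None
    return m["status"], asc, ascq


def power_on_days(text):
    """Convert the hours:minutes power-on counter to days, or None."""
    m = POWER_ON_RE.search(text)
    if m is None:
        return None
    hours = int(m[1]) + int(m[2]) / 60
    return hours / 24
```

    Anything other than a status of `OK` is worth an alert. In this incident every affected drive returned asc 0x5d with ascq 0x30.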

    Host-C - VPS & Storage VPS Services – Reliable, Scalable and Fast - AS211462

    "If there is no struggle there is no progress"
