Hetzner NVMe failing
Heya,
I've got a question; I tried researching it on the interwebz, but it didn't lead anywhere. So, I have quite a few servers with Hetzner. Around a week back, I got an alert (from monitoring software) that 3 separate NVMes on 3 dedicated servers had failed. Since they were running in RAID 1, I scheduled disk replacements and that was done. FYI, the % used on all the drives was less than 20%; one was in single digits. Today, I got alerted again that yet another NVMe (the "new" NVMe on the same dedicated server) had failed. I'm posting the SMART stats here if it helps:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.13-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB512HAJQ-00000
Serial Number: S3W8NX1M954021
Firmware Version: EXA7301Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 242,402,029,568 [242 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 8991c42437
Local Time is: Thu Jan 30 14:14:58 2020 UTC
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.02W - - 0 0 0 0 0 0
1 + 6.30W - - 1 1 1 1 0 0
2 + 3.50W - - 2 2 2 2 0 0
3 - 0.0760W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 36 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 2,827,660 [1.44 TB]
Data Units Written: 1,561,851 [799 GB]
Host Read Commands: 2,721,446,668
Host Write Commands: 67,041,131
Controller Busy Time: 2,403
Power Cycles: 13
Power On Hours: 184
Unsafe Shutdowns: 3
Media and Data Integrity Errors: 1
Error Information Log Entries: 6
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 36 Celsius
Temperature Sensor 2: 45 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 6 2 0x0198 0x4502 0x000 606758440 1 -
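For anyone reading the output above: the "Critical Warning" field is a bitmask, and the "Data Units" counters are in thousands of 512-byte blocks. Below is a minimal sketch of how those two fields can be decoded. The bit descriptions are paraphrased from my reading of the NVMe SMART/Health log definition and the function names are made up for illustration, so treat it as a reading aid rather than a reference.

```python
# Bit meanings of the "Critical Warning" byte in the NVMe SMART/Health log
# (log page 0x02). Descriptions are paraphrased, not quoted from the spec.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the descriptions of every bit set in the warning byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]


def data_units_to_bytes(units):
    """One reported data unit is 1000 blocks of 512 bytes."""
    return units * 1000 * 512


# Values taken from the smartctl output in the post.
print(decode_critical_warning(0x04))
print(data_units_to_bytes(1_561_851) / 1e9, "GB written")
```

With the values from the post, 0x04 maps to the single "reliability degraded" bit, which matches the text smartctl printed under the FAILED line, and the written counter works out to roughly 799 GB.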
Now this is very, very, very odd. Among the multiple providers I have used around the world, I have never had anything like this. A seemingly new NVMe, with less than 200 power-on hours, fails? Point to note: the drives that failed were all the exact same model. Might be a bad batch..?
If you guys have any thoughts on this, it would be highly appreciated.
Comments
Could be a bad batch of disks, but based on a quick bit of research on that error/status, it could also be a controller issue on the boards, which, given your description, frankly sounds more likely.
The disks may be fine.
https://inceptionhosting.com
Please do not use the PM system here for Inception Hosting support issues.
AFAIK, Hetzner uses PCIe risers/extenders to plug the NVMe drives in, might be smth on there.
Could be. I guess if you have had 3 separate identical issues, it is likely that others have seen the same (if they even noticed), and perhaps they have a wider investigation going on.
I've opened a ticket with Hetzner specifically for this (probably should have done that first), I'll update here if I get any useful response from them. Meanwhile, tagging @Hetzner_OL to grab their attention.
Welp, no luck there, it's just Proxmox running, on ext4, if anyone's interested.
What monitoring/alert? If it's Hetrix and you have all checks enabled it will notify you if it's doing anything like checking/resync. What did
cat /proc/mdstat
show? I've never had a problem with Hetzner's drives.
ExtraVM
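A rough sketch of what checking /proc/mdstat for a degraded mirror can look like in a script. The sample text below is hypothetical (it is not the OP's output, which isn't shown in the thread), and the function name is made up; the only thing it relies on is the usual `[n/m] [UU]` status tail that md prints for each array.

```python
import re

# Status tail of an array entry, e.g. "[2/2] [UU]" (healthy) or "[2/1] [U_]" (degraded).
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
HEADER_RE = re.compile(r"^(md\S+)\s*:")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member map shows a missing device."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample, shaped like typical /proc/mdstat content.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      523264 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme0n1p2[0]
      499443712 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

Keep in mind that mdstat only shows what md itself has kicked out of the array; a drive can report a failed SMART status while still being listed as an active member.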
It's not Hetrix, I just have a script that checks the SMART stuff. /proc/mdstat was normal, nothing odd on there.
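The script itself isn't posted in the thread, so the following is only a guess at what a check like that might do: take the text that `smartctl -a` prints for an NVMe device and pull out the overall result, the Critical Warning byte, and the media error count. The function names and the alert rule are my own assumptions; the field labels are the ones visible in the output in the first post.

```python
def parse_nvme_health(smartctl_text):
    """Extract a few health fields from smartctl's plain-text NVMe output."""
    result = {"passed": None, "critical_warning": None, "media_errors": None}
    for raw in smartctl_text.splitlines():
        line = raw.strip()
        if line.startswith("SMART overall-health self-assessment test result:"):
            result["passed"] = line.split(":", 1)[1].strip() == "PASSED"
        elif line.startswith("Critical Warning:"):
            # Printed as hex, e.g. "0x04".
            result["critical_warning"] = int(line.split(":", 1)[1].strip(), 16)
        elif line.startswith("Media and Data Integrity Errors:"):
            # Counters may contain thousands separators.
            result["media_errors"] = int(line.split(":", 1)[1].strip().replace(",", ""))
    return result


def should_alert(health):
    """Assumed rule: alert on a failed result, any warning bit, or any media error."""
    return (
        health["passed"] is False
        or bool(health["critical_warning"])
        or bool(health["media_errors"])
    )


# Lines copied from the SMART output in the first post.
SAMPLE = """\
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
Critical Warning: 0x04
Media and Data Integrity Errors: 1
"""

health = parse_nvme_health(SAMPLE)
print(health, should_alert(health))
```

In practice the text would come from running smartctl against each device (for example via `subprocess`), with the device paths depending on the server.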
When in doubt, sue.
♻ Amitz day is October 21.
♻ Join Nigh sect by adopting my avatar. Let us spread the joys of the end.
Mary, or baby? I love babysue.
My pronouns are like/subscribe.
By any chance, do you have the same model of NVMe drives (SAMSUNG MZVLB512HAJQ-00000)? And here's the /proc/mdstat output:
@SagnikS One of my Ryzen 3700X servers has a Toshiba KXG60ZNV1T02, the other has a Samsung MZVLB1T0HALR. The i9-9900K has a Samsung MZVLB1T0HALR as well. The Threadripper has an MZQLB960HAJR.
So no, not the same ones you have.
ExtraVM
Ah gotcha, they're 1TB or more ig. These are 500GB ones.
Sorry, since I'm not a technician myself and I don't have direct access to the information you've shared with our support team, it's a bit difficult to comment on this situation.
I assume that you shared all the information that you could with our team, including the troubleshooting that you've already tried, right? If not, please do that. Maybe there's something else that will turn up. You could also consider writing a post in our customer Discussion Forum. If other customers with NVMes have had similar issues, they'll let you know, or they'll give you some other ideas to try out. (Many of our oldest clients are from Germany, which is why there is so much German in this Forum, but most readers speak English. Just make sure to share what you've already tried out.) --Katie
We're Katie and Lea and we'll do our best to answer questions you have about Hetzner Online. We and not our employer are responsible for any horrible puns and dated cultural references.
We had major issues with this model of the drive, with frequent failures across a large number of the drives (if they work - they work well, but some fail quickly, badly, and early if they don’t). We have since stopped providing new services with this model.
Clouvider Limited - VPS in 11 datacenters - Intel Xeon/AMD Epyc with NVMe and 10G uplink! | Dedicated Servers
Warning warning NVMe warning!!!!
I bench YABS 24/7/365 unless it's a leap year.
The Samsung ones?
Similar experience with ours too, glad to know it's an issue with the NVMe itself. Just to confirm, it's this right: MZVLB512HAJQ?
I think that's the ones I had that went bye bye with zero warning iirc?
Aye.
Are you guys talking about our ex?
I really don't know if there's anything to troubleshoot at all when an NVMe fails, and I got this response from support:
I really hate that sort of support.
Roughly translated: "Dear Customer 104582, I have looked into nothing, I am really just trying to find a reason for this not to be my problem so I can close the ticket"
I mean yeah, I narrowed it down to some of them being particularly sensitive to running hot, so if you had a “less resilient” drive, and you hammered it, you would run it hot and then through your own use you’d destroy it, but hey, this wasn’t happening on PM961, only on PM981, so clearly this is not a user caused issue...
I guess I have to put in a request to mix manufacturers when configuring nvme mirrored pools.
None of my ex failed so swiftly, badly and early like (allegedly) this NVMe model
when you're younger that's something you should absolutely try I guess
Yep, however, a friend of mine was told that Hetzner doesn't have anything in stock other than those Samsung NVMes.
Looks like the closest I have is:
2x SAMSUNG MZVLB1T0HALR-00000
The rest are:
KXG60ZNV1T02 TOSHIBA
Looks good so far:
https://clbin.com/u9u2n
I've sure put them through a lot. Uptime of 232 days, don't think I've rebooted this machine much.
Do everything as though everyone you’ll ever know is watching.
Probably something to do with that exact 512GB model.
This doesn’t affect as many 1TB ones; the 256GB and 512GB ones, however, are/were a problem.
He has to use stronger softwares.
Hardwares are BIG.
The bigger the computer, the better.
Softwares are strong.
Made by heavy duty programmers.
Efficiency is through the roof.
Heating during the winter from servers.
???
Profit is millions.