Well, I remember paying over $1,000 for a 1 TB NVMe not THAT long ago, and market pressure is set to push prices way higher than they were back when 1 TB was a big NVMe, so... interesting times ahead.
TierHive - Hourly VPS - NAT Native - /24 per customer - Lab in the cloud - Free to try. | I am Anthony Smith
FREE tokens when you sign up, try before you buy. | Join us on Reddit
@havoc
I have played with 2x 24 SSDs (128 to 256 GB) in a Dell MD1220 + Optane or NVMe as cache, 256 GB RAM, a high-frequency CPU, and every ZFS tweak you can imagine. Used Debian, Ubuntu, FreeNAS, TrueNAS Core/Scale.
Same for 12 or 24 HDDs in a Dell MD1200.
Sincerely, a total waste of time.
The second you hit it with random high IO from VMs over 2x 10 Gbps or 2x 25 Gbps, it will utterly suck.
Storage exported via iSCSI was much faster than NFS, but I did not fall off my chair performance-wise.
Other than the ability to extend a pool on the fly by replacing drives with larger ones, and the strong data integrity ZFS gives you, I personally see no real value in it for our use case.
My 2c after burning far too many hours in tests.
Host-C | Storage by Design | AS211462
“If it can’t guarantee behavior under load, it doesn’t belong in production.”
I spent about 4 weeks tuning and exporting storage over 40 GbE from TrueNAS using iSCSI, and the performance in Proxmox was always abysmal; switched to an NFS export and got 7 Gbps throughput (up from 1.3 Gbps) and a 5x increase in available IOPS.
Any idea what I did wrong?
You do seem better equipped than me on hardware, and I don't doubt that you're in a better position to test this, but I think you may have missed the plot here.
The point of my post was that throughput doesn't measure real experience. So if you hit me with "The second you hit it with random HIGH IO", then I think you didn't read my post.
iSCSI writes are synchronous (SYNC) by default, and you most probably mounted NFS ASYNC, hence the 3-7x difference. That slow performance over iSCSI is, sincerely, the real performance of your RAID array (I presume 4 disks?). ASYNC is RAM-cached; SYNC is flushed to disk in real time, plus an ACK back to the node that it got written - much, much slower, yet much safer integrity-wise in case of a power outage or a kernel stall on either end.
iSCSI (sync) = disk performance - the hard-core truth

NFS (async) = RAM performance - the nice multi-Gbps you wish to see, but it is a lie
NFS can also be SYNC; in that case it will behave like iSCSI SYNC, yet most use it ASYNC.
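You can feel that SYNC vs. ASYNC gap on any Linux box without touching iSCSI or NFS at all. The sketch below is a hypothetical micro-benchmark, not a storage test: an `O_SYNC` write must reach stable storage before the call returns (the "iSCSI sync" behavior), while a plain buffered write usually just lands in the page cache, i.e. RAM (the "NFS async" behavior).

```python
import os
import tempfile
import time

def timed_writes(extra_flags: int, count: int = 200, size: int = 4096) -> float:
    """Time `count` writes of `size` bytes with the given open(2) flags."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd = os.open(path, os.O_WRONLY | extra_flags)
    buf = b"x" * size
    start = time.perf_counter()
    for _ in range(count):
        os.write(fd, buf)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

buffered = timed_writes(0)           # "ASYNC" feel: the page cache absorbs it
synced = timed_writes(os.O_SYNC)     # "SYNC" feel: each write waits for the media

print(f"buffered: {buffered:.4f}s, O_SYNC: {synced:.4f}s")
```

On spinning disks the O_SYNC number is typically orders of magnitude worse; that difference is roughly what you traded away when moving from sync iSCSI to async NFS.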
Also, Proxmox's iSCSI implementation is a bit... not the best in my view compared to VMware or Microsoft (no offense folks, it is the truth).
Attention! Proxmox does NOT have a clustered filesystem like VMFS5/6.
iSCSI is block access, like physically having those disks in the server you mounted them on (well, not disks - an exported LUN).
It is NOT filesystem-aware!
If you mount the same LUN on 2 servers, bingo, they will write data on each other's blocks. There are workarounds like GFS2, yet it is hardly worth the effort.
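A toy sketch of why that corrupts data (a plain file stands in for the LUN here - this is an illustration of the lost-update pattern, not real iSCSI): two "nodes" each cache the shared block, each rewrites the whole block from its own stale copy, and the last writer silently destroys the other node's update.

```python
import os
import tempfile

fd, lun = tempfile.mkstemp()
os.write(fd, b"--------")          # the shared 8-byte "block" on the LUN
os.close(fd)

def read_block() -> bytearray:
    with open(lun, "rb") as f:
        return bytearray(f.read())

def write_block(block: bytearray) -> None:
    with open(lun, "wb") as f:
        f.write(block)

node_a = read_block()              # both nodes cache the same block...
node_b = read_block()

node_a[0:4] = b"AAAA"              # node A updates its half and persists it
write_block(node_a)

node_b[4:8] = b"BBBB"              # node B writes back its STALE copy,
write_block(node_b)                # unaware of node A's change

final = read_block()
os.unlink(lun)
print(bytes(final))                # node A's write is gone: b'----BBBB'
```

A clustered filesystem like VMFS or GFS2 exists precisely to add the locking that this picture is missing.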
I would personally avoid NFS for VM storage as well!
Why:
Proxmox mounts NFS persistent (hard); if the NFS share becomes unavailable (NAS reboot, switch reload, VLAN issue - the usual stuff that can happen), the mount can enter D state. You are stuck, and the only fix is a node reboot, as the kernel will wait indefinitely.
Over time, the node may start misbehaving (hung tasks, blocked I/O, GUI lag - the usual shit-show).
Yes, that persistence is part of the NFS consistency model, and that is a very good thing, but operationally it can be painful. Once you hit it a few times and your cluster crashes because the ISO repository is not accessible anymore, you will wipe it out of your config completely - we did.
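For what it's worth, the hang behavior can be softened with mount options. The line below is a sketch with illustrative values (server name, export path, and timeouts are placeholders), and whether `soft` is acceptable for VM disks is debatable, since it surfaces I/O errors to the guest instead of blocking forever:

```shell
# Default NFS behavior is 'hard': processes block in D state until the
# server comes back. 'soft' with timeo/retrans gives up after a few
# retries and returns an error instead of hanging the node.
mount -t nfs -o soft,timeo=100,retrans=3 nas:/export/iso /mnt/iso
```

That trade-off (possible I/O errors vs. possible D-state hangs) is exactly why many people keep `hard` for VM disks but use `soft` for less critical shares like an ISO repository.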
iSCSI was designed for shared storage to a physical server; that is why you even have iSCSI built into the LAN card on servers, so you can boot the OS from iSCSI while having 0 drives in the server.
Also, I will be an ass, but...

Depending on the config:
NFS outperforms iSCSI
iSCSI outperforms NFS
Both perform terribly in my view, as they are a bit tricky to maintain.
If you want simplicity -> local RAID still wins on latency.
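If you want to measure that latency gap yourself rather than trust throughput numbers, one way is a small sync random-write fio run against each backend in turn. This is a sketch - the path and sizes are placeholders, and you would point `--filename` at local RAID, then at the NFS mount, then at the iSCSI-backed disk:

```shell
# 4k random writes with an fsync after every write - roughly what a
# database-like VM workload does. Compare the completion latency
# ("clat") between backends, not the bandwidth line.
fio --name=synctest --filename=/mnt/test/fio.bin --size=1G \
    --rw=randwrite --bs=4k --iodepth=1 --fsync=1 \
    --runtime=30 --time_based --group_reporting
```

With `--fsync=1` and `--iodepth=1` the RAM cache cannot flatter anyone, which is why this kind of job exposes the "wow vs. mehhh" difference the throughput graphs hide.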
Small note:
Fibre Channel was purpose-built for SAN and delivered the most stable and low-latency shared storage - and I loved it, as it truly was a set-up-and-forget setup.
@havoc
Sorry for that, I did miss by a mile, and that was not my point. I thought you had encountered the same issues as we did: expecting wow, while getting mehhhh.
Host-C | Storage by Design | AS211462
“If it can’t guarantee behavior under load, it doesn’t belong in production.”