ZFS RAID10 + cache?
I've just installed a Proxmox node and I have 4x SSDs + 1 NVMe
I've set up the SSDs with ZFS RAID10 and I'm not really sure what to do with the NVMe.
The AI says "It can be used as a ZFS cache (L2ARC) if you are going to work with ZFS."
In your experience, will that ZFS cache make things faster? I have 128GB RAM BTW.
Comments
Nope. For an NVMe to be used as cache, it has to be a little more special than an ordinary NVMe.
I would just use it as NVMe storage for VPSes that have backups in case it fails, or for non-critical VMs, in short.
ZFS loves RAM, as it uses that for cache. RAM is low latency; compared to that, NVMe is not, so it makes no sense to do that. You will lose more than you will gain.
As you have 128 GB RAM and a RAID 10 setup, I would also limit ARC to 16 GB at most. This way you keep a good balance of performance (RAID 10) and will not burn too much RAM on ZFS, so you have 100 GB+ left for VM RAM.
Please do not leave the ZFS ARC in Proxmox on auto/dynamic; it does not work as you expect (shit is the right word).
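For reference, the 16 GB cap is a single module parameter, zfs_arc_max, in bytes (the exact numbers below are just an example, adjust to your box):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184   # 16 GiB
options zfs zfs_arc_min=8589934592    # optional 8 GiB floor

# make it persist across reboots, then apply it live without rebooting
update-initramfs -u -k all
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max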
Other than this, congratulations on choosing the most correct way to use a 4-drive setup.
Cheers!
Host-C - VPS & Storage VPS Services – Reliable, Scalable and Fast - AS211462
"If there is no struggle there is no progress"
@host_c always helpful. thank you. I always forget to limit the ARC.
What a beast of a server, congrats @imok
I believe in good luck. The harder I work, the luckier I get.
If you are asking this here, don't worry about cache; 4x SSDs should be enough speed. Just make sure that with all the VPSes running you still have plenty of RAM available. Proxmox 8.1+ will use 10% of RAM for ARC by default, but you can always modify this to use more. If you still decide to use NVMe as L2ARC, do yourself a favor and use two (2) of them in a mirror.
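For reference, adding a cache device is a one-liner; "tank" and the device path below are just placeholders for your pool and NVMe:

zpool add tank cache /dev/disk/by-id/nvme-YOUR_DRIVE   # attach L2ARC
zpool status tank                                      # it shows up under "cache"
arcstat 5                                              # watch ARC behaviour before deciding to keep it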
Offshore Hosting & High Privacy in Panama
Why does the NVMe need to be special, and how special?
I bench YABS 24/7/365 unless it's a leap year.
TBW (Total Bytes Written) — and keep in mind, everyone lies in specs.
Latency — this is the real killer.
If you use a consumer-grade NVMe like the Samsung EVO (or any "Pro"-branded drive), it will wear out quickly — especially depending on your read/write I/O patterns. Writes, in particular, will degrade it fast.
Even if it's a Gen4 NVMe capable of sustaining 5–8 GB/s, it’s still a poor choice — because RAM always wins in terms of latency.
Using a data center-grade NVMe (like Intel DC P-series or similar) is overkill. Just use more RAM.
In practically every situation, what ZFS truly needs is RAM, not an NVMe cache. NVMe cache and similar options are mostly "marketing-driven features" requested by the community. From a performance standpoint, they offer minimal real-world benefit and only add complexity to the setup.
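A quick sanity check before spending anything: if the ARC hit ratio is already high, an L2ARC has almost nothing left to cache. Both tools below ship with the ZFS utilities on Proxmox:

arc_summary | less    # ARC size, hit ratio, MRU/MFU breakdown
arcstat 5             # live per-interval reads, misses and ARC size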
L2ARC (NVMe used as read cache) and ZIL/SLOG (used for synchronous write logging) can offer benefits in very specific workloads, like NFS or databases. And here comes the "but": my reply here takes the following into consideration:
Unless you wish to achieve 40 Gbps and above over whatever protocol, just use RAM.
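If you do land in that niche, the shape of it is below; pool name and device IDs are placeholders, and remember a SLOG only ever helps synchronous writes:

# mirrored SLOG for sync-heavy workloads (NFS, databases); async writes bypass it entirely
zpool add tank log mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B
zpool iostat -v tank 5     # check how much write traffic you actually have first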
ZFS was fundamentally designed to benefit from low-latency cache — and that means RAM. Everything else is just improvisation.
As a general rule — not just for storage but in many areas — latency is the real killer. Latency always beats raw read/write throughput.
You’re much better off with storage that has sub-1ms latency and delivers 50 MB/s than something with 3–4ms latency but 500 MB/s. The system feels snappier, more responsive, and performs better in real-world scenarios.
Even if you plan on mostly sequential reads and writes, give it a few months — fragmentation sets in, workloads become more random, and suddenly you’re dealing with lots of random I/O. At that point, low latency becomes even more critical.
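An easy way to see that difference on your own pool (ioping may need an apt install ioping first; /tank is a placeholder mountpoint):

ioping -c 20 /tank    # per-request latency, average and worst case
ioping -R /tank       # request-rate test, dominated by latency rather than bandwidth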
PS: after you figure out the storage part, the other important thing to consider is "delivery" of that storage to its destination, as that is just as important.
I loved Fibre Channel (the protocol). Why? Because it was built for storage (although InfiniBand is definitely worth looking into, and RDMA is also good).
We have 8/16 Gbps FC setups running today (10+ years old) that can push performance close to NVMe, and that performance is delivered to 3-6 VMware nodes (still running 5.x or 6.x). Why not upgrade? Because there is absolutely no need for the customer to do so.
I GPT-ed a comparison between FC and storage over Ethernet:
When to Use What?
Use Fibre Channel when:
You already have an FC SAN.
You need rock-solid performance for transactional workloads.
You have skilled staff and budget for it.
Use Ethernet when:
You prefer flexibility and convergence (single fabric for data and storage).
You use cloud, hyperconverged, or scale-out solutions.
You want to avoid FC infrastructure costs.
Fibre Channel (FC) vs Ethernet – Core Comparison
The good old times.
What da hail you talking about, mine still looks this way!!!
Free Hosting at YetiNode | MicroNode| Cryptid Security | URL Shortener | LaunchVPS | ExtraVM | Host-C | In the Node, or Out of the Loop?
Awesome timing. Been wondering about this too for a project.
I suspect that with the main pool being SSD, and thus already pretty fast, any sort of caching layer beyond RAM is of limited benefit. So I'm thinking more metadata & small files (vaguely recall that special vdevs can do both at the same time). Think metadata on a fast device would improve perceived snappiness. An Optane drive for that would be ideal, but I haven't quite figured out whether that's worth it. If all the metadata is on there then I probably need two for risk. At which point it's quite an elaborate & pricey setup for possibly not a huge benefit. Plus that needs more PCIe lanes. idk..
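On paper the idea would look roughly like this (pool/dataset names are placeholders, and I'd mirror the special vdev since losing it loses the pool):

# mirrored special vdev: holds metadata, and optionally small blocks too
zpool add tank special mirror /dev/disk/by-id/nvme-optane-A /dev/disk/by-id/nvme-optane-B
# also send blocks of 64K or smaller from this dataset to the special vdev
zfs set special_small_blocks=64K tank/smallfiles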
Anybody got a good way to test this sort of ZFS thing? A straight speed test won't necessarily reflect the trade-offs well (big files vs, say, many small ones). Plus anything involving cache is hard to test anyway.
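Something like fio for the small random reads plus a plain find for the metadata feel might work; paths are placeholders, and --size needs to be bigger than ARC (or export/import the pool between runs) so it's not just benchmarking RAM:

fio --name=smallrand --directory=/tank/bench --rw=randread --bs=4k --size=8G \
    --numjobs=4 --iodepth=16 --ioengine=libaio --runtime=120 --time_based --group_reporting
time find /tank/somedataset -type f | wc -l    # metadata-heavy "snappiness" proxy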
Used to feel better after. And if I see B then it's a sign to change.
Since you wish to use parity RAID (Zx), you will have next to no gain from whatever cache you wish to add, other than a high CPU clock and RAM. You will have a latency penalty from having to do the math on the striped blocks for RAID-Zx (and the math is done by a CPU that has to deal with other stuff too, all of this over a software stack).
A RAID 10 of those SSDs would beat it all performance-wise.
If you wish to counterbalance the penalty of parity RAID with SSDs: nope, it will not do the trick. Been there, done that, but feel free to share your findings. Since you will use consumer SSDs (6 Gbps link) you will be limited in terms of speed by that, plus the parity math for blocks striped to the pool, done by an x86 CPU (regardless of type, make and model).
There is no magic way to do high-performance RAID while sacrificing as few drives as possible for data integrity. You either do RAID 10, or stick with Zx; putting the Zx vdevs in a stripe (like RAID 50/60) will give you some performance boost, but I doubt it will be anything noticeable (first the "controller" has to divide the data blocks, then do the parity, and then write them, every cycle). See the sketch below for the two layouts.
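A minimal sketch of the two layouts on 4 drives (sda..sdd are placeholders; in real life use /dev/disk/by-id paths):

# striped mirrors ("RAID 10"): best IOPS and latency, 50% usable space
zpool create tank mirror sda sdb mirror sdc sdd
# RAID-Z2: same usable space on 4 drives, but every write pays the parity math
zpool create tank raidz2 sda sdb sdc sdd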
If this is a local storage setup it will be fine. If you wish to serve it to other nodes (NFS, CIFS), you also add the transport problems on top of it, and the outcome will be... well, not what you expect (you will hardly saturate a 10 Gbps line after fragmentation, even if you use SSDs, and all of this only if you go MTU 9000). Aaa... almost forgot: if you do transport storage via Ethernet, Nexus and Juniper are your second and third options; the first one is Arista. Be prepared to spend some $$ on electricity, as those switches consume some power. Also, you will need cards from Chelsio or another premium manufacturer to go beyond 10G. (Any RAID type with any number of spinning-rust drives will saturate a 1G line any day of the week; things get hard past 5 Gbps.) Jumbo frames help, but fragmentation, protocol overhead, and latency almost always win. Sad but true.
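Before blaming the pool, check what the wire itself can do; eth0 and storage-host are placeholders, and the MTU has to match on every hop, switches included:

ip link set dev eth0 mtu 9000        # jumbo frames on the storage NIC
iperf3 -s                            # on the storage host
iperf3 -c storage-host -P 4 -t 30    # from a client: 4 parallel streams for 30 s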
All of the above is mostly the reason why ZIL, SLOG and the rest were added to ZFS, to counterbalance the flaws/limitations, but in real life they do not work, or the performance gained per USD spent is a joke. (A real-life test is a 20+ TB pool shared to something like 50 I/O-hungry VMs, like storage VPS customers.)
It is not that ZFS is not good, it is pure math and physics behind how storage works. I know this was not the answer you wished for, but again, feel free to experiment yourself (just don't break the bank while doing it).
PS:
The holy trinity of storage design—capacity, performance, integrity—can’t all be maxed at once. You choose two, and live with the tradeoffs.
PPS:
The above is the perfect marketing presentation of how awesome it will be, and in real life it will look like this:
and the end user will only see this
You clearly know a lot about this!
But I think you misunderstood me. I'm not thinking cache, I'm leaning more towards metadata/small files on Optane, i.e. that data never goes to the main pool.
Well, given that I did a non-ECC build, I think integrity is already looking a touch shaky lol. And yeah, home storage NFS. Probably 2.5GbE, so a pretty amateur build anyway.
@imok - sorry kinda hijacked your thread a bit
Well, since you went non-ECC I would say the build is good for saving videos, and I would stop there.
If you give us some details about the number of drives, CPU and RAM, and share the use case and protocol used, I can tell you what I would do, if you wish to hear me out.
Also, @imok is a cool fella, I doubt he would mind having this chat here on his thread, is that correct?
@host_c Thanks!
Use case is various homelab stuff. The relevant one for this discussion is that probably half a dozen nodes will be using this as an S3/object storage backend for k8s, hence the interest in fast small files and snappiness. NFS too, but most of the LAN is 2.5GbE so no wild throughput expectations... the base SATA pool should saturate that.
Ancient Asus X99 platform / 5960X / 64GB 3200 / 3x 1.6TB Intel DC S3500 SATAs, which will be the main pool. Boot on another, smaller S3500. Dual 10 gig Ethernet, but most of the network is 2.5 so kinda irrelevant.
Have 3x P1600X lying around, so could use one or two of those
I can buy a P4800X and connect it via U.2, but it would need to be a single one, because two would be more than I'd like to spend.
Think I'd need to figure out some sort of small-file test and try to test this all somehow. Because if there isn't a big diff vs the SATA pool then it's all academic.
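Crudest version I can think of, assuming a pool called tank: run it with and without the special vdev, and export/import the pool (or reboot) between runs so ARC doesn't hide the difference:

zfs create tank/smalltest
time sh -c 'for i in $(seq 1 20000); do head -c 4096 /dev/urandom > /tank/smalltest/f$i; done'
time ls -lR /tank/smalltest > /dev/null    # metadata-heavy listing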
No customers sending me angry tickets & no super critical data. I'll build an ECC server in maybe 1.5 years when the desktop is due a refresh & I can cannibalize that for an ECC mobo and ECC CPU.
No worries, it's interesting to read you guys. Even if I don't understand most of it 😅
It was actually an information overload, but a good one.
Lucky you, bastard

haha indeed. (though I do have angry customers in day job that is funding my homelab shenanigans...)
Anyway... no storage testing for now. XMP on this board seems broken, so I guess I'm figuring out 30+ memory timing settings by hand.
If you ever feel the need for some angry mobs, ping me.
Other than that, have fun with your project, just please don't break the bank, it is not worth it.
Cheers
@imok, thanks for being cool on your thread.
Damn, I am so happy my day job does not require me to interact with customers... Respect goes to you guys who can deal with angry/ignorant people without blowing a fuse!
Never make the same mistake twice. There are so many new ones to make.
It’s OK if you disagree with me. I can’t force you to be right.
Me and Stuart after a day that has no "angry mob" tickets