Inconsistent fio results on similar VPSes on the same node
I'm seeing some inconsistency in the fio tests from yabs runs done today on similar-spec VPSes on the same node. Please see three examples below.
I'm unclear on what's happening: whether it's related to any single VPS that I happen to be testing, to some other VPS or node process doing heavy file I/O at certain times, or to something else entirely.
I've been watching iotop -b -d 3 -o a little. So far, no obvious insight.
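If it helps, this is roughly the batch logging I plan to try next; a minimal sketch, with the log path and iteration count chosen arbitrarily:

  # log only processes doing I/O, with timestamps, every 3 s for ~10 minutes
  iotop -b -o -t -qqq -d 3 -n 200 > /tmp/iotop.log
  # then look for anything reading or writing in the MB/s or GB/s range
  grep -E '[0-9.]+ (M|G)/s' /tmp/iotop.log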
Ideas? Thanks!
fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 193.28 MB/s  (48.3k) | 1.78 GB/s    (27.8k)
Write      | 193.79 MB/s  (48.4k) | 1.79 GB/s    (28.0k)
Total      | 387.07 MB/s  (96.7k) | 3.57 GB/s    (55.9k)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 2.12 GB/s     (4.1k) | 2.18 GB/s     (2.1k)
Write      | 2.23 GB/s     (4.3k) | 2.33 GB/s     (2.2k)
Total      | 4.35 GB/s     (8.5k) | 4.51 GB/s     (4.4k)
fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 193.53 MB/s  (48.3k) | 1.95 GB/s    (30.4k)
Write      | 194.04 MB/s  (48.5k) | 1.96 GB/s    (30.6k)
Total      | 387.57 MB/s  (96.8k) | 3.91 GB/s    (61.1k)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 957.00 KB/s      (1) | 18.49 MB/s      (18)
Write      | 1.12 MB/s        (2) | 20.26 MB/s      (19)
Total      | 2.07 MB/s        (3) | 38.75 MB/s      (37)
fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 16.51 MB/s    (4.1k) | 1.84 GB/s    (28.7k)
Write      | 16.52 MB/s    (4.1k) | 1.85 GB/s    (28.9k)
Total      | 33.03 MB/s    (8.2k) | 3.69 GB/s    (57.7k)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 2.11 GB/s     (4.1k) | 2.19 GB/s     (2.1k)
Write      | 2.22 GB/s     (4.3k) | 2.34 GB/s     (2.2k)
Total      | 4.34 GB/s     (8.4k) | 4.53 GB/s     (4.4k)
I hope everyone gets the servers they want!
Comments
Does this performance drop happen for all I/O to the physical drives on the host machine, or is it only some virtual machines that see the I/O drop occasionally? Try checking with iostat (rough example below).
Check our KVM VPS plans in 🇵🇱 Warsaw, Poland and 🇸🇪 Stockholm, Sweden
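A rough example of the kind of iostat check suggested above; this assumes the sysstat package is installed, and the device names are just examples:

  # extended per-device stats every 5 seconds, skipping idle devices
  iostat -xz 5
  # on the host, watch %util and await/r_await/w_await on the physical drives and the RAID device;
  # inside a VM, watch the same columns for vda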
Do you have access to the node itself?
I reckon it's an actual flaky hardware issue.
512k reads returning 957.00 KB/s sometimes and 2 GB/s other times doesn't feel like a noisy neighbor issue. And even within the VPS, the numbers for #2 don't make sense. More throughput for 64k than for 1m?
Are you sure that even if on same node they're on same storage backend?
I had a failing hard drive that caused a similar issue, so I'm inclined to go with @havoc's suggestion. If you have allocated space for your VMs such that one or two of the VMs' data partitions fall within the part of the drive with bad blocks, you can run into this issue. Do note that this only applies if you are on a legacy HDD and not an SSD.
Never make the same mistake twice. There are so many new ones to make.
It’s OK if you disagree with me. I can’t force you to be right.
Definitely a good question whether the host machine is affected, or only the hosted virtual machines.
One of the pretty smart and well-experienced users told me this morning that he has recently started seeing SCSI errors on his VPS, for the first time. I haven't been aware of issues on the node itself.
So I have to look into it some more today.
Thanks for helping!
I hope everyone gets the servers they want!
Yes.
Sounds right.
Yes, it seems that way. Via ssh I can see all the disks on the node, and the hardware RAID controller as well. I haven't physically seen the machine.
Thanks for helping!
I hope everyone gets the servers they want!
This is another good question!
There is a hardware RAID controller and 8 spinning rust disks in RAID 10. The RAID controller has tests. The RAID array status is reported as "Optimal."
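For what it's worth, I may also poll the individual drives behind the controller, since the array can report "Optimal" while one member is quietly degrading. A rough sketch, assuming a MegaRAID-family controller exposed as /dev/sda (both are assumptions; the -d type differs for other HBAs):

  # loop over the 8 physical drives behind the controller and pull the attributes that matter
  for i in 0 1 2 3 4 5 6 7; do
      echo "=== drive $i ==="
      smartctl -a -d megaraid,$i /dev/sda | grep -i -E 'serial|reallocated|pending|uncorrect'
  done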
I hope everyone gets the servers they want!
I guess today I ought to try some organized testing to figure out whether the node itself is also seeing I/O issues, or just the VMs.
I ought to look at the logs on the VMs with poor test results. Something tells me I'm not gonna find anything in the VM logs.
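Roughly the node-vs-VM comparison I have in mind: a minimal sketch that only approximates the yabs mixed 4k job, with the path, size, and runtime chosen arbitrarily:

  # run the same mixed 50/50 random job directly on the node, then inside a VM, and compare
  fio --name=mixed --filename=/root/fio-testfile --size=2G --direct=1 \
      --rw=randrw --rwmixread=50 --bs=4k --ioengine=libaio --iodepth=64 \
      --numjobs=2 --group_reporting --time_based --runtime=30
  rm -f /root/fio-testfile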
Hello iostat.
I hope everyone gets the servers they want!
Should show up in SMART data, I would think. Run one, redo the fio, run another, and see what moved.
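For instance, a rough way to snapshot the attributes that usually move on a dying spinner; /dev/sda here is just a placeholder for whichever drive you are checking:

  # take a snapshot before and after the fio run, then diff the two
  smartctl -A /dev/sda | grep -i -E 'reallocated|pending|uncorrect|crc' | tee smart-before.txt
  # ... redo the fio test ...
  smartctl -A /dev/sda | grep -i -E 'reallocated|pending|uncorrect|crc' | tee smart-after.txt
  diff smart-before.txt smart-after.txt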
In my case, I had to run a manual SMART test with the command sudo smartctl -t long /dev/sda before the bad sectors were registered. Before that I was only getting slow copy/write operations, with the SMART panel not reporting any issues... Probably had to do with the bad firmware on my drive...
Never make the same mistake twice. There are so many new ones to make.
It’s OK if you disagree with me. I can’t force you to be right.
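Following on from that smartctl suggestion: once the long test finishes, a rough way to see what it recorded (this assumes an ATA drive; SAS and NVMe output looks different):

  # overall health verdict, then the self-test log, including the LBA of the first error if one was found
  smartctl -H /dev/sda
  smartctl -l selftest /dev/sda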
After some testing, it seems that, following a recent update on the Ubuntu node,
Nevertheless,
Of course, post hoc doesn't mean propter hoc. So I'm not saying the recent update is a cause.
I'm still pretty confused. But, it's all a lot of fun! And, I get to increase my appreciation for the effort required to maintain virtualization systems!
It will be interesting to see what happens in the upcoming days!
I hope everyone gets the servers they want!
There is nothing wrong with it. You are just hitting cache limits.
See my post on OGF (I hate cross posting).
Spinning rust is not capable of these numbers anyway, so all you are seeing is the cache at work; whenever several layers clash and everything gets flushed, the real bottleneck comes to light.
TL;DR: you need to choose a better layout for cache, scheduler, and flushing behaviour to keep a better balance.
I can help with that!
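A rough sketch of where those knobs live on a typical Linux/KVM host; the device name and the libvirt snippet are placeholders rather than recommendations:

  # active I/O scheduler for the physical device (mq-deadline, bfq, none, ...)
  cat /sys/block/sda/queue/scheduler
  # writeback/flushing thresholds that decide when dirty pages get forced out
  sysctl vm.dirty_ratio vm.dirty_background_ratio
  # and on the KVM side the per-disk cache mode matters too, e.g. in the libvirt XML:
  #   <driver name='qemu' type='qcow2' cache='none' io='native'/>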
Never make the same mistake twice. There are so many new ones to make.
It’s OK if you disagree with me. I can’t force you to be right.