Inconsistent fio results on similar VPSes on the same node
I'm seeing some inconsistency in the fio tests from yabs runs done today on similar-spec VPSes on the same node. Please see three examples below.
I'm unclear on what's happening: whether it's related to any single VPS that I happen to be testing, to some other VPS or node process doing heavy file I/O at certain times, or to something else entirely.
I've been watching iotop -b -d 3 -o a little. So far, no obvious insight.
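If it helps, this is roughly the batch logging I plan to try next; a minimal sketch, with the log path and iteration count chosen arbitrarily:

  # log only processes doing I/O, with timestamps, every 3 s for ~10 minutes
  iotop -b -o -t -qqq -d 3 -n 200 > /tmp/iotop.log
  # then look for anything reading or writing in the MB/s or GB/s range
  grep -E '[0-9.]+ (M|G)/s' /tmp/iotop.log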
Ideas? Thanks!
fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 193.28 MB/s  (48.3k) | 1.78 GB/s    (27.8k)
Write      | 193.79 MB/s  (48.4k) | 1.79 GB/s    (28.0k)
Total      | 387.07 MB/s  (96.7k) | 3.57 GB/s    (55.9k)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 2.12 GB/s     (4.1k) | 2.18 GB/s     (2.1k)
Write      | 2.23 GB/s     (4.3k) | 2.33 GB/s     (2.2k)
Total      | 4.35 GB/s     (8.5k) | 4.51 GB/s     (4.4k)
fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 193.53 MB/s  (48.3k) | 1.95 GB/s    (30.4k)
Write      | 194.04 MB/s  (48.5k) | 1.96 GB/s    (30.6k)
Total      | 387.57 MB/s  (96.8k) | 3.91 GB/s    (61.1k)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 957.00 KB/s      (1) | 18.49 MB/s      (18)
Write      | 1.12 MB/s        (2) | 20.26 MB/s      (19)
Total      | 2.07 MB/s        (3) | 38.75 MB/s      (37)
fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 16.51 MB/s    (4.1k) | 1.84 GB/s    (28.7k)
Write      | 16.52 MB/s    (4.1k) | 1.85 GB/s    (28.9k)
Total      | 33.03 MB/s    (8.2k) | 3.69 GB/s    (57.7k)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 2.11 GB/s     (4.1k) | 2.19 GB/s     (2.1k)
Write      | 2.22 GB/s     (4.3k) | 2.34 GB/s     (2.2k)
Total      | 4.34 GB/s     (8.4k) | 4.53 GB/s     (4.4k)
I hope everyone gets the servers they want!
Comments
Does this performance drop happen for all I/O to the physical drives on the host machine, or is it only some virtual machines that see the I/O drop occasionally? Try checking with iostat (rough example below).
Check our KVM VPS plans in 🇵🇱 Warsaw, Poland and 🇸🇪 Stockholm, Sweden
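A rough example of the kind of iostat check suggested above; this assumes the sysstat package is installed, and the device names are just examples:

  # extended per-device stats every 5 seconds, skipping idle devices
  iostat -xz 5
  # on the host, watch %util and await/r_await/w_await on the physical drives and the RAID device;
  # inside a VM, watch the same columns for vda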
Do you have access to the node itself?
I reckon it's an actual flaky hardware issue.
512k reads returning 957.00 KB/s sometimes and 2 GB/s other times doesn't feel like a noisy neighbor issue. And even within the VPS, the numbers for #2 don't make sense. More throughput for 64k than for 1m?
Are you sure that even if on same node they're on same storage backend?
I had a failing hard drive that caused a similar issue, so I'm inclined to go with @havoc's suggestion. If you have allocated space for your VMs such that one or two of the VMs' data partitions fall within the part of the drive with bad blocks, you can run into this issue. Do note that this only applies if you are on a legacy HDD and not an SSD.
Never make the same mistake twice. There are so many new ones to make.
It’s OK if you disagree with me. I can’t force you to be right.
Definitely a good question whether the host machine is affected, or only the hosted virtual machines.
One of the pretty smart and well-experienced users told me this morning that he has recently started seeing SCSI errors on his VPS, for the first time. I haven't been aware of issues on the node itself.
So I have to look into it some more today.
Thanks for helping!
I hope everyone gets the servers they want!
Yes.
Sounds right.
Yes, it seems that way. Via ssh I can see all the disks on the node, and the hardware RAID controller as well. I haven't physically seen the machine.
Thanks for helping!
I hope everyone gets the servers they want!
This is another good question!
There is a hardware RAID controller and 8 spinning rust disks in RAID 10. The RAID controller has tests. The RAID array status is reported as "Optimal."
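For what it's worth, I may also poll the individual drives behind the controller, since the array can report "Optimal" while one member is quietly degrading. A rough sketch, assuming a MegaRAID-family controller exposed as /dev/sda (both are assumptions; the -d type differs for other HBAs):

  # loop over the 8 physical drives behind the controller and pull the attributes that matter
  for i in 0 1 2 3 4 5 6 7; do
      echo "=== drive $i ==="
      smartctl -a -d megaraid,$i /dev/sda | grep -i -E 'serial|reallocated|pending|uncorrect'
  done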
I hope everyone gets the servers they want!
I guess today I ought to try some organized testing to figure out whether the node itself is also seeing I/O issues, or just the VMs.
I ought to look at the logs on the VMs with poor test results. Something tells me I'm not gonna find anything in the VM logs.
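Roughly the node-vs-VM comparison I have in mind: a minimal sketch that only approximates the yabs mixed 4k job, with the path, size, and runtime chosen arbitrarily:

  # run the same mixed 50/50 random job directly on the node, then inside a VM, and compare
  fio --name=mixed --filename=/root/fio-testfile --size=2G --direct=1 \
      --rw=randrw --rwmixread=50 --bs=4k --ioengine=libaio --iodepth=64 \
      --numjobs=2 --group_reporting --time_based --runtime=30
  rm -f /root/fio-testfile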
Hello iostat.
I hope everyone gets the servers they want!
Should show up in SMART data, I would think. Run one, redo the fio, run another, and see what moved.
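For instance, a rough way to snapshot the attributes that usually move on a dying spinner; /dev/sda here is just a placeholder for whichever drive you are checking:

  # take a snapshot before and after the fio run, then diff the two
  smartctl -A /dev/sda | grep -i -E 'reallocated|pending|uncorrect|crc' | tee smart-before.txt
  # ... redo the fio test ...
  smartctl -A /dev/sda | grep -i -E 'reallocated|pending|uncorrect|crc' | tee smart-after.txt
  diff smart-before.txt smart-after.txt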
In my case, I had to run a manual SMART test with the command sudo smartctl -t long /dev/sda before the bad sectors were registered. Before that I was only getting slow copy/write operations, with the SMART panel not reporting any issues... Probably had to do with the bad firmware on my drive...
Never make the same mistake twice. There are so many new ones to make.
It’s OK if you disagree with me. I can’t force you to be right.
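Following on from that smartctl suggestion: once the long test finishes, a rough way to see what it recorded (this assumes an ATA drive; SAS and NVMe output looks different):

  # overall health verdict, then the self-test log, including the LBA of the first error if one was found
  smartctl -H /dev/sda
  smartctl -l selftest /dev/sda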
After some testing, it seems that, following a recent update on the Ubuntu node,
Nevertheless,
Of course, post hoc doesn't mean propter hoc. So I'm not saying the recent update is a cause.
I'm still pretty confused. But, it's all a lot of fun! And, I get to increase my appreciation for the effort required to maintain virtualization systems!
It will be interesting to see what happens in the upcoming days!
I hope everyone gets the servers they want!
There is nothing wrong with it. You are just hitting cache limits.
See my post on OGF (I hate cross posting).
Spinning rust is not capable of these numbers anyway, so all you are seeing is the cache at work; whenever several layers clash and everything gets flushed, the real bottleneck comes to light.
TL;DR: you need to choose a better layout for cache, scheduler, and flushing behaviour to keep a better balance.
I can help with that!
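A rough sketch of where those knobs live on a typical Linux/KVM host; the device name and the libvirt snippet are placeholders rather than recommendations:

  # active I/O scheduler for the physical device (mq-deadline, bfq, none, ...)
  cat /sys/block/sda/queue/scheduler
  # writeback/flushing thresholds that decide when dirty pages get forced out
  sysctl vm.dirty_ratio vm.dirty_background_ratio
  # and on the KVM side the per-disk cache mode matters too, e.g. in the libvirt XML:
  #   <driver name='qemu' type='qcow2' cache='none' io='native'/>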
Never make the same mistake twice. There are so many new ones to make.
It’s OK if you disagree with me. I can’t force you to be right.