The Problem with Generalizations: A Response to "The Problem with Benchmarks" by raindog308
I was going to post this in the Rants category, but would rather not confine my thoughts on this to only LES members (the Rants category requires you to be signed in to view the threads).
YABS – and many others like it over the years – attempts to produce a meaningful report to judge or grade the VM. It reports CPU type and other configuration information, then runs various disk, network, and CPU tests to inform the user if the VPS service he’s just bought is good, bad, or middling. But does it really?
I stumbled upon a blog post recently from raindog308 on the Other Green Blog and was amused that YABS was called out. Raindog states that YABS (and other benchmark scripts/tests like it) may be lacking in its ability to "produce a meaningful report to judge or grade the VM". Some of the reasons given for discrediting it, as well as the proposed alternatives, had me scratching my head. I notice that raindog has been hard at work lately pumping up LEB with good content. But is he really?
I'm going to cherry pick some quotes and arguments to reply to below -
It’s valid to check CPU and disk performance for outliers. We’ve all seen overcrowded nodes. Hopefully, network performance is checked prior to purchase through test files and Looking Glass.
I'd argue that not all providers have readily available test files for download and/or an LG. It can also be misleading when hosts simply link to their upstream's test files, or host their LG on a different machine whose hardware, usage patterns, and port speeds don't match what the end-user's VM will actually see. However, the point is noted: do some due diligence and research the provider a bit, as that's certainly important.
I'd also argue that iperf (which YABS uses for the network tests) is far more capable than a simple test file/LG. If all you care about is a quick, single-threaded, single-direction HTTP download, then sure, use the test file to your heart's content. BUT if you actually care about overall capacity and throughput to different areas of the world in BOTH directions (upload + download), then a multi-threaded, bi-directional iperf test can be much more telling of overall performance.
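For the curious, here's a rough sketch of what that kind of check looks like with iperf3 (the server hostname below is a placeholder, not a real endpoint -- you'd substitute one of the public iperf3 servers, which is roughly what YABS does for you):

```shell
# Hypothetical iperf3 endpoint -- substitute a real public iperf3 server.
SERVER=speedtest.example.net

# Upload: 8 parallel TCP streams from the VM to the server for 10 seconds.
# Parallel streams expose total link capacity better than a single stream.
iperf3 -c "$SERVER" -P 8 -t 10

# Download: -R reverses direction, so the server sends and the VM receives.
iperf3 -c "$SERVER" -P 8 -t 10 -R
```

Running both directions against servers in a few different regions gives a much better picture of real-world throughput than one HTTP test file ever could.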
But other than excluding ancient CPUs and floppy-drive-level performance, is the user really likely to notice a difference day-in and day-out between a 3.3Ghz CPU and a 3.4Ghz one? Particularly since any operation touches many often-virtualized subsystems.
I actually laughed out loud at this comment. I guess I didn't realize that people use benchmarking scripts/tools to differentiate between a "3.3Ghz CPU and a 3.4Ghz one"... (Narrator: "they don't").
Providers have different ways that they fill up their nodes -- overselling CPU, disk space, network capacity, etc. is, more often than not, mandatory to keep prices low. Most (all?) providers are doing this in some form and the end-user most of the time is none the wiser as long as the ratios are done right and the end-user has resources available to meet their workload.
A benchmarking script/tool does help identify cases where the provider's nodes are oversubscribed to excess -- it's immediately obvious if disk speeds, network speeds, or CPU performance are drastically lower than they should be for the advertised hardware. Could this be a fluke and resolve itself just a few minutes later? Certainly possible. On the flip side, could performance of a system with good benchmark results devolve into complete garbage minutes/hours/days after the test is run? Certainly possible as well. Multiple runs of a benchmark tool spread out over the course of hours, days, or weeks can help identify whether either of these cases is true.
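To make that concrete, here's a minimal sketch (not part of YABS itself; the command, run count, and interval are all placeholders you'd adjust) of spacing benchmark runs out and logging them with timestamps, so a one-off fluke stands out against the trend:

```shell
#!/bin/sh
# Run a benchmark command several times at a fixed interval, appending
# timestamped output to a log so trends (or flukes) become visible.
CMD="${CMD:-echo sample-benchmark-run}"   # placeholder; e.g. your benchmark script
RUNS="${RUNS:-3}"
INTERVAL="${INTERVAL:-0}"                 # seconds between runs, e.g. 21600 for 6h
LOG="${LOG:-bench.log}"

i=1
while [ "$i" -le "$RUNS" ]; do
    printf '=== run %s at %s ===\n' "$i" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$LOG"
    sh -c "$CMD" >> "$LOG" 2>&1
    if [ "$i" -lt "$RUNS" ]; then
        sleep "$INTERVAL"
    fi
    i=$((i + 1))
done
```

Diffing the logged results over a week tells you far more about how oversold a node is than any single run.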
On a personal note, I've seen dozens of instances where customers of various providers post their benchmark results and voice concerns about system performance. And a large percentage of the time, the provider is then able to rectify the issues by fixing hardware problems, identifying abusers on the same node who are impacting performance, etc. From this, patterns start to emerge: you can see which providers take criticism (via posts containing low-performing benchmarks) as a chance to improve their services and ensure their customers are happy with the resources they paid for. Other trends help identify providers to avoid, where consistently low network speeds, CPU scores, etc. go unaddressed, indicating unhealthy overselling. But I digress...
If I could write a benchmark suite, here is what I would like it to report.
Here we get a rapid-fire list of unquantifiable metrics that would be in raindog's ideal benchmarking suite:
- "Moldy Oldies" (outdated VM templates)
- "Previous Residents" (previous owners of IPs)
- "My Neighbors" (anybody doing shitcoin mining on the same node)
Raindog realizes it'd be hard to get at these metrics -
Unfortunately, all of these things are impossible to quantify in a shell script.
They'd be impossible to quantify by any means (shell script or fortune teller)... Almost all of the above metrics are subject to personal opinions and preference. Some of these can be investigated by other means:
- Reliability: check the provider's public status page (if available).
- Problems: search for public threads of people having issues and note how the provider responds/resolves them.
- Moldy oldies: a simple message/pre-sales ticket to the provider could alleviate that concern.
Anyways, the above metrics are highly subjective and of varying importance to prospective buyers (someone might not give a shit about support response times or whether their neighbor is having a party on their VM).
But what's something that everyone is actually concerned about? How the advertised VM actually performs.
And how do we assess system performance in a non-subjective manner? With benchmarking tests.
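You don't even need a full suite to see the principle. As a crude illustration (a sketch only -- YABS itself leans on purpose-built tools like fio, iperf3, and Geekbench, which are far more rigorous), even a couple of one-liners produce numbers you can compare across runs and across providers:

```shell
# Sequential write throughput: 128 MiB of zeros, flushed with fdatasync so
# the page cache doesn't inflate the number (GNU dd assumed; stats go to stderr).
dd if=/dev/zero of=bench.tmp bs=1M count=128 conv=fdatasync 2> dd_result.txt
cat dd_result.txt
rm -f bench.tmp

# Crude single-core CPU check: how long does hashing 64 MiB of zeros take?
start=$(date +%s)
dd if=/dev/zero bs=1M count=64 2>/dev/null | sha256sum > /dev/null
end=$(date +%s)
echo "hashed 64 MiB in $((end - start))s"
```

Crude or not, a number is something two people can compare; "the support felt slow" is not.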
If you ask me which providers I recommend, the benchmarks that result from VMs on their nodes are not likely to factor into my response. Rather, I’ll point to the provider’s history and these “unquantifiables” as major considerations on which providers to choose.
That's great and, in fact, I somewhat agree here. Being an active member of the low end community, I have the luxury of knowing, based on their track records, which providers run a tight ship and care about their customers and which ones don't. But not everyone has the time to dig through thousands of threads to assess a provider's support or reliability, and not everyone in the low end community has been around long enough to form opinions and differentiate between "good" and "bad" providers.
I also found it highly amusing that a related post on the same page links to another post regarding a new benchmarking series by jsg. That post is also written by raindog. The framing is a bit different there: benchmarks aren't represented as entirely useless or lacking in their ability to "produce a meaningful report to judge or grade the VM". I'm not really sure what happened in the few months between those two posts (both discuss the limitations of benchmarking tools), but I'd just like to note the change in tone.
My main point in posting this "response" (read: rant) is that benchmark tests aren't and shouldn't be the all-in-one source for determining if a provider and a VM are right for you. I don't advertise the YABS project in that manner. In the description of the tool I even state that YABS is "just yet another bench script to add to your arsenal." So I'm not really sure of the intent of raindog's blog post. Should users not be happy when they score a sweet $7/year server that has top-notch performance and post a corresponding benchmark showing how sexy it is? Should users not test their system to see if they are actually getting the advertised performance that they are expecting? Should users not use benchmarking tools as a means of debugging and bringing a provider's attention to any resource issues? Those "unquantifiables" that are mentioned certainly won't help you out there.
This response is now much longer than the original blog post that I'm responding to, so I'll stop here.
Happy to hear anyone else's thoughts or face the pitchforks on this one.