It all started innocently enough—my website, hosted on a virtual private server (VPS), began lagging at random times. The page loads would hang, database queries took ages, and visitors were bouncing because of painfully slow response times. Naturally, I went to my hosting provider’s support team. What followed was a frustrating blame game, with support insisting that my applications were poorly optimized. But after a bit of investigation and some low-level sleuthing with strace, I narrowed it down to a surprising culprit: their own block storage setup.
TL;DR
I experienced sudden performance issues on my VPS, and hosting support repeatedly blamed my stack, pointing to high I/O wait. However, using strace, I uncovered the real cause—significant delays in disk I/O pointing to problematic block storage latency. This article walks you through how I identified the issue and used evidence to compel the hosting provider to investigate and ultimately acknowledge their infrastructure problem. It’s a story every system administrator, DevOps engineer, or power user should read.
Understanding I/O Wait and Why It Matters
I/O wait is the percentage of time the CPU is idle while waiting for I/O operations to complete. That typically means the system is waiting for data from disk or network. In a healthy, well-optimized server, I/O wait should be relatively low—usually in the low single digits. When it spikes above 10–15%, something is wrong.
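You can check the figure yourself with the standard tools; a quick sketch (the sampling intervals below are arbitrary):

# the "wa" column is the share of CPU time spent waiting on I/O, sampled every 2 seconds
vmstat 2

# "%iowait" appears in the CPU utilization report, refreshed every 5 seconds
iostat -c 5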
When I contacted my hosting provider, they quickly checked my CPU and memory usage and then said something along the lines of:
“You’re experiencing high I/O wait due to inefficient application code or poorly configured services like MySQL or Apache. Please optimize your stack.”
They even pointed to utilities like top and iostat that showed high wa (I/O wait) percentages and suggested caching more aggressively with tools like Redis or Varnish.
But something didn’t add up. The server had been running the same stack for months without issues, and these new slowdowns seemed to happen sporadically, even during periods of low traffic.
When Support Becomes a Dead End
I tweaked MySQL buffer sizes, disabled unneeded cron jobs, and even temporarily shut down the site to eliminate traffic. The performance issues persisted. No matter how hard I tried to optimize things on my end, I kept hitting a wall.
Eventually, I started collecting more detailed metrics using tools like:
- vmstat – to monitor CPU wait times and context switches
- iotop – to see which processes were responsible for disk I/O
- dstat – for a real-time view of CPU, disk, and network stats
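For reference, these are roughly the invocations I leaned on; the flags and intervals are simply what suited this investigation:

# refresh every second; watch the "wa" column and the blocked-process ("b") count
vmstat 1

# show only processes actually doing I/O, with accumulated totals
iotop -o -a

# combined CPU, disk, and network view, sampled every 5 seconds
dstat -cdn 5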
The pattern became clear. Disk operations were slowing everything down, and it wasn’t due to application load.
This is where strace came into play.
The Power of strace: System Calls Don’t Lie
strace is a powerful Linux utility that lets you trace system calls and signals of a process. It shows when a process makes a call to the operating system—like opening a file, writing to disk, or waiting for input—and how long each of those calls takes. Essentially, it’s like peeking under the hood of your car while it’s running.
I used it first on a MySQL process and later on a simple script that performed read/write operations. Here’s where things got interesting. I noticed that disk-related system calls (e.g., read, write, fsync) were taking an unusually long time, sometimes several seconds for what should be millisecond operations.
write(3, "data", 4) = 4 <-- this is fine
fsync(3) = 0 <-- this took 3.2 seconds!
That fsync call, which flushes buffered data to disk, was a smoking gun. There’s no reason it should take more than a fraction of a second—unless the storage backend is holding things up.
And when I ran the process during off-peak times versus peak hours, I found that the delays fluctuated wildly, which pointed to something external to my VPS: most likely the block storage system shared with other customers.
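If you want to reproduce this kind of measurement, the timing comes from strace's -T flag, which appends the time spent inside each system call. A rough sketch of the sort of invocation I used, attaching to MySQL (the pidof lookup is just one way to find the process):

# follow the server's threads and time only the disk-related syscalls
strace -f -T -e trace=read,write,fsync -p "$(pidof mysqld)"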

Correlating with System-Wide Metrics
To build a more convincing case, I correlated these strace results with system-wide metrics:
- High I/O queue lengths from iostat -x (invocation sketched after this list)
- Increasing read/write latency patterns even during low CPU load
- Minimal traffic but slow response times, making it unlikely to be app-related
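For context, this is roughly what I was watching; the interval is arbitrary, and the exact column names vary slightly between sysstat versions:

# extended per-device statistics every 5 seconds; the interesting columns are
# r_await/w_await (request latency in ms), the queue size (avgqu-sz or aqu-sz),
# and %util for the device backing the filesystem
iostat -x 5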
Everything pointed toward disk latency being the root of the problem. To further test it, I created a dummy file and used a simple Python script to write and then flush data repeatedly, measuring the time taken for each operation.
import os
import time

with open("testfile", "w") as f:
    start = time.time()
    f.write("some test data\n")
    f.flush()               # push Python's buffer to the OS
    os.fsync(f.fileno())    # force the OS to commit the data to disk
    end = time.time()

print(f"Write and fsync took {end - start:.4f} seconds")
During peak hours, that operation took anywhere from 1.5 to 4 seconds. During off-hours—just 0.01 to 0.03 seconds. That variability screamed “shared resource bottleneck.”
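To capture that peak-versus-off-peak spread without babysitting a terminal, I left a small loop running; the script name, log file, and one-minute interval here are purely illustrative:

# run the timing script once a minute and log timestamped results
# (fsync_test.py is the script above, saved under an illustrative name)
while true; do
    echo "$(date) $(python3 fsync_test.py)" >> fsync_timings.log
    sleep 60
done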
Pushing Back with Evidence
Armed with logs, syscalls, and measurements from multiple tools, I composed a detailed support ticket. I included:
- Evidence from strace showing delayed disk operations
- Time-correlated iostat data showing high I/O wait and queue length
- Custom script outputs demonstrating block storage performance degradation
Only then did the tone change. Support escalated the case and admitted they were aware of degraded performance on one of their storage nodes. They moved my VPS to a different block device, and like magic, the performance issues vanished.
No more sluggish database responses. No more prolonged writes. CPU usage normalized. I/O wait dropped to manageable levels. Everything worked as expected again.
Key Lessons for VPS Users
Whether you’re running a simple blog, a SaaS app, or a production database, these are key takeaways:
- Hosting providers often assume the problem is your fault. Always collect your own metrics before accepting their verdict.
- Understand the difference between CPU usage and I/O wait. High CPU is often your code, but high I/O wait is usually the storage backend or file system behavior.
- strace is your friend. It provides undeniable proof of delays that support teams can't easily refute.
- Use correlative tools like iostat and vmstat. A single metric rarely tells the whole story.
Final Thoughts
Most of us are at the mercy of our hosting providers, especially when dealing with shared infrastructure. But that doesn’t mean we have to accept poor performance without question. By leaning into the diagnostics already available to us—especially tools like strace—we can cut through the noise, uncover the root cause, and hold our hosts accountable.
If you’ve ever had phantom lags, unexplained slowness, or support teams deflecting responsibility, don’t give up. Sometimes, all it takes is a well-timed strace to turn the tide.
