There's a relatively little known feature about Linux IO scheduling that has
a pretty significant effect in large scale database deployments at least with
MySQL that a recent article on
MySQL Performance Blog prompted me to write about. This may have an effect
on other databases and random I/O systems as well, but we've definitely seen it
with MySQL 5.0 on RHEL 4 platform. I have not studied this on RHEL 5, and since
the IO subsystem and the Completely Fair Queue scheduler that is default on
RHEL kernels has received further tuning since, I can not say if it still
exists.
Though I've heard YouTube discovered these
same things, I have not yet seen a simple explanation of why this is so -
so I'll take a shot at explaining it.
In short, a deployment with a RAID controller or external storage system
visible to the operating system as a single block device will not reach its
maximum performance under RHEL default settings, and can be easily coaxed about
20% higher on average random I/O (and significantly higher in spot benchmarks)
with a single kernel parameter (elevator=noop) or equivalent
runtime tuning via /sys/block/*/queue/scheduler in RHEL5, where you can also
set this on a per-device basis.
We first saw this in 2005
on a quad-CPU server with a RAID controller connected to 10 SCSI disks. At that
time, we found that configuring the RAID to expose five RAID-1 pairs which we
then striped to a single volume using LVM increased performance despite making
the OS and CPU do more work on I/O. The difference in performance was about
20%.
Our most recent proof of the same effect was a quad-CPU server connected to
a NetApp storage system over FC. Since it was not convenient to expose multiple
volumes from the NetApp to stripe them together, we searched for other
solutions, and prompted by a presentation by the YouTube engineers looked at
the I/O scheduling options and found a simple way to improve performance was to
turn off I/O reordering by the kernel. Again, the overall impact between the
settings was about 20%, though at times much greater.
The lesson is simple: reordering I/O requests multiple times provides no
benefits, and reordering them too early will in fact be detrimental. Explaining
why that is so is a bit involving, and is based on a few assumptions we have
not bothered to verify, since the empirical results have supported our
conclusions and got us where we wanted.
In order to keep the explanation simple, I will describe it conceptually on
a very small scale. When reading this, please take this into account and
understand that to measure the effect we have seen in practice, the size of the
solution should be increased from what I am describing.
First, consider the case of
direct-attached storage exposed to the Linux kernel as independent devices. In
this configuration, the kernel maintains a per-device I/O queue, and the CFQ
scheduler will reorder I/O requests to each device separately in order to
maintain fair per-process balancing, low latency and high throughput. This is
the configuration in which CFQ does a great job of maximizing performance, and
works fairly well with any amount of spindles. As the application (a database
in this case) fires random I/O, each of the spindles is executing them
independently and serves requests as soon as they are issued. In other words,
the system is good at keeping each of the I/O queues "hot". The sustained top
I/O rate is roughly linear to the number of spindles, or with 15k rpm drives,
about 1000 ops for four drives.
Now, lets introduce a hardware RAID of some sort, in particular one which is
enabled to further reorder operations thanks to a "big" battery-backed up
cache. Thanks to that cache, the RAID can commit thousands of write operations
per second for fairly long periods (seconds), flushing them to disk after
merging. On the other hand, the kernel now sees just one device, and has one
I/O queue to it. The CFQ scheduler sits in front of this queue, reordering
pending I/O requests. All is fine until the I/O pressure rises up to about what
a single spindle can process on a sustained basis, or about 250 requests per
second on those 15k drives. However, as soon as the queue starts building up,
the CFQ scheduler kicks into action, and reorders the queue from random to
sorted per block number (an oversimplification, but close enough).
All is good? No, it's not. The sequential
blocks on that RAID volume are not truly sequential, but reside on different
spindles and could thus be processed simultaneously. To demonstrate, lets
assume your four-spindle array has one billion sectors or five hundred gigs per
device, and further, that it is striped at 64k extents or 7.8 million stripes
across each device.
On both configurations, the striping is essentially the same. Every 128
sectors or 64k is one one device, then the next one, and so on. The difference
is that with LVM in place, the kernel knows this, while with the RAID, it has
no idea of the layout of the array, essentially treating it as a single
spindle.
Now, those couple of thousand request
that were just issued, contain sequences such as writes to sectors 10, 200, 50,
300, 1020, 600, 1500 and 700. Due to the striping, four of these can be
executed simultaneously, so the optimal order to issue these, of course
depending on what else might be going on, is something like 10, 200, 300, 1500,
50, 700, 1020, and 600, executed through four queues: [10, 50, 600], [200,
700], [300] and [1020, 1500]. In the LVM configuration this might be what
really happens. However, the single I/O queue to the RAID device will have
these sorted into ascending block order, and with enough such operations in the
queue, the RAID processor no longer has enough view to the queue to efficiently
re-re-order them to utilize all the spindles, so only some of them are hot at
any given time. TCQ should help, but
in practice it won't issue enough outstanding requests to fix the problem. In
our experience the top sustained rate is not more than 1.5 times one spindle,
or 300-400 requests per second, while the array should really run at over the
1000 ops per second thanks to the additional persistent cache on the RAID
controller.
Bottom line: CFQ is great, but only if the kernel actually knows everything
about the physical layout of the media. It also looks like some of the recently
introduced tuning parameters (which I know nothing about, just noted their
appearance) might help avoid the worst hit. However, ultimately it doesn't
matter - if your hardware allows efficient "outsourcing" of the I/O scheduling
to a large secure cache, use it, and don't bother making the kernel do the job
without all the information.
I hope this explanation makes sense, and that I haven't botched any
important details or made wrong assumptions. Please comment if any of this is
inaccurate.
PS. A tuning guide for Oracle
recommends the deadline scheduler due to latency guarantees. We have not
benchmarked that against noop.