Chapter 4. Operating System and Hardware Optimization
Your MySQL server can perform only as well as its weakest link, and the operating system and hardware on which it runs are often limiting factors. The disk size, the available memory and CPU resources, the network, and the components that link them all limit the system’s ultimate capacity. Thus, you need to choose your hardware carefully and configure the hardware and operating system appropriately. For example, if your workload is I/O bound, one approach is to design your application to minimize MySQL’s I/O workload. However, it’s often smarter to upgrade the I/O subsystem, install more memory, or reconfigure existing disks. If you’re running in a cloud-hosted environment, the information in this chapter can still be very useful, especially for understanding filesystem limitations and Linux I/O schedulers.
What Limits MySQL’s Performance?
Many different hardware components can affect MySQL’s performance, but the most frequent bottleneck we see is CPU exhaustion. CPU saturation can happen when MySQL tries to execute too many queries in parallel or when a smaller number of queries runs for too long on the CPU.
I/O saturation can still happen but much less frequently than CPU exhaustion. This is largely because of the transition to using solid-state drives (SSDs). Historically, the performance penalty of no longer working in memory and going to the hard disk drive (HDD) was extreme. SSDs are generally 10 to 20 times faster than HDDs. Nowadays, if queries need to hit disk, you’re still going to see decent performance from them.
Memory exhaustion can still happen but usually only when you try to allocate too much memory to MySQL. We talk about optimal configuration settings to prevent this in “Configuring Memory Usage” in Chapter 5.
How to Select CPUs for MySQL
You should consider whether your workload is CPU bound when upgrading current hardware or purchasing new hardware. You can identify a CPU-bound workload by checking the CPU utilization, but instead of looking only at how heavily your CPUs are loaded overall, look at the balance of CPU usage and I/O for your most important queries, and notice whether the CPUs are loaded evenly.
Broadly speaking, you have two goals for your server:
- Low latency (fast response time)
- To achieve this, you need fast CPUs because each query will use only a single CPU.
- High throughput
- If you can run many queries at the same time, you might benefit from multiple CPUs to service the queries.
If your workload doesn’t utilize all of your CPUs, MySQL can still use the extra CPUs for background tasks such as purging InnoDB buffers, network operations, and so on. However, these jobs are usually minor compared to executing queries.
Balancing Memory and Disk Resources
The main reason to have a lot of memory isn’t so you can hold a lot of data in memory: it’s ultimately so you can avoid disk I/O, which is orders of magnitude slower than accessing data in memory. The trick is to balance the memory and disk size, speed, cost, and other qualities so you get good performance for your workload.
Caching, Reads, and Writes
If you have enough memory, you can insulate the disk from read requests completely. If all your data fits in memory, every read will be a cache hit once the server’s caches are warmed up. There will still be logical reads from memory but no physical reads from disk. Writes are a different matter, though. A write can be performed in memory just as a read can, but sooner or later it has to be written to the disk so it’s permanent. In other words, a cache can delay writes, but caching cannot eliminate writes as it can for reads.
In fact, in addition to allowing writes to be delayed, caching can permit them to be grouped together in two important ways:
- Many writes, one flush
- A single piece of data can be changed many times in memory without all of the new values being written to disk. When the data is eventually flushed to disk, all the modifications that happened since the last physical write are permanent. For example, many statements could update an in-memory counter. If the counter is incremented one hundred times and then written to disk, one hundred modifications have been grouped into one write.
- I/O merging
- Many different pieces of data can be modified in memory, and the modifications can be collected together, so the physical writes can be performed as a single disk operation.
This is why many transactional systems use a write-ahead logging strategy. Write-ahead logging lets them make changes to the pages in memory without flushing the changes to disk, which usually involves random I/O and is very slow. Instead, they write a record of the changes to a sequential logfile, which is much faster. A background thread can flush the modified pages to disk later; when it does, it can optimize the writes.
Writes benefit greatly from buffering because it converts random I/O into more sequential I/O. Asynchronous (buffered) writes are typically handled by the operating system and are batched so they can be flushed to disk more optimally. Synchronous (unbuffered) writes have to be written to disk before they finish. That’s why they benefit from buffering in a Redundant Array of Inexpensive Disks (RAID) controller’s battery-backed write-back cache (we discuss RAID a bit later).
What’s Your Working Set?
Every application has a “working set” of data—that is, the data that it really needs to do its work. A lot of databases also have plenty of data that is not in the working set. You can imagine the database as a desk with filing drawers. The working set consists of the papers you need to have on the desktop to get your work done. The desktop represents main memory in this analogy, while the filing drawers are the hard disks. Just as you don’t need to have every piece of paper on the desktop to get your work done, you don’t need the whole database to fit in memory for optimal performance—just the working set.
When dealing with HDDs, it was good practice to try to find an effective memory-to-disk ratio. This was largely due to the slower latency and low input/output operations per second (IOPS) of HDDs. With SSDs, the memory-to-disk ratio becomes far less important.
Solid-State Storage
Solid-state (flash) storage is the standard for most database systems, especially online transaction processing (OLTP). Only on very large data warehouses or legacy systems would you typically find HDDs. This shift came as the price of SSDs dropped significantly around 2015.
Solid-state storage devices use nonvolatile flash memory chips composed of cells instead of magnetic platters. They’re also called nonvolatile random access memory (NVRAM). They have no moving parts, which makes them behave very differently than hard drives.
Here’s a quick summary of flash performance. High-quality flash devices have:
- Much better random read and write performance compared to hard drives
- Flash devices are usually slightly better at reads than writes.
- Better sequential read and write performance than hard drives
- However, it’s not as dramatic an improvement as that of random I/O because hard drives are much slower at random I/O than they are at sequential I/O.
- Much better support for concurrency than hard drives
- Flash devices can support many more concurrent operations, and in fact, they don’t really achieve their top throughput until you have lots of concurrency.
The most important things are improvements in random I/O and concurrency. Flash memory gives you very good random I/O performance at high concurrency.
An Overview of Flash Memory
Hard drives with spinning platters and oscillating heads had inherent limitations and characteristics that are consequences of the physics involved. The same is true of solid-state storage, which is built on top of flash memory. Don’t get the idea that solid-state storage is simple. It’s actually more complex than a hard drive in some ways. The limitations of flash memory are pretty severe and hard to overcome, so the typical solid-state device has an intricate architecture with lots of abstractions, caching, and proprietary “magic.”
The most important characteristic of flash memory is that it can be read many times rapidly and in small units, but writes are much more challenging. A cell can’t be rewritten without a special erase operation and can only be erased in large blocks—for example, 512 KB. The erase cycle is slow and eventually wears out the block. The number of erase cycles a block can tolerate depends on the underlying technology it uses—more about this later.
The limitations on writes are the reason for the complexity of solid-state storage. This is why some devices provide stable, consistent performance and others don’t. The magic is all in the proprietary firmware, drivers, and other bits and pieces that make a solid-state device run. To make write operations perform well and avoid wearing out the blocks of flash memory prematurely, the device must be able to relocate pages and perform garbage collection and so-called wear leveling. The term write amplification is used to describe the additional writes caused by moving data from place to place, writing data and metadata multiple times due to partial block writes.
Garbage Collection
Garbage collection is important to understand. To keep some blocks fresh and ready for new writes, the device reclaims blocks. This requires some free space on the device. Either the device will have some reserved space internally that you can’t see or you will need to reserve space yourself by not filling it up all the way; this varies from device to device. Either way, as the device fills up, the garbage collector has to work harder to keep some blocks clean, so the write amplification factor increases.
As a result, many devices get slower as they fill up. How much slower is different for every vendor and model and depends on the device’s architecture. Some devices are designed for high performance even when they are pretty full, but in general, a 100 GB file will perform differently on a 160 GB SSD than on a 320 GB SSD. The slowdown is caused by having to wait for erases to complete when there are no free blocks. A write to a free block takes a couple of hundred microseconds, but an erase is much slower—typically a few milliseconds.
RAID Performance Optimization
Storage engines often keep their data and/or indexes in single large files, which means RAID is usually the most feasible option for storing a lot of data. RAID can help with redundancy, storage size, caching, and speed. But as with the other optimizations we’ve been looking at, there are many variations on RAID configurations, and it’s important to choose one that’s appropriate for your needs.
We won’t cover every RAID level here, or go into the specifics of exactly how the different RAID levels store data. Instead, we focus on how RAID configurations satisfy a database server’s needs. These are the most important RAID levels:
- RAID 0
- RAID 0 is the cheapest and highest-performance RAID configuration, at least when you measure cost and performance simplistically (if you include data recovery, for example, it starts to look more expensive). Because it offers no redundancy, we do not think RAID 0 is ever appropriate on a production database, but if you are truly looking to save costs, it can be a reasonable choice in development environments where a full server failure does not turn into an incident.
Again, note that RAID 0 does not provide any redundancy, even though “redundant” is the R in the RAID acronym. In fact, the probability of a RAID 0 array failing is actually higher than the probability of any single disk failing, not lower!
- RAID 1
- RAID 1 offers good read performance for many scenarios, and it duplicates your data across disks, so there’s good redundancy. RAID 1 is a little bit faster than RAID 0 for reads. It’s good for servers that handle logging and similar workloads because sequential writes rarely need many underlying disks to perform well (as opposed to random writes, which can benefit from parallelization). It is also a typical choice for low-end servers that need redundancy but have only two hard drives.
RAID 0 and RAID 1 are very simple, and they can often be implemented well in software. Most operating systems will let you create software RAID 0 and RAID 1 volumes easily.
- RAID 5
- RAID 5 used to be quite scary for database systems, largely due to the performance implications. With SSDs becoming commonplace, it’s now a viable option. It spreads the data across many disks with distributed parity blocks so that if any one disk fails, the data can be rebuilt from the parity blocks. If two disks fail, the entire volume will fail unrecoverably. In terms of cost per unit of storage, it’s the most economical redundant configuration because you lose only one disk’s worth of storage space across the entire array.
The biggest “gotcha” with RAID 5 is how the array performs if a disk fails. This is because the data has to be reconstructed by reading all the other disks. This affected performance severely on HDDs, which is why it was generally discouraged. It was even worse if you had lots of disks. If you try to keep the server online during the rebuild, don’t expect either the rebuild or the array’s performance to be good. Other performance costs included limited scalability because of the parity blocks—RAID 5 doesn’t scale well past 10 disks or so—and caching issues. Good RAID 5 performance depends heavily on the RAID controller’s cache, which can conflict with the database server’s needs. As we mentioned earlier, SSDs offer substantially improved performance in terms of IOPS and throughput, and the problem of poor random read/write performance is also gone.
One of the mitigating factors for RAID 5 is that it’s so popular. As a result, RAID controllers are often highly optimized for RAID 5, and despite the theoretical limits, smart controllers that use caches well can sometimes perform nearly as well as RAID 10 controllers for some workloads. This might actually reflect that the RAID 10 controllers are less highly optimized, but regardless of the reason, this is what we’ve seen.
- RAID 6
- The largest issue with RAID 5 was that the loss of two disks was catastrophic. The more disks you have in your array, the higher the probability of disk failure. RAID 6 helps to curb the failure possibility by adding a second parity disk. This allows you to sustain two disk failures and still rebuild the array. The downside is that calculating the additional parity will make writes slower than RAID 5.
- RAID 10
- RAID 10 is a very good choice for data storage. It consists of mirrored pairs that are striped, so it scales both reads and writes well. It is fast and easy to rebuild, in comparison to RAID 5. It can also be implemented in software fairly well.
The performance loss when one hard drive goes out can still be significant because that stripe can become a bottleneck. Performance can degrade by up to 50%, depending on the workload. One thing to watch out for is RAID controllers that use a “concatenated mirror” implementation for RAID 10. This is suboptimal because of the absence of striping: your most frequently accessed data might be placed on only one pair of disks instead of being spread across many, so you’ll get poor performance.
- RAID 50
- RAID 50 consists of RAID 5 arrays that are striped, and it can be a good compromise between the economy of RAID 5 and the performance of RAID 10 if you have many disks. This is mainly useful for very large data sets, such as data warehouses or extremely large OLTP systems.
Table 4-1 summarizes the various RAID configurations.
RAID Failure, Recovery, and Monitoring
RAID configurations (with the exception of RAID 0) offer redundancy. This is important, but it’s easy to underestimate the likelihood of concurrent disk failures. You shouldn’t think of RAID as a strong guarantee of data safety.
RAID doesn’t eliminate—or even reduce—the need for backups. When there is a problem, the recovery time will depend on your controller, the RAID level, the array size, the disk speed, and whether you need to keep the server online while you rebuild the array.
There is a chance of disks failing at exactly the same time. For example, a power spike or overheating can easily kill two or more disks. What’s more common, however, is two disk failures happening close together. Many such issues can go unnoticed. A common cause is corruption on the physical media holding data that is seldom accessed. This might go undetected for months, until either you try to read the data or another drive fails and the RAID controller tries to use the corrupted data to rebuild the array. The larger the hard drive is, the more likely this is.
That’s why it’s important to monitor your RAID arrays. Most controllers offer some software to report on the array’s status, and you need to keep track of this because you might otherwise be totally ignorant of a drive failure. You might miss your opportunity to recover the data and discover the problem only when a second drive fails, and then it’s too late. You should configure a monitoring system to alert you when a drive or volume changes to a degraded or failed status.
You can mitigate the risk of latent corruption by actively checking your arrays for consistency at regular intervals. Background Patrol Read, a feature of some controllers that checks for damaged media and fixes it while all the drives are online, can also help avert such problems. As with recovery, extremely large arrays can be slow to check, so make sure you plan accordingly when you create large arrays.
You can also add a hot spare drive, which is unused and configured as a standby for the controller to automatically use for recovery. This is a good idea if you depend on every server. It’s expensive with servers that have only a few hard drives because the cost of having an idle disk is proportionately higher, but if you have many disks, it’s almost foolish not to have a hot spare. Remember that the probability of a drive failure increases rapidly with more disks.
In addition to monitoring your drives for failures, you should monitor the RAID controller’s battery backup unit and write cache policy. If the battery fails, by default most controllers will disable write caching by changing the cache policy to write-through instead of write-back. This can cause a severe drop in performance. Many controllers will also periodically cycle the battery through a learning process, during which time the cache is also disabled. Your RAID controller’s management utility should let you view and configure when the learning cycle is scheduled so that it doesn’t catch you off guard. Newer RAID controllers avoid this by using a flash-backed cache that uses NVRAM to store uncommitted writes instead of a battery-backed cache. This avoids the entire pain of the learning cycle.
You might also want to benchmark your system with the cache policy set to write-through so you’ll know what to expect. The preferred approach is to schedule your battery learning cycles at low-traffic periods, typically at night or during the weekend. If performance suffers badly enough with write-through at any time, you could also fail over to another server before your learning cycle begins. As a very last resort, you could reconfigure your servers by changing the innodb_flush_log_at_trx_commit and sync_binlog variables to lower durability settings. This will reduce the disk utilization during write-through and may offer acceptable performance; however, this should really be done as a last resort. Reducing durability has a big impact on how much data you may lose during a database crash and your ability to recover it.
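If you do reach that point, both variables are dynamic, so the change can be made at runtime and reverted once the learning cycle finishes. Here is a minimal sketch using the mysql command-line client; adjust the connection options for your environment:
# Temporarily relax durability; a crash in this window can lose recent transactions.
$ mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2; SET GLOBAL sync_binlog = 0;"
# ...battery learning cycle runs...
# Restore full durability as soon as the cache is back in write-back mode.
$ mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1; SET GLOBAL sync_binlog = 1;"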
RAID Configuration and Caching
You can usually configure the RAID controller itself by entering its setup utility during the machine’s boot sequence or by running it from the command prompt. Although most controllers offer a lot of options, the two we focus on are the chunk size for striped arrays and the on-controller cache (also known as the RAID cache; we use the terms interchangeably).
The RAID stripe chunk size
The optimal stripe chunk size is workload and hardware specific. In theory, it’s good to have a large chunk size for random I/O because that means more reads can be satisfied from a single drive.
To see why this is so, consider the size of a typical random I/O operation for your workload. If the chunk size is at least that large and the data doesn’t span the border between chunks, only a single drive needs to participate in the read. But if the chunk size is smaller than the amount of data to be read, there’s no way to avoid involving more than one drive in the read.
So much for theory. In practice, many RAID controllers don’t work well with large chunks. For example, the controller might use the chunk size as the cache unit in its cache, which could be wasteful. The controller might also match the chunk size, cache size, and read-unit size (the amount of data it reads in a single operation). If the read unit is too large, its cache might be less effective, and it might end up reading a lot more data than it really needs, even for tiny requests.
It’s also hard to know whether any given piece of data will span multiple drives. Even if the chunk size is 16 KB, which matches InnoDB’s page size, you can’t be certain all of the reads will be aligned on 16 KB boundaries. The filesystem might fragment the file, and it will typically align the fragments on the filesystem block size, which is often 4 KB. Some filesystems might be smarter, but you shouldn’t count on it.
The RAID cache
The RAID cache is a (relatively) small amount of memory that is physically installed on a hardware RAID controller. It can be used to buffer data as it travels between the disks and the host system. Here are some of the reasons a RAID card might use the cache:
- Caching reads
- After the controller reads some data from the disks and sends it to the host system, it can store the data; this will enable it to satisfy future requests for the same data without having to go to disk again.
This is usually a very poor use of the RAID cache. Why? Because the operating system and the database server have their own much larger caches. If there’s a cache hit in one of these caches, the data in the RAID cache won’t be used. Conversely, if there’s a miss in one of the higher-level caches, the chance that there’ll be a hit in the RAID cache is vanishingly small. Because the RAID cache is so much smaller, it will almost certainly have been flushed and filled with other data, too. Either way you look at it, it’s a waste of memory to cache reads in the RAID cache.
- Caching read-ahead data
- If the RAID controller notices sequential requests for data, it might decide to do a read-ahead read—that is, to prefetch data it predicts will be needed soon. It has to have somewhere to put the data until it’s requested, though. It can use the RAID cache for this. The performance impact of this can vary widely, and you should check to ensure it’s actually helping. Read-ahead operations might not help if the database server is doing its own smart read-ahead (as InnoDB does), and it might interfere with the all-important buffering of synchronous writes.
- Caching writes
- The RAID controller can buffer writes in its cache and schedule them for a later time. The advantage to doing this is twofold: first, it can return “success” to the host system much more quickly than it would be able to if it had to actually perform the writes on the physical disks, and second, it can accumulate writes and do them more efficiently.
- Internal operations
- Some RAID operations are very complex—especially RAID 5 writes, which have to calculate parity bits that can be used to rebuild data in the event of a failure. The controller needs to use some memory for this type of internal operation. This is one reason why RAID 5 can perform poorly on some controllers: it needs to read a lot of data into the cache for good performance. Some controllers can’t balance caching writes with caching for the RAID 5 parity operations.
In general, the RAID controller’s memory is a scarce resource that you should try to use wisely. Using it for reads is usually a waste, but using it for writes is an important way to speed up your I/O performance. Many controllers let you choose how to allocate the memory. For example, you can choose how much of it to use for caching writes and how much for reads. For RAID 0, RAID 1, and RAID 10, you should probably allocate 100% of the controller’s memory for caching writes. For RAID 5, you should reserve some of the controller’s memory for its internal operations. This is generally good advice, but it doesn’t always apply—different RAID cards require different configurations.
When you’re using the RAID cache for write caching, many controllers let you configure how long it’s acceptable to delay the writes (one second, five seconds, and so on). A longer delay means more writes can be grouped together and flushed to the disks optimally. The downside is that your writes will be more “bursty.” That’s not a bad thing, unless your application happens to make a bunch of write requests just as the controller’s cache fills up, when it’s about to be flushed to disk. If there’s not enough room for your application’s write requests, it’ll have to wait. Keeping the delay shorter means you’ll have more write operations and they’ll be less efficient, but it smooths out the spikiness and helps keep more of the cache free to handle bursts from the application. (We’re simplifying here—controllers often have complex, vendor-specific balancing algorithms, so we’re just trying to cover the basic principles.)
The write cache is very helpful for synchronous writes, such as issuing fsync()
calls on the transaction logs and creating binary logs with sync_binlog
enabled, but you shouldn’t enable it unless your controller has a battery backup unit (BBU) or other nonvolatile storage. Caching writes without a BBU is likely to corrupt your database, and even your transactional filesystem, in the event of power loss. If you have a BBU, however, enabling the write cache can increase performance by a factor of 20 or more for workloads that do a lot of log flushes, such as flushing the transaction log when a transaction commits.
A final consideration is that many hard drives have write caches of their own, which can “fake” fsync()
operations by lying to the controller that the data has been written to physical media. Hard drives that are attached directly (as opposed to being attached to a RAID controller) can sometimes let their caches be managed by the operating system, but this doesn’t always work either. These caches are typically flushed for an fsync()
and bypassed for synchronous I/O, but again, the hard drive can lie. You should either ensure that these caches are flushed on fsync()
or disable them because they are not battery-backed. Hard drives that aren’t managed properly by the operating system or RAID firmware have caused many instances of data loss.
For this and other reasons, it’s always a good idea to do genuine crash testing (literally pulling the power plug out of the wall) when you install new hardware. This is often the only way to find subtle misconfigurations or sneaky hard drive behaviors. A handy script for this can be found online.
To test whether you can really rely on your RAID controller’s BBU, make sure you leave the power cord unplugged for a realistic amount of time. Some units don’t last as long without power as they’re supposed to. Here again, one bad link can render your whole chain of storage components useless.
Network Configuration
Just as latency and throughput are limiting factors for a hard drive, latency and bandwidth are limiting factors for a network connection. The biggest problem for most applications is latency; a typical application does a lot of small network transfers, and the slight delay for each transfer adds up.
A network that’s not operating correctly is a major performance bottleneck, too. Packet loss is a common problem. Even 1% loss is enough to cause significant performance degradation because various layers in the protocol stack will try to fix the problems with strategies such as waiting a while and then resending packets, which adds extra time. Another common problem is broken or slow DNS resolution.1
DNS is enough of an Achilles’ heel that enabling skip_name_resolve
is a good idea for production servers. Broken or slow DNS resolution is a problem for lots of applications, but it’s particularly severe for MySQL. When MySQL receives a connection request, it does both a forward and a reverse DNS lookup. There are lots of reasons this could go wrong. When it does, it will cause connections to be denied, slow down the process of connecting to the server, and generally wreak havoc, up to and including denial-of-service attacks. If you enable the skip_name_resolve
option, MySQL won’t do any DNS lookups at all. However, this also means that your user accounts must have only IP addresses, “localhost,” or IP address wildcards in the host
column. Any user account that has a hostname in the host
column will not be able to log in.
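A quick sketch of rolling this out follows; the configuration file path is distribution specific, and the audit query simply lists existing accounts so you can spot hostname-based ones before they get locked out:
# In the [mysqld] section of your MySQL configuration file, then restart the server:
skip_name_resolve
# Before enabling it, review existing accounts; any host value that is a hostname
# (rather than an IP address, IP wildcard, or localhost) will no longer be able to log in.
$ mysql -e "SELECT user, host FROM mysql.user ORDER BY host;"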
It’s usually more important, though, to adjust your settings to deal efficiently with a lot of connections and small queries. One of the more common tweaks is to change your local port range. Linux systems have a range of local ports that can be used. When the connection is made back to a caller, it uses a local port. If you have many simultaneous connections, you can run out of local ports.
Here’s a system that is configured to default values:
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
Sometimes you might need to change these values to a larger range. For example:
$ echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range
The TCP protocol allows a system to queue up incoming connections, like a bucket. If the bucket fills up, clients won’t be able to connect. You can allow more connections to queue up as follows:
$ echo 4096 > /proc/sys/net/ipv4/tcp_max_syn_backlog
For database servers that are used only locally, you can shorten the timeout that comes after closing a socket in the event that the peer is broken and doesn’t close its side of the connection. The default is one minute on most systems, which is rather long:
$ echo <value> > /proc/sys/net/ipv4/tcp_fin_timeout
Most of the time, these settings can be left at their defaults. You’ll typically need to change them only when something unusual is happening, such as extremely poor network performance or very large numbers of connections. An Internet search for “TCP variables” will turn up lots of good reading about these and many more variables.
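Note that writing to /proc takes effect immediately but does not survive a reboot. A sketch of the equivalent, persistent approach with sysctl (the file name under /etc/sysctl.d is arbitrary):
# Apply at runtime, equivalent to the echo commands above.
$ sysctl -w net.ipv4.ip_local_port_range="1024 65535"
$ sysctl -w net.ipv4.tcp_max_syn_backlog=4096
# Persist across reboots.
$ cat > /etc/sysctl.d/60-mysql-network.conf <<'EOF'
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_max_syn_backlog = 4096
EOF
$ sysctl --system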
Choosing a Filesystem
Your filesystem choices are pretty dependent on your operating system. In many systems, such as Windows, you really have only one or two choices, and only one (NTFS) is really viable. GNU/Linux, on the other hand, supports many filesystems.
Many people want to know which filesystems will give the best performance for MySQL on GNU/Linux or, even more specifically, which of the choices is best for InnoDB. The benchmarks actually show that most of them are very close in most respects, but looking to the filesystem for performance is really a distraction. The filesystem’s performance is very workload specific, and no filesystem is a magic bullet. Most of the time, a given filesystem won’t perform significantly better or worse than any other filesystem. The exception is if you run into some filesystem limit, such as how it deals with concurrency, working with many files, fragmentation, and so on.
Overall, you’re best off using a journaling filesystem, such as ext4, XFS, or ZFS. If you don’t, a filesystem check after a crash can take a long time.
If you use ext3 or its successor, ext4, you have three options for how the data is journaled, which you can place in the /etc/fstab mount options:
data=writeback
- This option means only metadata writes are journaled. Writes to the metadata are not synchronized with the data writes. This is the fastest configuration, and it’s usually safe to use with InnoDB because it has its own transaction log. The exception is that a crash at just the right time could cause corruption in a .frm file on a pre-8.0 version of MySQL.
Here’s an example of how this configuration could cause problems. Say a program decides to extend a file to make it larger. The metadata (the file’s size) will be logged and written before the data is actually written to the (now larger) file. The result is that the file’s tail—the newly extended area—contains garbage.
data=ordered
- This option also journals only the metadata, but it provides some consistency by writing the data before the metadata so it stays consistent. It’s only slightly slower than the writeback option, and it’s much safer when there’s a crash. In this configuration, if we suppose again that a program wants to extend a file, the file’s metadata won’t reflect the file’s new size until the data that resides in the newly extended area has been written.
data=journal
- This option provides atomic journaled behavior, writing the data to the journal before it’s written to the final location. It is usually unnecessary and has much higher overhead than the other two options. However, in some cases it can improve performance because the journaling lets the filesystem delay the writes to the data’s final location.
Regardless of the filesystem, there are some specific options that it’s best to disable because they don’t provide any benefit and can add quite a bit of overhead. The most famous is recording access time, which requires a write even when you’re reading a file or directory. To disable this option, add the noatime,nodiratime
mount options to your /etc/fstab; this can sometimes boost performance by as much as 5%–10%, depending on the workload and the filesystem (although it might not make much difference in other cases). Here’s a sample /etc/fstab line for the ext3 options we mentioned:
/dev/sda2 /usr/lib/mysql ext3 noatime,nodiratime,data=writeback 0 1
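You can apply the new options without a reboot by remounting the filesystem; a quick sketch using the mount point from the sample line above:
# Remount with the access-time options and verify what is active.
$ mount -o remount,noatime,nodiratime /usr/lib/mysql
$ findmnt -no OPTIONS /usr/lib/mysql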
You can also tune the filesystem’s read-ahead behavior because it might be redundant. For example, InnoDB does its own read-ahead prediction. Disabling or limiting read-ahead is especially beneficial on Solaris’s UFS. Using innodb_flush_method=O_DIRECT
automatically disables read-ahead.
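At the device level, you can also inspect and lower read-ahead with blockdev; the device name below is an example, and the values are expressed in 512-byte sectors:
# Show the current read-ahead setting for the device, in sectors.
$ blockdev --getra /dev/sda
# Lower it if it turns out to be redundant with InnoDB's own prefetching.
$ blockdev --setra 64 /dev/sda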
Some filesystems don’t support features you might need. For example, support for direct I/O might be important if you’re using the O_DIRECT
flush method for InnoDB. Also, some filesystems handle a large number of underlying drives better than others; XFS is often much better at this than ext3, for instance. Finally, if you plan to use Logical Volume Manager (LVM) snapshots for initializing replicas or taking backups, you should verify that your chosen filesystem and LVM version work well together.
Table 4-2 summarizes the characteristics of some common filesystems.
Filesystem | Operating system | Journaling | Large directories |
---|---|---|---|
ext3 | GNU/Linux | Optional | Optional/partial |
ext4 | GNU/Linux | Yes | Yes |
Journaled File System (JFS) | GNU/Linux | Yes | No |
NTFS | Windows | Yes | Yes |
ReiserFS | GNU/Linux | Yes | Yes |
UFS (Solaris) | Solaris | Yes | Tunable |
UFS (FreeBSD) | FreeBSD | No | Optional/partial |
UFS2 | FreeBSD | No | Optional/partial |
XFS | GNU/Linux | Yes | Yes |
ZFS | GNU/Linux, Solaris, FreeBSD | Yes | Yes |
We usually recommend using the XFS filesystem. The ext3 filesystem just has too many serious limitations, such as its single mutex per inode, and bad behavior, such as flushing all dirty blocks in the whole filesystem on fsync()
instead of just one file’s dirty blocks. The ext4 filesystem is an acceptable choice, although there have been performance bottlenecks in specific kernel versions that you should investigate before committing to it.
When considering any filesystem for a database, it’s good to consider how long it has been available, how mature it is, and how proven it has been in production environments. The filesystem bits are the very lowest level of data integrity you have in a database.
Choosing a Disk Queue Scheduler
On GNU/Linux, the queue scheduler determines the order in which requests to a block device are actually sent to the underlying device. The default is Completely Fair Queuing, or cfq. It’s okay for casual use on laptops and desktops, where it helps prevent I/O starvation, but it’s terrible for servers. It causes very poor response times under the types of workload that MySQL generates because it stalls some requests in the queue needlessly.
You can see which schedulers are available and which one is active with the following command:
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
You should replace sda
with the device name of the disk you’re interested in. In our example, the square brackets indicate which scheduler is in use for this device. The other two choices are suitable for server-class hardware, and in most cases they work about equally well. The noop
scheduler is appropriate for devices that do their own scheduling behind the scenes, such as hardware RAID controllers and storage area networks (SANs), and deadline
is fine for both RAID controllers and disks that are directly attached. Our benchmarks show very little difference between these two. The main thing is to use anything but cfq, which can cause severe performance problems.
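Changing the scheduler is a runtime operation; a brief sketch follows. Note that kernels using the multiqueue block layer expose different scheduler names, where none and mq-deadline play the roles of noop and deadline:
# Switch the active scheduler for a device; this does not persist across reboots.
$ echo deadline > /sys/block/sda/queue/scheduler
# To make the choice permanent, use a udev rule or your distribution's tuning facility.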
Memory and Swapping
MySQL performs best with a large amount of memory allocated to it. As we learned in Chapter 1, InnoDB uses memory as a cache to avoid disk access. This means that the performance of the memory system can have a direct impact on how fast queries are served. Even today, one of the best ways to ensure faster memory access is to replace the built-in memory allocator (glibc) with an external one such as tcmalloc or jemalloc. Numerous benchmarks2 have shown that both of these offer improved performance and reduced memory fragmentation when compared with glibc.
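A minimal sketch of swapping in jemalloc on a systemd-managed install follows; the service name, library path, and single-instance assumption are all environment specific, so treat it as a starting point rather than a recipe:
# Preload jemalloc via a systemd unit override (service name and library path vary).
$ systemctl edit mysqld
# Add the following to the override file, then save and exit:
# [Service]
# Environment="LD_PRELOAD=/usr/lib64/libjemalloc.so.2"
$ systemctl restart mysqld
# Confirm the allocator is mapped into the running process (assumes one mysqld instance).
$ grep -c jemalloc /proc/$(pidof mysqld)/maps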
Swapping occurs when the operating system writes some virtual memory to disk because it doesn’t have enough physical memory to hold it. Swapping is transparent to processes running on the operating system. Only the operating system knows whether a particular virtual memory address is in physical memory or on disk.
When using SSDs, the performance penalty isn’t nearly as sharp as it used to be with HDDs. You should still actively avoid swapping—even if just to avoid unnecessary writes that may shorten the overall life span of the disk. You may also consider taking the approach of using no swap, which forgoes the potential altogether but does put you in a situation where running out of memory may lead to process termination.
On GNU/Linux, you can monitor swapping with vmstat (we show some examples in the next section). You need to look at the swap I/O activity, reported in the si and so columns, rather than the swap usage, which is reported in the swpd column. The swpd column can show processes that have been loaded but aren’t being used, which are not really problematic. We like the si and so column values to be 0, and they should definitely be less than 10 blocks per second.
In extreme cases, too much memory allocation can cause the operating system to run out of swap space. If this happens, the resulting lack of virtual memory can crash MySQL. But even if it doesn’t run out of swap space, very active swapping can cause the entire operating system to become unresponsive, to the point that you can’t even log in and kill the MySQL process. Sometimes the Linux kernel can even hang completely when it runs out of swap space. We recommend you run your databases without using swap space at all. Disk is still an order of magnitude slower than RAM, and this avoids all of the headaches mentioned here.
Another thing that frequently happens under extreme virtual memory pressure is that the out-of-memory (OOM) killer process will kick in and kill something. This is frequently MySQL, but it can also be another process such as SSH, which can leave you with a system that’s not accessible from the network. You can prevent this by setting the SSH process’s oom_adj
or oom_score_adj
value. When working with dedicated database servers, we highly recommend that you identify any key processes like MySQL and SSH and proactively adjust the OOM killer score to prevent those from being selected first for termination.
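A hedged sketch of doing this for mysqld follows; the same approach applies to sshd. Writing -1000 to oom_score_adj exempts the process entirely, and the systemd directive shown in the comments makes the setting survive restarts (unit names vary by distribution):
# Exempt the running mysqld from the OOM killer (assumes a single instance; repeat for sshd).
$ echo -1000 > /proc/$(pidof mysqld)/oom_score_adj
# With systemd, the equivalent persistent setting goes in the service unit:
# [Service]
# OOMScoreAdjust=-1000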
You can solve most swapping problems by configuring your MySQL buffers correctly, but sometimes the operating system’s virtual memory system decides to swap MySQL anyway, sometimes related to how nonuniform memory access (NUMA) works3 in Linux. This usually happens when the operating system sees a lot of I/O from MySQL, so it tries to increase the file cache to hold more data. If there’s not enough memory, something must be swapped out, and that something might be MySQL itself. Some older Linux kernel versions also have counterproductive priorities that swap things when they shouldn’t, but this has been alleviated a bit in more recent kernels.
Operating systems usually allow some control over virtual memory and I/O. We mention a few ways to control them on GNU/Linux. The most basic is to change the value of /proc/sys/vm/swappiness to a low value, such as 0 or 1. This tells the kernel not to swap unless the need for virtual memory is extreme. For example, here’s how to check the current value:
$ cat /proc/sys/vm/swappiness
60
The value shown, 60, is the default swappiness setting (the range is from 0 to 100). This is a very bad default for servers. It’s only appropriate for laptops. Servers should be set to 0:
$ echo 0 > /proc/sys/vm/swappiness
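The echo above takes effect immediately but is lost at reboot. One way to persist it, sketched here with an arbitrary file name under /etc/sysctl.d:
$ echo "vm.swappiness = 0" > /etc/sysctl.d/60-mysql-swappiness.conf
$ sysctl --system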
Another option is to change how the storage engines read and write data. For example, using innodb_flush_method=O_DIRECT
relieves I/O pressure. Direct I/O is not cached, so the operating system doesn’t see it as a reason to increase the size of the file cache. This parameter works only for InnoDB.
Another option is to use MySQL’s memlock
configuration option, which locks MySQL in memory. This will avoid swapping, but it can be dangerous: if there’s not enough lockable memory left, MySQL can crash when it tries to allocate more memory. Problems can also be caused if too much memory is locked and there’s not enough left for the operating system.
Many of the tricks are specific to a kernel version, so be careful, especially when you upgrade. In some workloads, it’s hard to make the operating system behave sensibly, and your only recourse might be to lower the buffer sizes to suboptimal values.
Operating System Status
Your operating system provides tools to help you find out what the operating system and hardware are doing. In this section, we’ll show you examples of how to use two widely available tools, iostat and vmstat. If your system doesn’t provide either of these tools, chances are it will provide something similar. Thus, our goal isn’t to make you an expert at using iostat or vmstat but simply to show you what to look for when you’re trying to diagnose problems with tools such as these.
In addition to these tools, your operating system might provide others, such as mpstat or sar. If you’re interested in other parts of your system, such as the network, you might want to use tools such as ifconfig (which shows how many network errors have occurred, among other things) or netstat instead.
By default, vmstat and iostat produce just one report showing the average values of various counters since the server was started, which is not very useful. However, you can give both tools an interval argument. This makes them generate incremental reports showing what the server is doing right now, which is much more relevant. (The first line shows the statistics since the system was started; you can just ignore this line.)
How to read vmstat output
Let’s look at an example of vmstat first. To make it print out a new report every five seconds, reporting sizes in megabytes, use the following command:
$ vmstat -SM 5
procs -------memory------- -swap- -----io---- ---system---- ------cpu-----
 r  b swpd free buff cache  si  so    bi    bo     in     cs us sy id wa st
11  0    0 2410    4 57223   0   0  9902 35594 122585 150834 10  3 85  1  0
10  2    0 2361    4 57273   0   0 23998 35391 124187 149530 11  3 84  2  0
You can stop vmstat with Ctrl-C. The output you see depends on your operating system, so you might need to read the manual page to figure it out.
As stated earlier, even though we asked for incremental output, the first line of values shows the averages since the server was booted. The second line shows what’s happening right now, and subsequent lines will show what’s happening at five-second intervals. The columns are grouped by one of the following headers:
- procs
- The r column shows how many processes are waiting for CPU time. The b column shows how many are in uninterruptible sleep, which generally means they’re waiting for I/O (disk, network, user input, and so on).
- memory
- The swpd column shows how many blocks are swapped out to disk (paged). The remaining three columns show how many blocks are free (unused), how many are being used for buffers (buff), and how many are being used for the operating system’s cache.
- swap
- These columns show swap activity: how many blocks per second the operating system is swapping in (from disk) and out (to disk). They are much more important to monitor than the swpd column. We like to see si and so at 0 most of the time, and we definitely don’t like to see more than 10 blocks per second. Bursts are also bad.
- io
- These columns show how many blocks per second are read in from (bi) and written out to (bo) block devices. This usually reflects disk I/O.
- system
- These columns show the number of interrupts per second (in) and the number of context switches per second (cs).
- cpu
- These columns show the percentages of total CPU time spent running user (nonkernel) code, running system (kernel) code, idle, and waiting for I/O. A possible fifth column (st) shows the percent “stolen” from a virtual machine if you’re using virtualization. This refers to the time during which something was runnable on the virtual machine, but the hypervisor chose to run something else instead. If the virtual machine doesn’t want to run anything and the hypervisor runs something else, that doesn’t count as stolen time.
The vmstat output is system dependent, so you should read your system’s vmstat(8)
manpage if yours looks different from the sample we’ve shown.
How to read iostat output
Now let’s move on to iostat. By default, it shows some of the same CPU usage information as vmstat. We’re usually interested in just the I/O statistics, though, so we use the following command to show only extended device statistics:
$ iostat -dxk 5
Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s
sda        0.00    0.00 1060.40 3915.00 8483.20 42395.20
       avgrq-sz avgqu-sz await r_await w_await svctm  %util
          20.45     3.68  0.74    0.57    0.78  0.20  98.22
As with vmstat, the first report shows averages since the server was booted (we generally omit it to save space), and the subsequent reports show incremental averages. There’s one line per device.
There are various options that show or hide columns. The official documentation is a bit confusing, and we had to dig into the source code to figure out what was really being shown. Here’s what each column is showing:
rrqm/s and wrqm/s
- The number of merged read and write requests queued per second. Merged means the operating system took multiple logical requests from the queue and grouped them into a single request to the actual device.
r/s and w/s
- The number of read and write requests sent to the device per second.
rkB/s and wkB/s
- The number of kilobytes read and written per second.
avgrq-sz
- The request size in sectors.
avgqu-sz
- The number of requests waiting in the device’s queue.
await
- The number of milliseconds spent in the disk queue.
r_await and w_await
- The average time in milliseconds for requests issued to the device to be served, for reads and writes respectively. This includes the time spent by the requests in queue and the time spent servicing them.
svctm
- The number of milliseconds spent servicing requests, excluding queue time.
%util4
- The percentage of time during which at least one request was active. This is very confusingly named. It is not the device’s utilization, if you’re familiar with the standard definition of utilization in queuing theory. A device with more than one hard drive (such as a RAID controller) should be able to support a higher concurrency than 1, but %util will never exceed 100% unless there’s a rounding error in the math used to compute it. As a result, it is not a good indication of device saturation, contrary to what the documentation says, except in the special case where you’re looking at a single physical hard drive.
You can use the output to deduce some facts about a machine’s I/O subsystem. One important metric is the number of requests served concurrently. Because the reads and writes are per second and the service time’s unit is thousandths of a second, you can use Little’s law to derive the following formula for the number of concurrent requests the device is serving:
concurrency = (r/s + w/s) * (svctm/1000)
Plugging the preceding sample numbers into the concurrency formula gives a concurrency of about 0.995. This means that on average, the device was serving less than one request at a time during the sampling interval.
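If you want to check the arithmetic yourself, the calculation is simple enough to do with a one-liner; the numbers below are the r/s, w/s, and svctm values from the sample output:
$ awk 'BEGIN { r = 1060.40; w = 3915.00; svctm = 0.20; printf "%.3f\n", (r + w) * (svctm / 1000) }'
0.995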
Other Helpful Tools
We’ve shown vmstat and iostat because they’re widely available, and vmstat is usually installed by default on many Unix-like operating systems. However, each of these tools has its limitations, such as confusing units of measurement, sampling at intervals that don’t correspond to when the operating system updates the statistics, and the inability to see all of the metrics at once. If these tools don’t meet your needs, you might be interested in dstat or collectl.
We also like to use mpstat to watch CPU statistics; it provides a much better idea of how the CPUs are behaving individually, instead of grouping them all together. Sometimes this is very important when you’re diagnosing a problem. You might find blktrace to be helpful when you’re examining disk I/O usage, too.
Percona wrote its own replacement for iostat called pt-diskstats. It’s part of Percona Toolkit. It addresses some of the complaints about iostat, such as the way it presents reads and writes in aggregate and the lack of visibility into concurrency. It is also interactive and keystroke driven, so you can zoom in and out, change the aggregation, filter out devices, and show and hide columns. It is a great way to slice and dice a sample of disk statistics, which you can gather with a simple shell script even if you don’t have the tool installed. You can capture samples of disk activity and email or save them for later analysis.
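Here is a minimal sketch of such a sampling loop; the interval, file name, and exact timestamp format are assumptions, so check the pt-diskstats documentation for the format it expects before relying on it:
# Append a timestamp line and a copy of /proc/diskstats once per second.
$ while true; do
    date +"TS %s.%N %F %T"
    cat /proc/diskstats
    sleep 1
  done >> diskstats-samples.txt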
Lastly, perf, the Linux profiler, is an invaluable tool for inspecting what is going on at the operating system level. You can use perf to inspect general information about the operating system, such as why the kernel is using CPU so much. You can also inspect specific process IDs, allowing you to see how MySQL is interacting with the operating system. Inspecting system performance is a very deep dive, so we recommend Systems Performance, Second Edition by Brendan Gregg (Pearson) as excellent follow-up reading.
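As a hedged example of that second use case, the following records on-CPU call stacks for the server process and then summarizes them (it assumes a single mysqld instance and that debug symbols are installed, without which the stacks are hard to read):
# Live, whole-system view of the hottest kernel and user functions.
$ perf top
# Sample MySQL's on-CPU call stacks for 30 seconds, then print a report.
$ perf record -g -p $(pidof mysqld) -- sleep 30
$ perf report --stdio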
Summary
Choosing and configuring hardware for MySQL, and configuring MySQL for the hardware, is not a mystical art. In general, you need the same skills and knowledge that you need for most other purposes. However, there are some MySQL-specific things you should know.
What we commonly suggest for most people is to find a good balance between performance and cost. First, we like to use commodity servers, for many reasons. For example, if you’re having trouble with a server and you need to take it out of service while you try to diagnose it, or if you simply want to try swapping it with another server as a form of diagnosis, this is a lot easier to do with a $5,000 server than one that costs $50,000 or more. MySQL is also typically a better fit—both in terms of the software itself and in terms of the typical workloads it runs—for commodity hardware.
The four fundamental resources MySQL needs are CPU, memory, disk, and network resources. The network doesn’t tend to show up as a serious bottleneck very often, but CPUs, memory, and disks certainly do. The balance of speed and quantity really depends on the workload, and you should strive for a balance of fast and many as your budget allows. The more concurrency you expect, the more you should lean on more CPUs to accommodate your workload.
The relationship between CPUs, memory, and disks is intricate, with problems in one area often showing up elsewhere. Before you throw resources at a problem, ask yourself whether you should be throwing resources at a different problem instead. If you’re I/O bound, do you need more I/O capacity, or just more memory? The answer hinges on the working set size, which is the set of data that’s needed most frequently over a given duration.
Solid-state devices are great for improving server performance overall and should generally be the standard for databases now, especially OLTP workloads. The only reason to continue using HDDs is in extremely budget-constrained systems or ones where you need a staggeringly high amount of disk space—on the order of petabytes in a data-warehousing situation.
In terms of the operating system, there are just a few Big Things that you need to get right, mostly related to storage, networking, and virtual memory management. If you use GNU/Linux, as most MySQL users do, we suggest using the XFS filesystem and setting the swappiness and disk queue scheduler to values that are appropriate for a server. There are some network parameters that you might need to change, and you might wish to tweak a number of other things (such as disabling SELinux), but those changes are a matter of preference.
1 Popular haiku: It’s not DNS. There’s no way it’s DNS. It was DNS.
2 See the blog posts “Impact of Memory Allocators on MySQL Performance” and “MySQL (or Percona) Memory Usage Tests” for comparisons.
3 See this blog post for more.
4 Software RAID, like MD/RAID, may not show utilization for the RAID array itself.