First thoughts regarding the MySQL Falcon storage engine
By Osma on Thursday 18 January 2007, 17:40 - Permalink
One of my DBA colleagues mentioned that MySQL has released the first alpha version of the Falcon storage engine. It is advertised as making the most efficient use of modern hardware to provide a high-performance, scalable replacement for InnoDB, on which MySQL naturally wants to reduce its dependency.
Unfortunately, based only on reading the Falcon documentation, I must conclude that, for a number of reasons, it won't be usable without extensive further development for very large installations such as the ones we run for Habbo. I'm usually much more positive about MySQL, since it is after all the technology that has enabled Habbo to grow more than 100% every year I've been working on it, but this is a disappointment.
Falcon supports just one tablespace per database, and each tablespace stores all data in a single file. While the concurrency problems of single-file access can be eliminated with careful application of modern kernel, filesystem and disk subsystem technology, single-file databases still suffer from major administration issues.
Since a database can't be extended with additional tablespaces, with the storage engine migrating data between them, you had better either trust your capacity to increase the available storage space under one filesystem indefinitely, or be in a position where downtime is never a problem for you. Don't even think about deploying Falcon without a high-end NAS device that supports many times your current storage requirements, reliable logical volume management and an extendable file system. A database is also limited by the filesystem's maximum file size, so make sure that won't be a problem either. I wouldn't recommend ext3 for Falcon.
You'll also need to make backups either via SQL dumping the entire database (not really feasible for daily routine) or by backing up a single file, so either your filesystem, LVM system or storage device must support snapshot backups. Scalability may still become an issue, so be sure that the approach you choose doesn't degrade performance as file size grows.
Having just one thread write to disk may at first blush sound like an excellent performance maximisation technique, but it forces you to choose between reliability (since it applies to log writes too, transactions are committed to disk in a serialized fashion, with no concurrency) and scalability ("commits" to a RAM cache with background disk flushes will certainly perform well and scale nicely, but what if there's a power failure?). Nor is this even the road to the highest possible performance: the highest-end disk subsystems will become CPU-limited if only one thread is able to send I/O requests.
With one tablespace comes one cache/buffer pair, so developers are forced either to split their data model into multiple logical databases or to suffer under one unpartitionable system, where one bad table scan by one part of the application wipes the buffers out from underneath the entire application. A truly modern storage system permits the DBA to assign certain tables or indices to their own caches and buffer spaces while retaining a single logical model for software developers. MySQL has never had this ability, and apparently Falcon won't bring it either.
A more traditional DBA might also cringe at the statement "it is impossible to predict or calculate the disk storage space required for a specific dataset." Many, many complaints could be made about the alpha-release's other restrictions, but I'll give MySQL the chance to keep their promise to address them in forthcoming versions.
I don't really understand which of its features qualify it as technology that utilises modern computers to the best possible effect. Perhaps they're referring to it automatically compressing data on disk? Sure, that may be useful, but it may just as well become a bottleneck when single-row updates require entire pages to be recompressed. That feature alone doesn't impress me. Falcon is not easier to administer, nor does it (on paper at least) address the kinds of performance issues described above. At best, it's an upgrade to MyISAM, but it shouldn't be mistaken for a solution to high-performance transactional database requirements.
More on it once I've had a chance to do some practical experimentation (might be a while).
Update: It seems Peter Zaitsev has benchmarked Falcon against MySQL's other storage engines, verifying my suspicion that it doesn't scale properly. Do note that neither MyISAM nor InnoDB shows ideal scaling performance either.
Comments
Don't write it off yet -- it's in early alpha. For certain workloads it may be ideal after it becomes more mature. Of course, Mr. Starkey has a very particular way of doing things, and certain goals -- so who knows what that workload will be.
Thanks - I realize I'm commenting on an alpha, and worse yet, based on its docs instead of the code. Yet I can't help but think that some of the underlying choices in the entire thing make it at best suited to a small-scale deployment (both in terms of maintainability and performance).
First, thanks very much for your comments - it's only by getting this type of feedback that we can work to make things better. Next, some things for Falcon - such as the actual tablespace architecture - are not set in stone yet, so now is the time to get valuable input such as yours and see what can be done for v1 of Falcon. If you would like to participate and review the design plans, please contact me and I'll make sure that happens.
Thank you for taking the time to review Falcon. Early feedback is very valuable both as a critique of the design and as an indication of user priorities. Still, this is the first alpha release. Significant additional features are still under development and others in design.
Falcon is designed around what we now call tablespaces (internally, they’re called databases). A Falcon tablespace is a single file containing user tables, indexes, blobs, sequences, system tables, space management information, etc. For the first alpha, there is one Falcon tablespace per MySQL schema/database. Before beta, we will shift to a model of a single Falcon tablespace per server (similar to InnoDB). When we straighten out the server issues, we will expose the option of explicit tablespaces with placement control. I believe this will give you exactly what you want. If not, I’d like to hear about it.
You are probably aware that much work is underway for online backups. We certainly plan to tie Falcon into that long before the official Falcon beta. We’re not there yet, but again, this is only an alpha.
The documentation is also still in alpha and may not reflect a full understanding of Falcon. Specifically, disk writes are not queued to a single thread. All server threads write to the serial log. At commit time, the threads either flush the log themselves or join the group waiting for the last flush to finish. In theory any server thread could also force the cache manager to write a dirty page to free a buffer in the page cache. In practice, the cache management scheme makes that event unlikely. The scheduler thread writes to the tablespace file when checkpointing the page cache. The page writer thread forces large blobs to the tablespace file prior to commit. Neither of these is critical to performance. Only serial log writes are time critical.
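The commit behaviour described above (a thread either flushes the log itself or joins the group waiting on a flush) is commonly called group commit. What follows is only a minimal Python sketch of that general idea, not Falcon's code: the class GroupCommitLog, its commit method and the _write_to_disk stand-in are all hypothetical names, and a list append takes the place of a real write plus fsync.

```python
import threading


class GroupCommitLog:
    """Toy serial log: many threads append, one flush covers a whole group.

    Hypothetical illustration only; not taken from Falcon.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []      # records appended but not yet flushed
        self.durable = []       # records covered by a completed flush
        self._next_seq = 0      # sequence number for the next record
        self._flushed_seq = -1  # highest sequence number made durable
        self._flushing = False  # True while some thread is flushing
        self.flush_count = 0    # number of flushes actually performed

    def _write_to_disk(self, batch):
        # Stand-in for write() + fsync() on a real log file.
        self.durable.extend(batch)

    def commit(self, record):
        """Append a record and return once a flush has covered it."""
        with self._cond:
            seq = self._next_seq
            self._next_seq += 1
            self._pending.append(record)
            while self._flushed_seq < seq:
                if self._flushing:
                    # Someone else is flushing: join the waiting group.
                    self._cond.wait()
                    continue
                # Nobody is flushing: flush everything pending ourselves.
                self._flushing = True
                batch, self._pending = self._pending, []
                last = self._next_seq - 1
                self._cond.release()  # let other threads append meanwhile
                try:
                    self._write_to_disk(batch)
                finally:
                    self._cond.acquire()
                self._flushed_seq = last
                self._flushing = False
                self.flush_count += 1
                self._cond.notify_all()
        return seq
```

Records that arrive while a flush is in progress pile up in the pending list, and the next flusher writes them all at once, so under load the number of flushes can be much smaller than the number of commits.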
Falcon performs prefix and suffix compression plus variable length binary representation in indexes. Otherwise, Falcon does not compress data in the classic sense. Instead, Falcon has a dense record encoding based on field values not declarations. The last time I studied the issue, the elapsed time to decompress a page with zlib was longer than the time to read the page off disk. Even disregarding the performance cost, large blobs are often PDFs and JPEGs, which are compressed internally and do not shrink with further compression. I reserve the right to change my mind in the future, but at this point, we don’t have a suitable compression scheme for database use.
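To make the index prefix compression mentioned above concrete, here is a small Python sketch of the generic technique: each key in a sorted run is stored as the number of leading bytes it shares with the previous key, plus the remaining bytes. This is an assumption about the general shape of such schemes, not Falcon's on-disk format; the function names are hypothetical, and suffix compression and the variable-length binary representation are not covered.

```python
def prefix_compress(sorted_keys):
    """Encode sorted byte-string keys as (shared_prefix_length, remainder)."""
    entries = []
    prev = b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        # Count the leading bytes this key has in common with the previous one.
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        entries.append((shared, key[shared:]))
        prev = key
    return entries


def prefix_decompress(entries):
    """Rebuild the original keys from (shared_prefix_length, remainder) pairs."""
    keys = []
    prev = b""
    for shared, remainder in entries:
        key = prev[:shared] + remainder
        keys.append(key)
        prev = key
    return keys


keys = [b"habbo", b"habbo_hotel", b"habit", b"mysql"]
entries = prefix_compress(keys)
# entries == [(0, b"habbo"), (5, b"_hotel"), (3, b"it"), (0, b"mysql")]
```

Because neighbouring keys in a sorted index tend to share long prefixes, the stored remainders are usually much shorter than the full keys.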
The internal design of Falcon was driven by two factors: the availability of very large, cheap memory and SMP. An update in a transaction operates in memory only. When a thread commits, it writes its records and index updates to the serial log with a single atomic operation. That single write can commit many parallel transactions. When that write completes, it ends the transaction or transactions. They are now durable. The work is not finished; the record and index updates must be merged to the database on disk. This task is not time-critical. The so-called “gopher” thread reads the log and integrates its contents into the tablespace file. The bottom line is that Falcon pumps transactions at memory speed, pausing only for a single serial log write to commit the data to persistent oxide. Other examples of modern design include fine grain threading to maximize use of multiple processors, user mode read/write thread synchronization primitives to minimize contention among threads, and non-blocking management of critical data structures with user mode interlocked instructions.
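The split between a time-critical log write and a background merge can be sketched in a few lines of Python. This is only a toy model under stated assumptions: ToyEngine and its methods are hypothetical names, an in-memory queue stands in for the on-disk serial log, a dict stands in for the tablespace file, and nothing here is actually durable.

```python
import queue
import threading


class ToyEngine:
    """Toy model: commit appends to a log, a background thread merges it."""

    def __init__(self):
        self.serial_log = queue.Queue()  # stand-in for the serial log
        self.tablespace = {}             # stand-in for the tablespace file
        self._gopher = threading.Thread(target=self._gopher_loop, daemon=True)
        self._gopher.start()

    def commit(self, updates):
        # One log append per commit; in the design described above this is
        # the only step the committing thread has to wait for.
        self.serial_log.put(dict(updates))

    def _gopher_loop(self):
        # Background "gopher": read log records in order and merge them
        # into the tablespace. Not time-critical for the committers.
        while True:
            record = self.serial_log.get()
            self.tablespace.update(record)
            self.serial_log.task_done()

    def wait_for_gopher(self):
        # Block until every committed record has been merged.
        self.serial_log.join()


engine = ToyEngine()
engine.commit({"a": 1})
engine.commit({"a": 2, "b": 3})
engine.wait_for_gopher()
# engine.tablespace == {"a": 2, "b": 3}
```

The committing threads return as soon as their record is in the log; how far the merged state lags behind depends only on how quickly the gopher drains it.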
Peter Zaitsev’s numbers were indeed interesting. They got my undivided attention. Upon analysis, they demonstrated that there were avoidable thread bottlenecks when a large number of parallel auto-commit threads executed a trivial query in a hard loop. I am pleased to report that the Falcon alpha update, now in progress, performs four to six times faster on Peter’s primary key benchmark. That doesn’t make Falcon four to six times faster generally, but it now runs badly constructed benchmarks much faster. (There is a lesson here for people who follow benchmarks.) The changes didn’t make Falcon any slower running real life applications, so I guess we all win.
Again, I appreciate your work and value your comments. Feedback always makes the product better.