Fishpool

To content | To menu | To search

Tag - Performance

Entries feed - Comments feed

Wednesday 19 November 2008

Looking for a ETL engineer for our BI team

So, I mentioned earlier that I was looking at Infobright's Brighthouse technology as a storage backend for heaps and heaps of traffic and user data from Habbo. Turns out it works fine (now that it's in V3 and supports more of the SQL semantics), and we took it into use. Been pretty happy with that, and I expect to talk more about the challenge and our solution at the next MySQL Conference in April 2009.

However, our DWH team needs extra help. If you're interested in solving business analytics problems by processing lots of data and the idea of working in a company that leads the virtual worlds industry excites you, let us know by sending us an application. Thanks for reading!

Monday 15 September 2008

Infobright BI tools go open source

I've mentioned Infobright before as an interesting solution to getting more performance to BI analytics solutions. Today's news are interesting: Sun invests in the company, and the baseline product is open sourced. Too busy to write more about it today, but I'm certainly watching this one closely.

Monday 31 March 2008

Optimizing Linux for random I/O on hardware RAID

There's a relatively little known feature about Linux IO scheduling that has a pretty significant effect in large scale database deployments at least with MySQL that a recent article on MySQL Performance Blog prompted me to write about. This may have an effect on other databases and random I/O systems as well, but we've definitely seen it with MySQL 5.0 on RHEL 4 platform. I have not studied this on RHEL 5, and since the IO subsystem and the Completely Fair Queue scheduler that is default on RHEL kernels has received further tuning since, I can not say if it still exists.

Though I've heard YouTube discovered these same things, I have not yet seen a simple explanation of why this is so - so I'll take a shot at explaining it.

In short, a deployment with a RAID controller or external storage system visible to the operating system as a single block device will not reach its maximum performance under RHEL default settings, and can be easily coaxed about 20% higher on average random I/O (and significantly higher in spot benchmarks) with a single kernel parameter (elevator=noop) or equivalent runtime tuning via /sys/block/*/queue/scheduler in RHEL5, where you can also set this on a per-device basis.

We first saw this in 2005 on a quad-CPU server with a RAID controller connected to 10 SCSI disks. At that time, we found that configuring the RAID to expose five RAID-1 pairs which we then striped to a single volume using LVM increased performance despite making the OS and CPU do more work on I/O. The difference in performance was about 20%.

Our most recent proof of the same effect was a quad-CPU server connected to a NetApp storage system over FC. Since it was not convenient to expose multiple volumes from the NetApp to stripe them together, we searched for other solutions, and prompted by a presentation by the YouTube engineers looked at the I/O scheduling options and found a simple way to improve performance was to turn off I/O reordering by the kernel. Again, the overall impact between the settings was about 20%, though at times much greater.

The lesson is simple: reordering I/O requests multiple times provides no benefits, and reordering them too early will in fact be detrimental. Explaining why that is so is a bit involving, and is based on a few assumptions we have not bothered to verify, since the empirical results have supported our conclusions and got us where we wanted.

In order to keep the explanation simple, I will describe it conceptually on a very small scale. When reading this, please take this into account and understand that to measure the effect we have seen in practice, the size of the solution should be increased from what I am describing.

First, consider the case of direct-attached storage exposed to the Linux kernel as independent devices. In this configuration, the kernel maintains a per-device I/O queue, and the CFQ scheduler will reorder I/O requests to each device separately in order to maintain fair per-process balancing, low latency and high throughput. This is the configuration in which CFQ does a great job of maximizing performance, and works fairly well with any amount of spindles. As the application (a database in this case) fires random I/O, each of the spindles is executing them independently and serves requests as soon as they are issued. In other words, the system is good at keeping each of the I/O queues "hot". The sustained top I/O rate is roughly linear to the number of spindles, or with 15k rpm drives, about 1000 ops for four drives.

Now, lets introduce a hardware RAID of some sort, in particular one which is enabled to further reorder operations thanks to a "big" battery-backed up cache. Thanks to that cache, the RAID can commit thousands of write operations per second for fairly long periods (seconds), flushing them to disk after merging. On the other hand, the kernel now sees just one device, and has one I/O queue to it. The CFQ scheduler sits in front of this queue, reordering pending I/O requests. All is fine until the I/O pressure rises up to about what a single spindle can process on a sustained basis, or about 250 requests per second on those 15k drives. However, as soon as the queue starts building up, the CFQ scheduler kicks into action, and reorders the queue from random to sorted per block number (an oversimplification, but close enough).

All is good? No, it's not. The sequential blocks on that RAID volume are not truly sequential, but reside on different spindles and could thus be processed simultaneously. To demonstrate, lets assume your four-spindle array has one billion sectors or five hundred gigs per device, and further, that it is striped at 64k extents or 7.8 million stripes across each device.

On both configurations, the striping is essentially the same. Every 128 sectors or 64k is one one device, then the next one, and so on. The difference is that with LVM in place, the kernel knows this, while with the RAID, it has no idea of the layout of the array, essentially treating it as a single spindle.

Now, those couple of thousand request that were just issued, contain sequences such as writes to sectors 10, 200, 50, 300, 1020, 600, 1500 and 700. Due to the striping, four of these can be executed simultaneously, so the optimal order to issue these, of course depending on what else might be going on, is something like 10, 200, 300, 1500, 50, 700, 1020, and 600, executed through four queues: [10, 50, 600], [200, 700], [300] and [1020, 1500]. In the LVM configuration this might be what really happens. However, the single I/O queue to the RAID device will have these sorted into ascending block order, and with enough such operations in the queue, the RAID processor no longer has enough view to the queue to efficiently re-re-order them to utilize all the spindles, so only some of them are hot at any given time. TCQ should help, but in practice it won't issue enough outstanding requests to fix the problem. In our experience the top sustained rate is not more than 1.5 times one spindle, or 300-400 requests per second, while the array should really run at over the 1000 ops per second thanks to the additional persistent cache on the RAID controller.

Bottom line: CFQ is great, but only if the kernel actually knows everything about the physical layout of the media. It also looks like some of the recently introduced tuning parameters (which I know nothing about, just noted their appearance) might help avoid the worst hit. However, ultimately it doesn't matter - if your hardware allows efficient "outsourcing" of the I/O scheduling to a large secure cache, use it, and don't bother making the kernel do the job without all the information.

I hope this explanation makes sense, and that I haven't botched any important details or made wrong assumptions. Please comment if any of this is inaccurate.

PS. A tuning guide for Oracle recommends the deadline scheduler due to latency guarantees. We have not benchmarked that against noop.

Sunday 7 October 2007

MySQL and materialized views

I'm working on alternative strategies to make the use and maintenance of a multi-terabyte data warehouse implementation tolerably fast. For example, it's clear that a reporting query on a 275-million row table is not going to be fun by anyone's definition, but that for most purposes, it can be pre-processed to various aggregated tables of significantly smaller sizes.

However, what is not obvious is what would be the best strategy for creating those tables. I'm working with MySQL 5.0 and Business Objects' Data Integrator XI, so I have a couple of options.

I can just CREATE TABLE ... SELECT ... to see how things work out. This approach is simple to try, but essentially unmaintanable; no good.

I can define the process as a BODI data flow. This is good in many respects, as it creates a documented flow of how the aggregates are updated, is fairly easy to hook up to the workflows which pull in new data from source systems, and allows monitoring of the update processes. However, it's also quite work intensive to create all those objects with the "easy" GUIs in comparison to just writing a a few simple SQL statements. There are also some SQL constructs that are horribly complicated to express in BODI; in particular, COUNT(DISTINCT ..) is ugly.

Or I could create the whole process with views on the original fact table, with triggered updates of a materialized view table in the database. It would still be fairly nicely documentable, thanks to the straightforward structure of the views, and very maintanable, as the updates would be automatic. A deferred update mechanism with a trigger keeping track of which part of the materialized view needs update and a periodic refresh over a stored procedure would keep things nicely in sync. MySQL 5.0 even has all of the necessary functionality.

Except.. It's only there in theory. The performance of views and triggers is so horrible that any such implementation would totally destroy the usability of the system. MySQL's views only work as statement merge when there is a one-to-one relationship between base table and view rows, or in other words, the view can not contain SUM(), AVG(), COUNT() or any of the other mechanisms which would have been the whole point of the materialized view in question. It will fall back to a temp table implementation in these cases, and creating a GROUP BY temp table over 275 million rows without using the WHERE BY clause is pure madness.

In addition, defining any triggers, however simple, slow bulk loads to the base tables by an order of magnitude. I could of course still work around triggers by implementing the equivalent logging in each BODI workflow and create the materialized views and a custom stored proc to update each one, but having a view there in between was the only way to make this approach maintainable. Damn, there goes that strategy.

Wednesday 29 August 2007

My Top 5 wishlist for MySQL

I (belatedly) noticed a meme running on Planet MySQL regarding wishlist items for the company. I think it started with Jay Pipes and Mårten Mickos, but has since moved on to users. In particular, I'd endorse most of Jeremy Cole's and Ronald Bradford's wishes myself as well.

But let me jump on the bandwagon and offer my view of the things that would most help us run and develop our services.

1. Online table changes. Ronald mentioned this as well, but I have to emphasize this more: for all the good that InnoDB did in terms of eliminating table-level locks for INSERTs and UPDATEs, it has caused almost more pain for us in terms of locks during schema updates. Until you've tried it yourself, you can't imagine the pain of running an ALTER TABLE ADD INDEX on a 150-million row table during a routine application upgrade that would otherwise be over in 10 minutes...

2. Reliable baseline functionality, including replication. This is the big one: MySQL 5.0 improved many things, but one thing it didn't improve is dependability – version 4.1 was far more stable. Maybe we're pushing the boundaries, but when you're managing 20+ database servers and several terabytes of data, what you don't want to start your days with is a check of which slaves have stopped updating and which processes you need to restart and scheduled maintenance jobs to run again.

3. A smarter query planner. In addition to our normal application OLTP-style development, we're currently busy working on a multi-terabyte DWH project. MySQL happily proceeds to execute three-table JOIN queries sequentially scanning 200 million row facts when it can't quite figure out whether a query selects 20 or 200 rows from a dimension table. Oops. Please come back after lunch...

4. Index assistance. While most of Microsoft's wizards are not very useful, SQL Server has for years had one that is really nice; one that captures all queries against a database and evaluates whether new indexes would improve performance, and whether existing indexes are helping. It's really cumbersome to do this with the slow_query_log and analysing EXPLAIN output.. Especially since that output isn't all that detailed (see previous)...

5. Runtime-changeable InnoDB and logging parameters. Sometimes finding the root of a performance issue is a hit-and-miss job of looking for slow queries, parallel updates, buffer pool settings, etc parameters, but many of the settings you need to try require a restart to take effect. Not only does that make it impossible to even contemplate fixing something in a live installation, running benchmarks in a test deployment is also a slow and cumbersome undertaking.

Some earlier time I might have mentioned on this list also integrated hot backups – today it will only make it as a runner up. Not that the standard tools are any better, but because we've developed a mechanism that works on top of storage system snapshots, we can deal with it anyway. No, mysqldump really doesn't do the trick, thanks. Not when you're talking of databases in the range of hundreds of gigs. Another one that I might have wanted to mention was partitioned tablespaces, but I guess when 5.1 eventually becomes GA, I get to offer an opinion on that...

Wednesday 2 May 2007

Good presentations on scaling websites

Peter Van Dijck has collected a bunch of good presentations about developing scalable websites to his blog. Good stuff there, Cal Henderson in particular is always worth a read. I really should sometime collect some of our own learnings about the topic, in particular regarding how scaling challenges change when you're dealing with lots of interactive users with real-time messaging. It certainly has entirely different bottlenecks compared to a typical web site - but the profile of a typical web site is migrating in that direction, too.

Thursday 18 January 2007

First thoughts regarding the MySQL Falcon storage engine

One of my DBA colleagues mentioned that MySQL has released the first alpha version of the Falcon storage engine, which is advertised to most efficiently utilise modern hardware to provide a high-performance scalable replacement for InnoDB, which MySQL naturally tries to reduce dependency of. 

Unfortunately, just based on reading the Falcon documentation, I must draw the conclusion that without extensive further development, it won't be usable for very large installations such as the ones we run for Habbo for a number of reasons. I'm usually much more positive about MySQL, it after all being technology that has enabled Habbo to grow more than 100% every year I've been working on it, but this is a disappointment.

It supports just one tablespace per database, and each tablespace stores all data in a single file. While the concurrency problems of single-file access can be eliminated with careful application of modern kernel, filesystem and disk subsystem technology, single file databases still suffer from major administration issues.

Since a database can't be extended by additional tablespaces and data migrated by the storage engine, you'd better trust your capacity to indefinitely increase available storage space under one filesystem or downtime can't ever become a problem for you. Don't even think about deploying Falcon without a high-end NAS device that supports many times your current storage requirements, reliable logical volume management and an extendable file system. A database is also limited by the filesystem's maximum file size, so make sure that won't be a problem either. I wouldn't recommend ext3 for Falcon.

You'll also need to make backups either via SQL dumping the entire database (not really feasible for daily routine) or by backing up a single file, so either your filesystem, LVM system or storage device must support snapshot backups. Scalability may still become an issue, so be sure that the approach you choose doesn't degrade performance as file size grows.

Just one thread writing to disk may at first blush sound like excellent performance maximisation technique, but it forces you to make a choice between reliability (since it applied to log writes too, transactions are committed to disk in a serialized fashion - no concurrency) and scalability ("commits" to ram cache and background disk flushes certainly will perform well and scale nicely, but what if there's a power failure?). And this is not even the road to highest possible performance - the highest-end disk subsystems will become CPU limited if only one thread will be able to send I/O requests.

With one table space comes one cache/buffer pair, so developers are either forced to split their data model to multiple logical databases or suffer under one unpartitionable system where one bad table scan by one part of the application wipes the buffers from underneath the entire application. A truly modern storage system permits the DBA to assign certain tables or indices to their own caches and buffer spaces and retain a single logical model for software developers. MySQL has never had this ability, and apparently Falcon won't bring it, either.

A more traditional DBA might also cringe at the statement "it is impossible to predict or calculate the disk storage space required for a specific dataset." Many, many complaints could be made about the alpha-release's other restrictions, but I'll give MySQL the chance to keep their promise to address them in forthcoming versions.

I don't really understand which of its features qualify it as technology that utilises modern computers to the best possible effect. Perhaps they're referring to it automatically compressing data on disk? Sure, that may be useful, but it may just as well become a bottleneck when single-row updates require entire pages to be recompressed. Just that feature alone doesn't impress me. It's not more easily administrated, nor does it (on paper at least) address this kind of performance issues. At best, it's an upgrade to MyISAM, but shouldn't be mistaken for a solution to high-performance transactional database requirements.

More on it once I've had a chance to do some practical experimentation (might be a while).

Update: It seems Peter Zaitzev has benchmarked Falcon against MySQL's other storage engines, verifying my suspicion that it doesn't scale properly. Do note that neither MyISAM nor InnoDB show ideal scaling performance either.