Fishpool

To content | To menu | To search

Tag - MySQL

Entries feed - Comments feed

Tuesday 2 December 2008

MySQL 5.1 continues community release madness

So MySQL 5.1 is declared GA, contains enough bugs to make Monty rail against it publically, doesn't (as far as I've been able to gather) have a stable cross-release pluggable storage engine ABI (despite that being a major release feature -- meaning you have to have new plugin releases for each MySQL server release before upgrading anything), and continues the madness of less frequent community releases and less stable enterprise releases against the wishes of every sensible enterprise customer and (I would guess) every community member.

I know enough about running MySQL to know that Monty's comments don't necessarily reflect much on the practicalities of someone trying to run something using MySQL rather than developing it, but the rest of the noise about this release definitely makes me want to seriously reconsider a policy of using MySQL Enterprise version and look at a real community-released version instead for production use.. There's just a "small" question of figuring out which community..

Wednesday 19 November 2008

Looking for a ETL engineer for our BI team

So, I mentioned earlier that I was looking at Infobright's Brighthouse technology as a storage backend for heaps and heaps of traffic and user data from Habbo. Turns out it works fine (now that it's in V3 and supports more of the SQL semantics), and we took it into use. Been pretty happy with that, and I expect to talk more about the challenge and our solution at the next MySQL Conference in April 2009.

However, our DWH team needs extra help. If you're interested in solving business analytics problems by processing lots of data and the idea of working in a company that leads the virtual worlds industry excites you, let us know by sending us an application. Thanks for reading!

Monday 15 September 2008

Infobright BI tools go open source

I've mentioned Infobright before as an interesting solution to getting more performance to BI analytics solutions. Today's news are interesting: Sun invests in the company, and the baseline product is open sourced. Too busy to write more about it today, but I'm certainly watching this one closely.

Saturday 10 May 2008

RIght move, MySQL

Again a week late, but hey, I only need to keep up with this stuff, not comment on it all the time. MySQL changed their minds and turns out the core server will continue to be open source, allowing customers to depend on being able to inspect it if required, extend on any bit as needed, and most importantly, get the benefits of a large community using and testing all features. Thanks for that. I just hope you're going to be consistent about this, for precisely the reason that as a MySQL Enterprise customer, I don't pay you to deliver bits that haven't received that community testing, but to rapidly fix problems if they exist despite that exposure.

It was interesting to hear Monty Widenius comment about it in this week's Open Tuesday event, and I also got to talk to him about attending a MySQL Users session in Helsinki next time I or someone else (anyone? anyone? Bueller?) manage to organize one. Would be nice to hear about the upcoming storage engines straight from the horse's mouth - Monty's Maria effort has certainly been less covered than the Falcon engine I have also commented on, and I can't say to know anything about it myself.

Tuesday 22 April 2008

MySQL Users Conference followup and MySQL's business model

Last week saw MySQL User Conference 2008 in Santa Clara, but I was not able to make time for it this year either. However, in the wake of Sun's acquisition of MySQL, it was very interesting to follow what was going on. A few things that caught my attention:

MySQL 5.1 is nearing General Availability and an interesting storage engine plugin ecosystem starts to emerge. It's this latter, but related event that I see as the first real sign of validation for MySQL's long-ago chosen path of pluggable storage systems instead of focused effort on making one good general-use engine.

Oracle/Innobase announced InnoDB Plugin for MySQL 5.1, which much-awaited features which promise a great deal of help for daily management headaches. More than that, InnoDB Plugin's release under GPL lifts quite a lot of the concern I'm sure many users like us have had about the future viability of InnoDB as MySQL storage engine.

A couple of data warehousing solutions are launched, also based on MySQL 5.1 -- Infobright is one I've already researched somewhat (looks very interesting, as soon as a few current limitations are lifted), Kickfire I know nothing about right now but would love to learn more of.

There's a huge amount of coverage graciously provided by Baron Schwartz that I have yet to fully browse through.

A few remarks by Mårten Mickos regarding MySQL's business model seem to have kicked up a bit of a sandstorm. I don't really understand why; I read these to just verify that the direction MySQL took last year is to continue this year as well. I don't see any major changes here regarding the licensing structure, software availability, or support models. Frankly, it seems like yet another case of Slashdot readers not reading, let alone understanding, what they're protesting against, and press following up on the noise.

I do understand the critique made against MySQL's chosen model, though. In fact, I went on record last September to say that I understand that critique. I still see the same issues here. I believe we represent a fairly common profile of a MySQL Enterprise customer in that what we want from it is not the bleeding-edge functionality but a stable, well-tested product that we can expect to get help for if something does go wrong. We don't see great value in having access to a version of software that isn't generally available to "less advanced" or more adventurous users for free in a community version. In fact, we see it as a negative that such functionality exists, because it hasn't received the community testing, feedback and improvements that makes great open source software as good as it is. While new functionality is interesting, and we're trying to spend time getting familiar with new stuff in order to use it in production later, it simply isn't prudent to put business-critical data in a system that hasn't received real-world testing by as large a community as possible (unless you have no other alternative, and then you takes your chances).

Yet it seems to me that this is essentially what Sun/MySQL continue to propose for the Enterprise customers by delivering "value add" functionality in a special version of the server or plugins to it, possibly in a closed-source form that further reduces transparency and introduces risk. Mårten, I'd prefer it to be otherwise. How can I help you change your mind about this?

Monday 31 March 2008

Optimizing Linux for random I/O on hardware RAID

There's a relatively little known feature about Linux IO scheduling that has a pretty significant effect in large scale database deployments at least with MySQL that a recent article on MySQL Performance Blog prompted me to write about. This may have an effect on other databases and random I/O systems as well, but we've definitely seen it with MySQL 5.0 on RHEL 4 platform. I have not studied this on RHEL 5, and since the IO subsystem and the Completely Fair Queue scheduler that is default on RHEL kernels has received further tuning since, I can not say if it still exists.

Though I've heard YouTube discovered these same things, I have not yet seen a simple explanation of why this is so - so I'll take a shot at explaining it.

In short, a deployment with a RAID controller or external storage system visible to the operating system as a single block device will not reach its maximum performance under RHEL default settings, and can be easily coaxed about 20% higher on average random I/O (and significantly higher in spot benchmarks) with a single kernel parameter (elevator=noop) or equivalent runtime tuning via /sys/block/*/queue/scheduler in RHEL5, where you can also set this on a per-device basis.

We first saw this in 2005 on a quad-CPU server with a RAID controller connected to 10 SCSI disks. At that time, we found that configuring the RAID to expose five RAID-1 pairs which we then striped to a single volume using LVM increased performance despite making the OS and CPU do more work on I/O. The difference in performance was about 20%.

Our most recent proof of the same effect was a quad-CPU server connected to a NetApp storage system over FC. Since it was not convenient to expose multiple volumes from the NetApp to stripe them together, we searched for other solutions, and prompted by a presentation by the YouTube engineers looked at the I/O scheduling options and found a simple way to improve performance was to turn off I/O reordering by the kernel. Again, the overall impact between the settings was about 20%, though at times much greater.

The lesson is simple: reordering I/O requests multiple times provides no benefits, and reordering them too early will in fact be detrimental. Explaining why that is so is a bit involving, and is based on a few assumptions we have not bothered to verify, since the empirical results have supported our conclusions and got us where we wanted.

In order to keep the explanation simple, I will describe it conceptually on a very small scale. When reading this, please take this into account and understand that to measure the effect we have seen in practice, the size of the solution should be increased from what I am describing.

First, consider the case of direct-attached storage exposed to the Linux kernel as independent devices. In this configuration, the kernel maintains a per-device I/O queue, and the CFQ scheduler will reorder I/O requests to each device separately in order to maintain fair per-process balancing, low latency and high throughput. This is the configuration in which CFQ does a great job of maximizing performance, and works fairly well with any amount of spindles. As the application (a database in this case) fires random I/O, each of the spindles is executing them independently and serves requests as soon as they are issued. In other words, the system is good at keeping each of the I/O queues "hot". The sustained top I/O rate is roughly linear to the number of spindles, or with 15k rpm drives, about 1000 ops for four drives.

Now, lets introduce a hardware RAID of some sort, in particular one which is enabled to further reorder operations thanks to a "big" battery-backed up cache. Thanks to that cache, the RAID can commit thousands of write operations per second for fairly long periods (seconds), flushing them to disk after merging. On the other hand, the kernel now sees just one device, and has one I/O queue to it. The CFQ scheduler sits in front of this queue, reordering pending I/O requests. All is fine until the I/O pressure rises up to about what a single spindle can process on a sustained basis, or about 250 requests per second on those 15k drives. However, as soon as the queue starts building up, the CFQ scheduler kicks into action, and reorders the queue from random to sorted per block number (an oversimplification, but close enough).

All is good? No, it's not. The sequential blocks on that RAID volume are not truly sequential, but reside on different spindles and could thus be processed simultaneously. To demonstrate, lets assume your four-spindle array has one billion sectors or five hundred gigs per device, and further, that it is striped at 64k extents or 7.8 million stripes across each device.

On both configurations, the striping is essentially the same. Every 128 sectors or 64k is one one device, then the next one, and so on. The difference is that with LVM in place, the kernel knows this, while with the RAID, it has no idea of the layout of the array, essentially treating it as a single spindle.

Now, those couple of thousand request that were just issued, contain sequences such as writes to sectors 10, 200, 50, 300, 1020, 600, 1500 and 700. Due to the striping, four of these can be executed simultaneously, so the optimal order to issue these, of course depending on what else might be going on, is something like 10, 200, 300, 1500, 50, 700, 1020, and 600, executed through four queues: [10, 50, 600], [200, 700], [300] and [1020, 1500]. In the LVM configuration this might be what really happens. However, the single I/O queue to the RAID device will have these sorted into ascending block order, and with enough such operations in the queue, the RAID processor no longer has enough view to the queue to efficiently re-re-order them to utilize all the spindles, so only some of them are hot at any given time. TCQ should help, but in practice it won't issue enough outstanding requests to fix the problem. In our experience the top sustained rate is not more than 1.5 times one spindle, or 300-400 requests per second, while the array should really run at over the 1000 ops per second thanks to the additional persistent cache on the RAID controller.

Bottom line: CFQ is great, but only if the kernel actually knows everything about the physical layout of the media. It also looks like some of the recently introduced tuning parameters (which I know nothing about, just noted their appearance) might help avoid the worst hit. However, ultimately it doesn't matter - if your hardware allows efficient "outsourcing" of the I/O scheduling to a large secure cache, use it, and don't bother making the kernel do the job without all the information.

I hope this explanation makes sense, and that I haven't botched any important details or made wrong assumptions. Please comment if any of this is inaccurate.

PS. A tuning guide for Oracle recommends the deadline scheduler due to latency guarantees. We have not benchmarked that against noop.

Saturday 29 March 2008

Falcon database engine in MySQL 6.0 alpha

A year ago, I criticized the under-development Falcon storage engine in MySQL 6.0 of failing to meet the demand of large-scale deployments. Falcon has now reached a beta phase and is included in the MySQL 6.0 alpha versions, most recent release of which is 6.0.4 this February. We're thinking of making an early test of Falcon in place of MyISAM/InnoDB for Habbo to see what to expect later on, so I reviewed the documentation again, and thought to look at my concerns from a year ago.

Falcon now supports multiple tablespaces per database, although the corresponding manual page still begins with the unfortunately misleading sentence of "all data ... is stored within a single file", and goes on to correct itself in the second paragraph. To ease volume management, it also allows ALTER TABLE to migrate tables from one tablespace to another, but these are not online operations, so to transfer very large databases, an online backup + offline restore followed by binlog apply and switchover would produce a lower downtime. One tablespace does not (yet?) support multiple data files, so there's still some concern over performant and reliable storage of extremely large single tables.

Not directly related to (only) Falcon, MySQL 6.0 will have BACKUP/RESTORE DATABASE functionality, that at least on paper tries to minimize downtime. However, with the current beta version, online backups are not supported for Falcon. It'll be interesting to see whether this will eliminate need for hairy and failure-prone custom backup solutions in the future.

A read of the threading and commit model of Falcon still leaves me wondering whether really-high-end storage systems and >8 core systems are going to be fully utilized under strenuous I/O. While each execution thread schedules I/O to the serial commit log, only one thread manages the writes of committed data to the data files in order to free up space in the log. As I mentioned before, I/O systems exist that simply can not be fully utilized by just one CPU doing the random access work - witness also degraded performance of same I/O systems with Linux 2.6 CFQ elevator's single sorted I/O queue and the 25% higher throughput with a random-order queue achieved by no-op elevator, ie, optimizing too much to avoid random access can hurt you with large tertiary caches or SSD storage. Still, my impression this time is much better than on the first read of the alpha docs a year ago.

That leaves cache management as a big drawback compared to Oracle, DB2 and the like. The Falcon engine in one MySQL instance has only one index/record cache across all tablespaces, meaning that one bad query causing a table scan will still be able to wipe out all cache/buffers from the entire system, bringing performance to halt for all users.

Foreign keys are not yet supported, either, so a full replacement for InnoDB can not be tested at this point. Persplexingly, these are not mentioned in the GA roadmap either.

On positive note, I'm glad to see Falcon will collect performance metrics to the information schema for flexible access.

Wednesday 16 January 2008

Congrats, MySQL!

Today we read the news that Sun is acquiring MySQL. As MySQL users and customers, it will take a bit more time for us to digest what to think of this, but for MySQL, it's certainly a good achievement, and I want to be among the first to congratulate Monty, Mårten and all other dedicated MySQL folk on this. I hope to hear soon about the sort of development and administration tools Sun is know for to reach also MySQL users, and if that turns out to be the path, then as customers we can happily welcome this event. I welcomed before Sun's open sourcing of Java, another key technology for us, and now we find our primary database solution to be under the same roof.

Update: wow, what a reaction from the stock market today. Sun paid $800M in cash plus $200M in stock options to acquire MySQL, and as a result, its own share (JAVA) went up by 6% or an otherwise very mixed market day.. So they gained that billion back in market cap one in one day. In effect, Sun got MySQL for free. Great move all around :)

I've also had a bit more time to think about the implications for us, and while I certainly want to hear more about the plans, my feeling is that this is probably a good thing for us as a customer as well. MySQL will definitely benefit from Sun's engineering and QA experience - as long as they keep in mind that rapid innovation is going to be expected and required of them, as always.

Monday 22 October 2007

MySQL Helsinki meet, how to back up MyISAM + InnoDB using free tools

Two weeks ago we had the first small get-together of MySQL users in Helsinki. About a dozen people attended, so we had a good chance to talk shop in detail. It would be interesting to expand this group further, plus I know several people who'd probably have interesting stuff to share but couldn't make it. So, we'll have to do this again. Any volunteer organizers?

Our DBAs had put together a bit of material that I spoke about, since they couldn't make it this time. I promised to post it here, but I've been really busy with other stuff, so here goes - better late than never.

How to back up MySQL using free tools

As far as production backups go, mysqldump doesn't really cut it. Too slow, too much load to the database, locks too many things, etc. Innobackup is good, but sometimes you don't want to invest in the licenses. Here's how to do it with free tools only, Linux + LVM (applicable to other environments).

The good:

  • It’s free
  • Works with InnoDB and MyISAM
  • Requires only split second lock for database
  • Backup copy is complete copy of database files
  • Very straightforward approach

The bad:

  • Can only do full backup.
  • Partial database recovery is not possible.
  • Requires split second lock for database. Do not have long queries running when starting backup.
  • Performance is decreased when snapshot is on. Copying hundreds of gigabytes of database files takes long time.
  • Linux LVM is not perfectly stable in high I/O load. Running out of LVM snapshot allocation halts the volume entirely.

How to:

  1. If database is “large” take backup from replica (any big database should afford to have a replica!)
  2. Stop slave! Relay-logs are very easily corrupt and LVM snapshot will corrupt them
  3. Flush tables with read lock (see below)‏
  4. Take snapshot of database volume with lvcreate -s
  5. Release lock from step 3
  6. Mount the snapshot
  7. Copy data to safe place (for transferring to tapes, if that's your desire)
  8. Unmount snapshot and remove it (keeping the snapshots around in production is a bad idea!)
  9. Backup is done
Flush tables with read lock doesn't properly prepare InnoDB for the backup. You have to do it from within a transaction that writes to InnoDB log files in order to get everything consistent:
-- Transaction is needed to fool InnoDB,
-- otherwise InnoDB logs might be corrupt in backup copy.
DROP TABLE IF EXISTS tmp_backup_dummy;
CREATE TABLE tmp_backup_dummy(a INT) TYPE=INNODB;
SET AUTOCOMMIT=0;
INSERT INTO tmp_backup_dummy VALUES (1);
-- Freeze things.
FLUSH TABLES WITH READ LOCK;
COMMIT;

This is fairly trivially set up as a script that can be executed in cron. The backup is a set of the actual binary database files, which can be placed in a new MySQL instance that will recover the InnoDB log files on startup and is a bit-for-bit copy of the original database. You can even continue to apply binary logs on top of this in order to do point-in-time recovery, if you make the backup script record the binlog position. Perhaps we'll share the actual script we use later after cleaning it up some.

Sunday 7 October 2007

MySQL and materialized views

I'm working on alternative strategies to make the use and maintenance of a multi-terabyte data warehouse implementation tolerably fast. For example, it's clear that a reporting query on a 275-million row table is not going to be fun by anyone's definition, but that for most purposes, it can be pre-processed to various aggregated tables of significantly smaller sizes.

However, what is not obvious is what would be the best strategy for creating those tables. I'm working with MySQL 5.0 and Business Objects' Data Integrator XI, so I have a couple of options.

I can just CREATE TABLE ... SELECT ... to see how things work out. This approach is simple to try, but essentially unmaintanable; no good.

I can define the process as a BODI data flow. This is good in many respects, as it creates a documented flow of how the aggregates are updated, is fairly easy to hook up to the workflows which pull in new data from source systems, and allows monitoring of the update processes. However, it's also quite work intensive to create all those objects with the "easy" GUIs in comparison to just writing a a few simple SQL statements. There are also some SQL constructs that are horribly complicated to express in BODI; in particular, COUNT(DISTINCT ..) is ugly.

Or I could create the whole process with views on the original fact table, with triggered updates of a materialized view table in the database. It would still be fairly nicely documentable, thanks to the straightforward structure of the views, and very maintanable, as the updates would be automatic. A deferred update mechanism with a trigger keeping track of which part of the materialized view needs update and a periodic refresh over a stored procedure would keep things nicely in sync. MySQL 5.0 even has all of the necessary functionality.

Except.. It's only there in theory. The performance of views and triggers is so horrible that any such implementation would totally destroy the usability of the system. MySQL's views only work as statement merge when there is a one-to-one relationship between base table and view rows, or in other words, the view can not contain SUM(), AVG(), COUNT() or any of the other mechanisms which would have been the whole point of the materialized view in question. It will fall back to a temp table implementation in these cases, and creating a GROUP BY temp table over 275 million rows without using the WHERE BY clause is pure madness.

In addition, defining any triggers, however simple, slow bulk loads to the base tables by an order of magnitude. I could of course still work around triggers by implementing the equivalent logging in each BODI workflow and create the materialized views and a custom stored proc to update each one, but having a view there in between was the only way to make this approach maintainable. Damn, there goes that strategy.

- page 1 of 2