Fishpool


Wednesday 27 May 2009

What we're looking for in a data integration tool

As our data warehousing process grows and the workflows get more complex, we've revisited the question of what tools to use in this process. Out of curiosity, I had a look at basing such a process on Hadoop/Hive for scalability reasons, but the lack of mature tools and the sacrifices on efficiency that would entail meant we're better off using something else until a distributed processing platform is the only thing that can get the job done. I'm also curious about the transition to continuous data integration, a model I noticed showing up a couple of years ago and now getting some air under its wings as CEP, IBM's Infosphere Streams, and other similar approaches. Still, I think I'll continue to rely on something else for a while and see how things shake out. Continuous integration clearly is the future, but there are many ways to get there.

So, we had a look at what's going on in the Open Source data integration field. It seems the leaders in that field are Pentaho with Kettle/Pentaho Data Integration, and Talend with Open Studio and Talend Integration Suite. Both seem pretty even in terms of features. Both companies are a bit difficult to approach as a potential customer, so I figured I should also try what would come up from the OSS approach of just posting my thoughts on the Interweb ;)

Besides the technical pilot implementations we've made to compare the basic workflow of the various tools, below is a sample of the kind of questions we're considering when evaluating the suitability of the tools.

Product roadmap, release schedule and size of the development team

  • How often, and with what scope of changes, should we expect and prepare for platform upgrades?
  • Past track record on keeping to a regular updates schedule

Data lineage and dependency, Impact analysis

  • How to find out which tables are being used for deriving DWH dimensions and facts?

Logging, auditing, monitoring on row and job level

  • How to monitor and archive workflows on a row level (amount of rows being inserted/updated/deleted)?
  • How to maintain, access and query a job execution history (start time/end time/return code)?

Version control

  • How to track and restore changes in jobs?

Multi-user environment

  • How can several developers work together?

Change Data Capture

  • How to assist incremental loads?

Data profiling

  • How can a data source be examined?

Job recovery

  • How to recover from possible failures in jobs (such as lost database connection)?

Deploy jobs

  • How to move jobs from one repository to another (development to testing to production)?

Sunday 24 May 2009

Hello, MySQL 6.0, err, something

I'm conflicted about the latest twist of the MySQL release saga, i.e. the announcement of the 6.0.11 alpha version and the accompanying note that it's the last 6.0 release and will be replaced by the already discussed milestone model. From an engineering point of view, I think this is the right step, though I'm not sure, because I can't really tell exactly which engineering model was chosen: trunk-first, then backport, or fix-in-releases, then forward port. I also can't tell whether the milestone model is going to be timeboxed or feature-scoped. Personally, I would prefer to see the former of both alternatives.

From a customer point of view, I'm even more confused, though much less concerned. Okay, so 6.0 won't become the marketing version number of any MySQL Enterprise release? Doesn't matter. 5.4 needs to come out first anyway, preferably sooner with a concrete, well-tested feature set, than later with more planned-but-unfinished features stuffed in it. What the release after that is going to be called makes no difference to me, as long as it's also going to contain solid improvements and comes out on a predictable schedule that doesn't force me to look for something drastically different in order to deal with scale.

That being said, it's still weird. So if the thought of a 6.0 GA release is scrapped, why release anything and still call it 6.0? I guess it's just tying up loose ends, but that's an engineering thing, and only the number of existing source branches with stuff to merge together matters, not the version number put to it...

Tuesday 12 May 2009

Confusing Sun communication about MySQL 5.4

Just received an email newsletter from Sun titled "MySQL 5.4 Preview Release" which states:

Sun Microsystems recently released MySQL 5.4, delivering performance and scalability improvements enabling the InnoDB storage engine to scale up to 16-way x86 servers and 64-way CMT servers.

MySQL 5.4 also includes new subquery optimizations and JOIN improvements, resulting in 90% better response times for certain queries.

Apparently, the confusion about the contents of the release I wrote about earlier continues to reign inside Sun as well. MySQL 5.4 has not been released by any reasonable meaning of the word, since there's "only" a preview available at this time. Compare this to Windows 7: that's already a Release Candidate, but it has not been released. Also, the preview release available includes neither new subquery optimizations nor JOIN improvements. Having planned such improvements doesn't count.

As I wrote earlier, the best of the rather bad excuses for the release labeling offered to me was that Sun wanted to avoid confusion by not releasing many versions at once. I think that got replaced (and then some) by plenty of extra confusion about when and what was released, instead. Sorry, no good. Try again, 'kthxbye.

Monday 4 May 2009

What does Oracle mean for Java?

Over the past two weeks I've been mostly focused on MySQL, but the big-ticket item in the Sun/Oracle deal is not databases, it's Java. However, it's also the domain which is far harder to predict. It was a big deal when Sun decided to open source Java, but the fact of the matter is that the first fully open source release isn't out yet, and Sun has been keeping the testing and certification kit off-limits for open source communities. This means it would still be far too easy for OpenJDK to be killed off.

I've been keeping clear of Oracle for several years, and can't even begin to guess what their position on this is. Oracle has been a pretty active contributor to Linux in particular for several years, and I'm sure their open source strategy and how it works together with their business is pretty well established within at least the engineering parts of the company. At the same time, their notoriously aggressive market tactics make sure that everyone's wary of their next move. Java is a huge part of Oracle's business, and after they purchased BEA, I wouldn't be surprised if Oracle were already the biggest Java company (in terms of revenue), ahead of both Sun and IBM. After completing the Sun acquisition, that'll be guaranteed.

That's a big balance shift for the overall Java community. Now, Oracle is a smart company. My worry is they might emphasize short-term tactical market advantage (owning all of Java, JRockit, Glassfish and WebLogic to compete against other middleware and business applications) over the long-term strategic benefit of a unified platform competing with .NET and the host of open source platforms from PHP and Ruby to Python. With such a wide field, following up on and improving the open source platform process would be the right thing to do - and it would help me :)

Thursday 30 April 2009

The difference between conversion and retention

Picked up a piece of analysis today from my newsfeed regarding Twitter audience. Nielsen has posted information about Twitter's month-to-month retention (40%) and compared that to Facebook's and MySpace's. Pete Cashmore over at Mashable promptly misread the basic information and came to an entirely wrong conclusion about the stats, titling his post about it as "60% quit Twitter in the first month". A simple misunderstanding of basic audience analysis like this is the crucial difference between explosively growing traffic and a failure. That's a fail for you, Pete.

What's wrong? Well, retention is a separate matter from conversion. 40% conversion from a trial registration to being a continuing active user to the second month would not be a bad conversion rate. It's not stratospherically great, I've seen better, but I wouldn't be terribly unhappy about such a figure. However, Nielsen didn't say anything at all about first-to-second month conversion. This is what they DID say: "Twitter’s audience retention rate, or the percentage of a given month’s users who come back the following month, is currently about 40 percent."

That's pretty plain English when you take the time to read it. Month to month, regardless of visitor lifetime, not first to second month. On this metric, 40% retention is not good at all, and will definitely be a limiting factor to Twitter's traffic and audience size over time, just as the Nielsen article points out (and shows the math for). For any given retention rate, there is a certain maximum audience reach beyond which new traffic can't overcome the leaving base, since new traffic is not an inexhaustible supply.
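The ceiling argument can be sketched in a few lines of Python. This is my own illustration rather than Nielsen's calculation, and it assumes a constant number of new users every month, which is generous given that new traffic eventually runs out. The function names and the one-million-per-month figure are made up for the example.

```python
def simulate_audience(retention, new_users_per_month, months, start=0.0):
    """Return the audience size for each month under a fixed retention rate."""
    audience = start
    history = []
    for _ in range(months):
        # Returning users from last month plus this month's newcomers.
        audience = retention * audience + new_users_per_month
        history.append(audience)
    return history


def audience_ceiling(retention, new_users_per_month):
    """Steady-state audience: the fixed point of A = retention * A + N."""
    return new_users_per_month / (1.0 - retention)


if __name__ == "__main__":
    # Hypothetical inputs: 40% retention, one million new users a month.
    history = simulate_audience(0.40, 1_000_000, 24)
    print("audience after 24 months:", round(history[-1]))
    print("theoretical ceiling:     ", round(audience_ceiling(0.40, 1_000_000)))
```

Under these assumptions the audience levels off at roughly 1.67 times the monthly intake of new users, no matter how long the service runs. Raising retention is the only way to lift that multiplier.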

And since today is a busy day, that concludes the free startup advice. Take the time to understand the difference between these metrics, you'll thank yourself for it later.
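For anyone who wants to see the two metrics side by side, here is a minimal sketch in Python. The user names and the April/May split are invented for illustration, and the function names are my own, not anything Nielsen or Twitter publishes.

```python
def retention_rate(active_this_month, active_next_month):
    """Share of ALL of this month's users who come back next month."""
    if not active_this_month:
        return 0.0
    returning = active_this_month & active_next_month
    return len(returning) / len(active_this_month)


def first_month_conversion(new_this_month, active_next_month):
    """Share of this month's NEW registrations still active next month."""
    if not new_this_month:
        return 0.0
    converted = new_this_month & active_next_month
    return len(converted) / len(new_this_month)


# Hypothetical data: five active users in April, two of them new that month.
april_active = {"ann", "bob", "cat", "dan", "eve"}
april_new = {"dan", "eve"}
may_active = {"ann", "eve", "fay"}

if __name__ == "__main__":
    print("retention: ", retention_rate(april_active, may_active))
    print("conversion:", first_month_conversion(april_new, may_active))
```

With this made-up data, month-to-month retention is 2 out of 5 while first-month conversion is 1 out of 2: same users, same months, different numbers, because the denominators differ.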

Tuesday 28 April 2009

The MySQL community outlook

While I cannot consider myself a member of MySQL's community of developers, I've been watching those developments the same way I follow the development of Linux and many of the Java and Apache projects our own services depend on. It was great to meet many of the core members of the development community and get some insight into their thoughts about the future.

Baron Schwartz called in his Percona Performance Conference keynote on Thursday for a new, active MySQL community to take the driver's seat in the development of the database, not just in the incremental way of bug fixing and performance improvement, but also by setting a vision for the next generation MySQL. It's a call to action greatly needed, and an important one despite the active existence of the Drizzle project. This is because while Drizzle already has a vision for the future, it's a radical departure for the MySQL userbase and one which will not necessarily have a smooth upgrade path. Many of the same MySQL users feeling most of the pain of MySQL's current limitations are also those who will not be able to easily upgrade to a radically different architecture due to the amount of data and dependencies in their existing infrastructure.

It's a gap which needs a careful approach of incremental changes to the MySQL base functionality to help users bridge over to a new, brighter future. These changes do not need to be slow. Rapid incremental changes are likely to be easier to digest when there is a clear upgrade and downgrade path from iteration to iteration. That leaves the organizations with the biggest infrastructures a way to set their own pace through the transition, rather than being forced to take one huge leap and risk crashing into the concrete wall of unexpected incompatibility.

A few such incremental community improvements I learned a great deal about during the week were: the performance and scalability improvements by Google and Percona, and their MySQL 5.4 equivalents; the Xtrabackup utility, not only an alternative to but an improvement on the Innobackup tool, which has significant limitations in large-scale deployments; and the Tungsten Replicator, which provides useful cross-database replication and rapid failover features that help upgrades and transitions to new database installations while minimizing downtime and impact on users. I'm also curious about the storage engine development by Primebase - I don't think there's ultimately a lot of room for multiple transactional storage engines, but as a competitive research topic, it's certainly good to see alternatives to InnoDB.

[Be sure to check out my earlier posts of the conference learnings as well!]

Monday 27 April 2009

Database innovation on MySQL

If MySQL's core server development and release process has been somewhat of a frustration to the userbase over the past few years, clearly another part of the ecosystem has thrived in ways which brought exciting fruit to the Expo part of this year's conference. MySQL has become a hub of innovation in both transactional and analytics databases in ways which have turned many of my concerns to enthusiasm.

I've already discussed the technologies for data analytics on MySQL, in particular Infobright's storage engine technology. This year I took the opportunity to learn a bit more about their appliance-based competitor Kickfire as well, and it certainly looks like a solid product. I still don't completely understand what the "SQL chip" in their appliance does, but certainly the combination of special-purpose columnar storage, a high-speed memory interface and high-performance indexing should form the basis for a great analytics system. How it compares in practice to Infobright's software-only approach, time will tell. I'd be interested in real-world experiences, so if you have some to share, please get in touch. Finally, I missed the Calpont info myself, but once it is released, I'll try to get the time to try it out.

I'm even more excited about the new solutions on the transactional side of things. I've certainly been among the people frustrated by MySQL/InnoDB's scaling issues on modern hardware, and glad to see that the optimization work done by Google, Innobase and Percona is being accepted into the "mainline" MySQL Enterprise Server. However, what I did not expect to see were the solutions shown by Virident and Schooner for accelerated, Flash-based storage appliances. It's interesting how both of these companies have chosen to apply their platforms to accelerate both InnoDB and Memcached, and I'm looking forward to the chance to spend more time with both solutions. While both are Flash-based approaches, they seem to have made very different architectural choices in the way they're exposing the memory to the software layer, and I'm curious to see the impact those choices have on both IO and storage capacity scaling. In any event, these are unique technologies unlike what I've seen for other platforms at this time. I need to learn how they plan to work with the community and Sun/Oracle in keeping the solutions functionally compatible with the standard MySQL server.

The ecosystem doesn't end at the appliances, though. On the software side of things, I was pleasantly surprised by the state of Primebase's PBXT storage engine as well as Continuent's new Tungsten Replicator. While both are still early in their development path, they seem to hold a lot of promise for improving the performance of MySQL's built-in functionality in InnoDB as well as in the replication subsystem. Robert Hodges's demo of Tungsten's set-up and management also looked like it will greatly simplify replication administration, which is a big deal for anyone who has to manage 20+ replicated database systems. What's more, if Robert and his team crack the multi-threaded replication problem, a major scalability concern is lifted.

[Be sure to check out my earlier posts of the conference learnings as well!]

Sunday 26 April 2009

MySQL 2009-2010 roadmap

The development model for MySQL Enterprise took a big step forward with the new community process Karen Padir announced in her Tuesday keynote. This is great for both the open source server as well as enterprise customers, because the closer the tie between the community and the development path, the better the quality and faster the progress towards new functionality. I'm not entirely sure everyone at Sun still completely understands why a working community process is a benefit for the enterprise customer base, but I'm happy steps are made in the right direction, and it seems to me that Karen Padir is going to be a good leader for the product.

A big improvement, for sure, and still there's more to improve here. To borrow the words of Baron Schwartz, MySQL currently "has" a community, while it would really be to everyone's benefit if instead MySQL would "be" a community. I would suggest that the goal should be not monthly "community" releases from Sun, but a completely out-in-the-open development process with the community members being in the driving seat regarding patch acceptance, quality management and releases, much like the Fedora process works. Sure, there's a role for corporate sponsorship and project management, but it's a distinct difference of responsibility. The Drizzle project is another good example of how this can work. An important point to realize here is that there is a difference between the community, an active partner in the process of making the software better, and the unpaid userbase. The latter is an acquisition and conversion vehicle for the former, but they're separate entities.

The announcement of the 5.4 server was at the same time an encouraging as well as confusing example of the changes. I would like to be enthusiastic about it, but we've seen MySQL (if not Sun) pre-announce releases that didn't appear before, and it's a long way to the promised release time. I asked many, many MySQL staff members two questions during the week: why was 5.4 announced now but slated for GA release only in December, when it clearly demonstrates massive scalability improvements already, and why is the feature list for the final 5.4 release much longer than what's already completed? I did not get a really coherent answer from anyone. Best I could decipher, there is somewhere a faceless "marketing" which decided that a) there should only be one release announced and b) 40% demonstrated improvement is not good enough when it's not the only improvement that can be made. I also learned that it's not unlikely that much of the work which has gone into 5.4.0-beta would be backported to the 5.1 branch and released in a 5.1 point release before the actual 5.4 release, because in fact the changes can be considered bugfixes.

I consider myself not entirely inexperienced in the decision processes for release management, and know intimately the clarity hindsight provides to well-intentioned choices made with the best available information. I know there are many areas to consider, and every decision made is a compromise. I still can't bring myself to completely understand what exactly led to this particular approach. Let's recap:

  • Improvements already made are announced and made available in beta test form, but the beta does not contain everything planned for the release
  • The final release is intentionally delayed by 7 months, adding significant project risk to it, despite having no previously committed release schedule
  • The former release version is planned to be improved by making significant performance-altering changes in a point release in order to offset the delay
  • Such a release adds risk to the maintenance roadmap and steals away upgrade motivation from the upcoming version

How this plan serves either Sun, the community, the free userbase or the enterprise customers is a mystery to me. It would certainly seem far simpler and clearer to take an aggressive quality assurance and release testing position with the intent to push 5.4 out as a rock-solid replacement upgrade to 5.1 as soon as possible, and only then continue with further updates as a 5.5 release. This would definitely be welcomed by everyone but the class of enterprise customers who like to hear about future versions two years in advance - but keep in mind that such conservative enterprises are not MySQL's primary customer base anyway, and if MySQL is to make inroads there, rapidly improving the quality and performance of the product in the meantime would still be a sensible step.

There is the argument that if I want to get those performance features now, I can use Percona/XtraDB or MySQL 5.1 plus the InnoDB Plugin. While technically that route does work, and clearly is worth pursuing as a user, it does have its drawbacks in terms of requiring multiple sources, and it's hard to see how it supports MySQL/Sun's commercial interests, which surely were a consideration in the 5.4 release plans.

Thus far in the argument I have ignored one new component - Oracle. That's because to my understanding the process I've discussed did not consider the acquisition, which was unknown to most people before Monday. Clearly this changes a few points. It's not necessarily in the interests of Oracle for MySQL to continue making inroads to enterprise customers, though if someone's going to be cannibalizing Oracle's database sales, it might as well be Oracle. InnoDB Plugin will also be a product from the same company as MySQL Server in the near future - in fact, most likely before the final GA release of MySQL 5.4. What is the role of a delayed 5.4 release in this equation, then?

Recap of MySQL Conference 2009

This was an interesting week for sure. Of course, we all know it started with a bit of shocking news, but that's not nearly the most interesting bit about the conference. I'm posting a series of cleaned-up notes and opinions about what I saw there as I finish them. I will also try to link to further information where I've seen good notes. Please leave more links in the comments if you have any!

Thursday 23 April 2009

Three domains of data

My MySQL Conference presentation on Tuesday discussed my practical findings on how Infobright's technology works in developing a MySQL-based data warehouse. I also touched on a more high-level question of how to select a technology for different kinds of data-related problem areas, and this article expands on that discussion.

