Fishpool


Tag - enterprise


Wednesday 27 May 2009

What we're looking for in a data integration tool

As our data warehousing process grows and the workflows get more complex, we've revisited the question of what tools to use in this process. Out of curiosity, I had a look at basing such a process on Hadoop/Hive for scalability reasons, but the lack of mature tools and the sacrifices in efficiency that would entail mean we're better off using something else for as long as a distributed processing platform isn't the only thing that can get the job done. I'm also curious about the transition to continuous integration, a model I noticed showing up a couple of years ago and now getting some air under its wings as CEP, IBM's InfoSphere Streams, and other similar approaches. Still, I think I'll continue to rely on something else for a while and see how things shake out. Continuous integration clearly is the future, but there are many ways to get there.

So, we had a look at what's going on in the Open Source data integration field. It seems the leaders in that field are Pentaho with Kettle/Pentaho Data Integration, and Talend with Open Studio and Talend Integration Suite. Both seem pretty even in terms of features. Both companies are a bit difficult to approach as a potential customer, so I figured I should also try the OSS approach of just posting my thoughts on the Interweb and seeing what comes up ;)

Besides the technical pilot implementations we've made to compare the basic workflows of the various tools, below is a sample of the kinds of questions we're considering when evaluating their suitability.

Product roadmap, release schedule and size of the development team

  • How often, and for changes of what scope, should we expect to prepare for platform upgrades?
  • Past track record of keeping to a regular update schedule

Data lineage and dependency, Impact analysis

  • How to find out which tables are used for deriving DWH dimensions and facts?

Logging, auditing, monitoring on row and job level

  • How to monitor and archive workflows on a row level (number of rows being inserted/updated/deleted)?
  • How to maintain, access and query a job execution history (start time/end time/return code)? (A sketch of the kind of bookkeeping we mean follows below.)
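
To make that concrete, here's a minimal illustration of the audit table and history query we have in mind; the table and column names are purely hypothetical, not taken from Kettle, Talend, or our own schema:

    CREATE TABLE job_audit (
        audit_id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
        job_name      VARCHAR(128) NOT NULL,
        start_time    DATETIME NOT NULL,
        end_time      DATETIME NULL,
        return_code   INT NULL,
        rows_inserted INT UNSIGNED NOT NULL DEFAULT 0,
        rows_updated  INT UNSIGNED NOT NULL DEFAULT 0,
        rows_deleted  INT UNSIGNED NOT NULL DEFAULT 0,
        PRIMARY KEY (audit_id),
        KEY job_start (job_name, start_time)
    ) ENGINE=InnoDB;

    -- The last ten runs of a given job, newest first
    SELECT job_name, start_time, end_time, return_code,
           rows_inserted, rows_updated, rows_deleted
    FROM job_audit
    WHERE job_name = 'load_sales_fact'
    ORDER BY start_time DESC
    LIMIT 10;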

Version control

  • How to track and restore changes in jobs?

Multi-user environment

  • How can several developers work together?

Change Data Capture

  • How to assist incremental loads? (One common high-watermark approach is sketched below.)
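
For comparison, this is roughly the high-watermark pattern we'd otherwise have to hand-roll for incremental loads; all table and column names here are hypothetical:

    -- 1. Read the point where the previous load ended
    SELECT last_loaded_at INTO @watermark
    FROM etl_watermark
    WHERE source_table = 'orders';

    -- 2. Stage only the source rows changed since then
    INSERT INTO stage_orders
    SELECT o.*
    FROM source_db.orders AS o
    WHERE o.updated_at > @watermark;

    -- 3. Advance the watermark once the load has succeeded
    -- (a real implementation would use MAX(updated_at) of the staged
    -- rows rather than NOW() to avoid missing late commits)
    UPDATE etl_watermark
    SET last_loaded_at = NOW()
    WHERE source_table = 'orders';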

Data profiling

  • How can data sources be examined?

Job recovery

  • How to recover from possible failures in jobs (such as a lost database connection)?

Deploy jobs

  • How to move jobs from one repository to another (development to testing to production)?

Sunday 24 May 2009

Hello, MySQL 6.0, err, something

I'm conflicted about the latest twist in the MySQL release saga, i.e. the announcement of the 6.0.11 alpha version and the accompanying note that it's the last 6.0 release and will be replaced by the already-discussed milestone model. From an engineering point of view, I think this is the right step, though I can't be sure, because I can't really tell what engineering model has been chosen: trunk-first, then backport; or fix-in-release, then forward-port. I also can't tell whether the milestone model is going to be timeboxed or feature-scoped. Personally, I would prefer the former of both alternatives.

From a customer point of view, I'm even more confused, though much less concerned. Okay, so 6.0 won't become the marketing version number of any MySQL Enterprise release? Doesn't matter. 5.4 needs to come out first anyway, preferably sooner with a concrete, well-tested feature set than later with more planned-but-unfinished features stuffed in it. What the release after that is going to be called makes no difference to me, as long as it also contains solid improvements and comes out on a predictable schedule that doesn't force me to look for something drastically different in order to deal with scale.

That being said, it's still weird. If the thought of a 6.0 GA release is scrapped, why release anything and still call it 6.0? I guess it's just tying up loose ends, but that's an engineering thing, and only the number of existing source branches with stuff to merge together matters, not the version number put on it...

Monday 15 September 2008

Infobright BI tools go open source

I've mentioned Infobright before as an interesting solution for getting more performance out of BI analytics. Today's news is interesting: Sun invests in the company, and the baseline product is open sourced. Too busy to write more about it today, but I'm certainly watching this one closely.

Saturday 10 May 2008

Right move, MySQL

Again a week late, but hey, I only need to keep up with this stuff, not comment on it all the time. MySQL changed their minds, and it turns out the core server will continue to be open source, allowing customers to depend on being able to inspect it if required, extend any bit as needed, and most importantly, get the benefits of a large community using and testing all features. Thanks for that. I just hope you're going to be consistent about this, for precisely the reason that as a MySQL Enterprise customer, I don't pay you to deliver bits that haven't received that community testing, but to rapidly fix problems if they turn up despite that exposure.

It was interesting to hear Monty Widenius comment on it at this week's Open Tuesday event, and I also got to talk to him about attending a MySQL Users session in Helsinki the next time I or someone else (anyone? anyone? Bueller?) manage to organize one. It would be nice to hear about the upcoming storage engines straight from the horse's mouth - Monty's Maria effort has certainly received less coverage than the Falcon engine I have also commented on, and I can't claim to know anything about it myself.

Tuesday 22 April 2008

MySQL Users Conference followup and MySQL's business model

Last week saw MySQL User Conference 2008 in Santa Clara, but I was not able to make time for it this year either. However, in the wake of Sun's acquisition of MySQL, it was very interesting to follow what was going on. A few things that caught my attention:

MySQL 5.1 is nearing General Availability, and an interesting storage engine plugin ecosystem is starting to emerge. It's this latter, related development that I see as the first real sign of validation for MySQL's long-ago chosen path of pluggable storage systems instead of a focused effort on making one good general-use engine.

Oracle/Innobase announced the InnoDB Plugin for MySQL 5.1, with much-awaited features that promise a great deal of help for daily management headaches. More than that, the InnoDB Plugin's release under the GPL lifts much of the concern I'm sure many users like us have had about the future viability of InnoDB as a MySQL storage engine.

A couple of data warehousing solutions were launched, also based on MySQL 5.1 -- Infobright is one I've already researched somewhat (it looks very interesting, as soon as a few current limitations are lifted); Kickfire I know nothing about right now but would love to learn more about.

There's a huge amount of coverage graciously provided by Baron Schwartz that I have yet to fully browse through.

A few remarks by Mårten Mickos regarding MySQL's business model seem to have kicked up a bit of a sandstorm. I don't really understand why; I read them as just confirming that the direction MySQL took last year will continue this year as well. I don't see any major changes here regarding the licensing structure, software availability, or support models. Frankly, it seems like yet another case of Slashdot readers not reading, let alone understanding, what they're protesting against, and the press following up on the noise.

I do understand the critique made against MySQL's chosen model, though. In fact, I went on record last September to say that I understand that critique. I still see the same issues here. I believe we represent a fairly common profile of a MySQL Enterprise customer in that what we want from it is not the bleeding-edge functionality but a stable, well-tested product that we can expect to get help for if something does go wrong. We don't see great value in having access to a version of software that isn't generally available to "less advanced" or more adventurous users for free in a community version. In fact, we see it as a negative that such functionality exists, because it hasn't received the community testing, feedback and improvements that make great open source software as good as it is. While new functionality is interesting, and we're trying to spend time getting familiar with new stuff in order to use it in production later, it simply isn't prudent to put business-critical data in a system that hasn't received real-world testing by as large a community as possible (unless you have no other alternative, and then you takes your chances).

Yet it seems to me that this is essentially what Sun/MySQL continue to propose for the Enterprise customers by delivering "value add" functionality in a special version of the server or plugins to it, possibly in a closed-source form that further reduces transparency and introduces risk. Mårten, I'd prefer it to be otherwise. How can I help you change your mind about this?

Sunday 7 October 2007

MySQL and materialized views

I'm working on alternative strategies to make the use and maintenance of a multi-terabyte data warehouse implementation tolerably fast. For example, it's clear that a reporting query on a 275-million row table is not going to be fun by anyone's definition, but for most purposes it can be pre-processed into various aggregate tables of significantly smaller size.

However, what is not obvious is what would be the best strategy for creating those tables. I'm working with MySQL 5.0 and Business Objects' Data Integrator XI, so I have a couple of options.

I can just CREATE TABLE ... SELECT ... to see how things work out. This approach is simple to try, but essentially unmaintainable; no good.
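
Something along these lines, with illustrative names standing in for the real schema:

    -- Quick-and-dirty daily aggregate; all names are illustrative only
    CREATE TABLE agg_sales_daily
    SELECT sale_date,
           product_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS sale_count
    FROM fact_sales
    GROUP BY sale_date, product_id;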

I can define the process as a BODI data flow. This is good in many respects, as it creates a documented flow of how the aggregates are updated, is fairly easy to hook up to the workflows which pull in new data from source systems, and allows monitoring of the update processes. However, it's also quite work-intensive to create all those objects with the "easy" GUIs compared to just writing a few simple SQL statements. There are also some SQL constructs that are horribly complicated to express in BODI; in particular, COUNT(DISTINCT ..) is ugly.

Or I could create the whole process with views on the original fact table, with triggered updates of a materialized view table in the database. It would still be fairly nicely documentable, thanks to the straightforward structure of the views, and very maintainable, as the updates would be automatic. A deferred update mechanism, with a trigger keeping track of which parts of the materialized view need updating and a periodic refresh via a stored procedure, would keep things nicely in sync. MySQL 5.0 even has all of the necessary functionality.
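
Sketched out against the same illustrative tables as above (and simplified to INSERTs only; a real version would cover UPDATE and DELETE as well), the idea would look roughly like this:

    -- The trigger only records which days are dirty; the heavy lifting
    -- happens in the periodic refresh procedure.
    CREATE TABLE agg_sales_daily_dirty (
        sale_date DATE NOT NULL PRIMARY KEY
    ) ENGINE=InnoDB;

    DELIMITER //

    CREATE TRIGGER fact_sales_mark_dirty
    AFTER INSERT ON fact_sales
    FOR EACH ROW
    BEGIN
        INSERT IGNORE INTO agg_sales_daily_dirty (sale_date)
        VALUES (NEW.sale_date);
    END//

    CREATE PROCEDURE refresh_agg_sales_daily()
    BEGIN
        -- Rebuild only the days that have changed since the last refresh
        DELETE FROM agg_sales_daily
        WHERE sale_date IN (SELECT sale_date FROM agg_sales_daily_dirty);

        INSERT INTO agg_sales_daily (sale_date, product_id, total_amount, sale_count)
        SELECT f.sale_date, f.product_id, SUM(f.amount), COUNT(*)
        FROM fact_sales AS f
        JOIN agg_sales_daily_dirty AS d ON d.sale_date = f.sale_date
        GROUP BY f.sale_date, f.product_id;

        DELETE FROM agg_sales_daily_dirty;
    END//

    DELIMITER ;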

Except... it's only there in theory. The performance of views and triggers is so horrible that any such implementation would totally destroy the usability of the system. MySQL's views only work as a statement merge when there is a one-to-one relationship between base table and view rows; in other words, the view cannot contain SUM(), AVG(), COUNT() or any of the other mechanisms which would have been the whole point of the materialized view in question. It falls back to a temp table implementation in these cases, and creating a GROUP BY temp table over 275 million rows without applying the WHERE clause is pure madness.
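
For example, an aggregate view like the following, which is exactly the shape a materialized view definition would need, can't be merged; the names are again illustrative:

    CREATE ALGORITHM = MERGE VIEW v_sales_daily AS
    SELECT sale_date, SUM(amount) AS total_amount
    FROM fact_sales
    GROUP BY sale_date;
    -- MySQL warns that the MERGE algorithm can't be used and switches to
    -- TEMPTABLE, so querying the view with a WHERE on sale_date still
    -- groups the entire fact table into a temporary table first.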

In addition, defining any triggers, however simple, slows bulk loads to the base tables by an order of magnitude. I could of course still work around triggers by implementing the equivalent logging in each BODI workflow and creating the materialized views and a custom stored proc to update each one, but having a view there in between was the only way to make this approach maintainable. Damn, there goes that strategy.

Friday 21 September 2007

MySQL Community vs Enterprise tension

I probably don't spend quite enough time following progress around MySQL, considering how critical the product is to us. I'd like to consider it part of the infrastructure in the way I treat Red Hat Enterprise Linux, i.e. something I can trust to make good progress and only follow up on a quarterly basis. Naturally we have people who watch both much more closely, but my time simply should be, and pretty much is, spent doing something else.

However, it seems MySQL really demands a bit more attention right now. Today I went and read Jeremy Cole's opinion about MySQL Community (a failure), and I have to say I agree on many of the points. MySQL simply has not yet found a model that works as well as Red Hat's Fedora vs Enterprise Linux split - that is, really giving the Community edition to the community to direct, and using the Enterprise edition as a platform for enterprises to depend on.

I feel the fundamental problem really is quite simple: as long as MySQL maintains the community edition (both the binaries AND the source tree) themselves, and doesn't let the community integrate features into it on a timely basis, the model will not work, not even for their paying customers (us included). However, if they reverse this particular point of the current status quo, all of the other benefits are inevitable.

The comparison to Fedora and RHEL is rather obvious, despite the distribution vs single product differences. Fedora is a great community Linux distribution with the latest-and-greatest features integrated into it in a very timely fashion. Not even Ubuntu can really compete with Fedora in terms of features. However, what Fedora gives up to reach this is a certain amount of polish and reliability. I will happily use Fedora as a personal platform because of the latest features, but I would not pretend to run a stable system on top of it. For that, I'd rather choose something a bit more mature, something that has proven itself in the community and received further QA ahead of commercial release. This is RHEL, and this is what MySQL Enterprise should be: a version that, when it's released, I shouldn't have to hesitate to install on a new production server.

I also today learned about the Dorsal Source MySQL community release. Now this looks like something MySQL Community release probably should be like. I'll have to give it a test round and see what's up.

Update: Baron Schwartz describes a MySQL Enterprise that I would have far less trouble using than the existing one.