What we're looking for in a data integration tool
By Osma on Wednesday 27 May 2009, 20:05 - Permalink
As our data warehousing process grows and the workflows get more complex, we've revisited the question of what tools to use in this process. Out of curiosity, I had a look at basing such a process on Hadoop/Hive for scalability reasons, but the lack of mature tools and the sacrifices on efficiency that would entail meant we're better off using something else as long as a distributed processing platform is the only thing that can get the job done. I'm also curious about the transition to continuous integration, a model I noticed showing up a couple of years ago and now getting some air under its wings as CEP, IBM's Infosphere Streams, and other similar approaches. Still, I think I'll continue to rely on something else for a while and see how things shake out. Continuous integration clearly is the future, but there are many ways to get there.
So, we had a look at what's going on in the Open Source data integration field. It seems the leaders in that field are Pentaho with Kettle/Pentaho Data Integration, and Talend with Open Studio and Talend Integration Suite. Both seem pretty even in terms of features. Both companies are a bit difficult to approach as a potential customer, so I figured I should also try what would come up from the OSS approach of just posting my thoughts on the Interweb ;)
Besides the technical pilot implementations we've made to compare basic workflow of the various tools, below is a sample of the kind of questions we're considering when evaluating the suitability of the tools.
Product roadmap, release schedule and size of the development team
- How often and of what scope of changes should we expect and prepare ourselves for platform upgrades?
- Past track record on keeping to a regular updates schedule
Data lineage and dependency, Impact analysis
- How to find out which tables are being used to for deriving DWH dimensions and facts?
Logging, auditing, monitoring on row and job level
- How to monitor and archive workflows on a row level (amount of rows being inserted/updated/deleted)?
- How to maintain, access and query a job execution history (start time/end time/return code)?
Version control
- How to track and restore changes in jobs?
Multi-user environment
- How can several developers work together?
Change Data Capture
- How to assist incremental loads?
Data profiling
- How can data source be examined?
Job recovery
- How to recover from possible failures in jobs (such as lost database connection)?
Deploy jobs
- How to move jobs from one repository to another (development to testing to production)?
Comments
Why limit yourself to these features when doing an evaluation? Why not go for the full blown list of 38 subsystems as defined by Ralph Kimball himself?
http://www.intelligententerprise.co...
The question was non-rhetorical. I guess what is comes down to is that there is always a price to pay for all those extra features in terms of complexity, monetary value, setup time, learning, etc.
Pentaho data integration is quite busy filling in the gap with work started on most of the items you listed with the exception probably of Change Data Capture (It's a methodology, not a feature) but even there we have plans to do continuous integration with 3rd party tools. The great care that we take in the process is making sure that most (if not all) of the new features remain optional and that complexity stays low. As far as pricing is concerned, *anything* is cheaper than the classical closed source vendors. That is *especially* the case if you add up the price for all these extras. I guess that explains why a lot of customers choose to simply check in their PDI transformations and jobs in their own version control system (like open source Subversion). Or why they would use a third party (open or closed source) profiling tool on the side. (it's only mildly related to ETL anyway) Hybrid solutions can mean a lot of cost savings.
Another important question to answer is that of relative importance of the features you listed:
- How important is the feature (nice-to-have or critical)?
- How much money can you spend on it and still have a ROI?
- Can you shop around for the feature in question to reduce costs?
- Do you really have the luxury of paying any price or (more likely) are there limits? Where are these limits?
- etc.
All these things will have a huge impact on the decision making process.
Take care,
Matt
I'm sure I'll get slammed for this, but I'm not sure why you're limiting yourself to only open source vendors.
Would you build a Data Warehouse only on MySQL or would you also look at Oracle, Teradata or Vertica or Netezza (granted, some of those products have open source components in them, but plenty of proprietary magic too).
Pentaho and Talend are so far behind products like PowerCenter (INFA) or Pervasive Data Integrator. You'll spend a fortune over the lifetime of the project filling in the functionality "delta" between full-function mature products and the open source integration products.
Just cuz a dog is free doesn't mean it's not very expensive.
Winslow, you certainly won't be slammed by me -- that's an eminently reasonable question. It's just one which I've skipped in this public discussion.
We are already using commercial data integration products. However, their licensing mechanisms are a distinct drawback, and we typically extract significantly more value from open source tools. Yes, at times they can't offer as rich of a feature set, but the development model just suits our practices better.
That said, it's not a done deal by any measure, and when we started work in this arena, proprietary solutions won the argument with ease.