As our data warehousing process grows and the workflows get more complex, we've revisited the question of what tools to use in this process. Out of curiosity, I had a look at basing such a process on Hadoop/Hive for scalability reasons, but the lack of mature tools and the sacrifices on efficiency that would entail meant we're better off using something else as long as a distributed processing platform is the only thing that can get the job done. I'm also curious about the transition to continuous integration, a model I noticed showing up a couple of years ago and now getting some air under its wings as CEP, IBM's Infosphere Streams, and other similar approaches. Still, I think I'll continue to rely on something else for a while and see how things shake out. Continuous integration clearly is the future, but there are many ways to get there.
So, we had a look at what's going on in the Open Source data integration field. It seems the leaders in that field are Pentaho with Kettle/Pentaho Data Integration, and Talend with Open Studio and Talend Integration Suite. Both seem pretty even in terms of features. Both companies are a bit difficult to approach as a potential customer, so I figured I should also try what would come up from the OSS approach of just posting my thoughts on the Interweb ;)
Besides the technical pilot implementations we've made to compare basic workflow of the various tools, below is a sample of the kind of questions we're considering when evaluating the suitability of the tools.
Product roadmap, release schedule and size of the development team
- How often and of what scope of changes should we expect and prepare ourselves for platform upgrades?
- Past track record on keeping to a regular updates schedule
Data lineage and dependency, Impact analysis
- How to find out which tables are being used to for deriving DWH dimensions and facts?
Logging, auditing, monitoring on row and job level
- How to monitor and archive workflows on a row level (amount of rows being inserted/updated/deleted)?
- How to maintain, access and query a job execution history (start time/end time/return code)?
Version control
- How to track and restore changes in jobs?
Multi-user environment
- How can several developers work together?
Change Data Capture
- How to assist incremental loads?
Data profiling
- How can data source be examined?
Job recovery
- How to recover from possible failures in jobs (such as lost database connection)?
Deploy jobs
- How to move jobs from one repository to another (development to testing to production)?