As our data warehousing process grows
and the workflows get more complex, we've revisited the question of what tools
to use in this process. Out of curiosity, I had a look at basing such a process
on Hadoop/Hive for scalability reasons, but the
lack of mature tools and the sacrifices on efficiency that would entail meant
we're better off using something else as long as a distributed processing
platform is the only thing that can get the job done. I'm also curious about
the transition to continuous integration, a model I noticed showing up a couple
of years ago and now getting some air under its wings as CEP, IBM's Infosphere
Streams, and other similar approaches. Still, I think I'll continue to rely
on something else for a while and see how things shake out. Continuous
integration clearly is the future, but there are many ways to get there.
So, we had a look at what's going on in the Open Source data integration
field. It seems the leaders in that field are Pentaho with Kettle/Pentaho Data
Integration, and Talend with Open
Studio and Talend Integration Suite. Both seem pretty even in terms of
features. Both companies are a bit difficult to approach as a potential
customer, so I figured I should also try what would come up from the OSS
approach of just posting my thoughts on the Interweb ;)
Besides the technical pilot implementations we've made to compare basic
workflow of the various tools, below is a sample of the kind of questions we're
considering when evaluating the suitability of the tools.
Product roadmap, release schedule and size of the development team
- How often and of what scope of changes should we expect and prepare
ourselves for platform upgrades?
- Past track record on keeping to a regular updates schedule
Data lineage and dependency, Impact analysis
- How to find out which tables are being used to for deriving DWH dimensions
and facts?
Logging, auditing, monitoring on row and job level
- How to monitor and archive workflows on a row level (amount of rows being
inserted/updated/deleted)?
- How to maintain, access and query a job execution history (start time/end
time/return code)?
Version control
- How to track and restore changes in jobs?
Multi-user environment
- How can several developers work together?
Change Data Capture
- How to assist incremental loads?
Data profiling
- How can data source be examined?
Job recovery
- How to recover from possible failures in jobs (such as lost database
connection)?
Deploy jobs
- How to move jobs from one repository to another (development to testing to
production)?