Fishpool


Tag - data warehousing


Saturday 3 October 2009

Some scaling observations on Infobright

A couple of days ago, Baron Schwartz posted some simple load and select benchmarking of MyISAM, Infobright and MonetDB, which Vadim Tkachenko followed up with a more realistic dataset and interesting figures where MonetDB beat Infobright in most queries.

Used to the parallel IEE loader, I was surprised by the apparently slow loading speed in Baron's benchmark and decided to try to replicate it. I installed Infobright 3.2 on my laptop (see, this is very unscientific) and wrote a simple Perl script to generate and load an arbitrarily large data set resembling Baron's description. I'm not going to post my exact numbers, because this installation is severely resource-constrained, well below Infobright's recommended minimum configuration. However, you can reproduce the results yourself with the attached script, and I will note some observations.

Continue reading...

Monday 21 September 2009

A peek under the hood in Infobright 3.2 storage engine

I've been meaning to post some real-world data on the performance of the Infobright 3.2 release, which came out a few weeks ago after an extended release candidate period. We're just preparing our upgrades now, so I don't have any performance notes over significant data sets or complicated queries to post quite yet.

To make up for that, I decided to address a particular annoyance of mine in the community edition, first because it hadn't been addressed in the 3.2 release (and really, I'm hoping that doing this gets the fix included in 3.2.1), and second, simply because the engine being open source means I can. I feel being OSS is one of Infobright's biggest strengths, in addition to it being a pretty amazing performer for such a simple, undemanding package in general, and not making use of that would be a shame. Read on for details.

Continue reading...

Wednesday 27 May 2009

What we're looking for in a data integration tool

As our data warehousing process grows and the workflows get more complex, we've revisited the question of what tools to use in this process. Out of curiosity, I had a look at basing such a process on Hadoop/Hive for scalability reasons, but the lack of mature tools and the sacrifices in efficiency that would entail mean we're better off using something else until a distributed processing platform is the only thing that can get the job done. I'm also curious about the transition to continuous integration, a model I noticed showing up a couple of years ago and now getting some air under its wings in the form of CEP, IBM's InfoSphere Streams, and other similar approaches. Still, I think I'll continue to rely on something else for a while and see how things shake out. Continuous integration clearly is the future, but there are many ways to get there.

So, we had a look at what's going on in the Open Source data integration field. It seems the leaders in that field are Pentaho with Kettle/Pentaho Data Integration, and Talend with Open Studio and Talend Integration Suite. Both seem pretty even in terms of features. Both companies are a bit difficult to approach as a potential customer, so I figured I should also try what would come up from the OSS approach of just posting my thoughts on the Interweb ;)

Besides the technical pilot implementations we've made to compare the basic workflows of the various tools, below is a sample of the kinds of questions we're considering when evaluating the suitability of the tools.

Product roadmap, release schedule and size of the development team

  • How often, and with what scope of changes, should we expect and prepare for platform upgrades?
  • What is the past track record on keeping to a regular update schedule?

Data lineage and dependency, Impact analysis

  • How to find out which tables are being used for deriving DWH dimensions and facts?

Logging, auditing, monitoring on row and job level

  • How to monitor and archive workflows on a row level (number of rows being inserted/updated/deleted)?
  • How to maintain, access and query a job execution history (start time/end time/return code)?
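To make the logging question concrete, here is a minimal sketch of the kind of job execution history we have in mind, written in Python against SQLite purely for illustration. The table name `job_history`, the helper names `init_history` and `run_job`, and the convention that a job returns a dict of row counts are all assumptions of this sketch, not features of Kettle, Talend or any other tool.

```python
import sqlite3
import time


def init_history(conn):
    """Create the (hypothetical) job_history table if it does not exist."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS job_history (
            run_id        INTEGER PRIMARY KEY AUTOINCREMENT,
            job_name      TEXT NOT NULL,
            start_time    REAL NOT NULL,
            end_time      REAL,
            return_code   INTEGER,
            rows_inserted INTEGER DEFAULT 0,
            rows_updated  INTEGER DEFAULT 0,
            rows_deleted  INTEGER DEFAULT 0
        )
        """
    )
    conn.commit()


def run_job(conn, job_name, job):
    """Run job() and record start/end time, return code and row counts.

    Assumption: job() returns a dict with optional keys
    'inserted', 'updated' and 'deleted'; an exception means failure.
    """
    start = time.time()
    cur = conn.execute(
        "INSERT INTO job_history (job_name, start_time) VALUES (?, ?)",
        (job_name, start),
    )
    run_id = cur.lastrowid
    counts = {}
    return_code = 0
    try:
        counts = job() or {}
    except Exception:
        return_code = 1  # any failure is recorded as a non-zero return code
    conn.execute(
        """
        UPDATE job_history
           SET end_time = ?, return_code = ?,
               rows_inserted = ?, rows_updated = ?, rows_deleted = ?
         WHERE run_id = ?
        """,
        (
            time.time(),
            return_code,
            counts.get("inserted", 0),
            counts.get("updated", 0),
            counts.get("deleted", 0),
            run_id,
        ),
    )
    conn.commit()
    return return_code
```

With something like this in place, the history can be queried with plain SQL, for example to list failed runs with `SELECT job_name, start_time FROM job_history WHERE return_code <> 0`. What we want to know is how much of this each tool provides out of the box.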

Version control

  • How to track and restore changes in jobs?

Multi-user environment

  • How can several developers work together?

Change Data Capture

  • How to assist incremental loads?
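As an illustration of what an incremental load involves at its simplest, the sketch below copies only rows above a stored high-water mark. It is an assumption-laden example: the `events` table, its `id` and `payload` columns, and the `incremental_load` helper are hypothetical, and this approach only picks up new rows with increasing ids. It does not detect updates or deletes, which is exactly where proper Change Data Capture support in a tool would help.

```python
import sqlite3


def incremental_load(src, dst, watermark):
    """Copy rows with id > watermark from src.events to dst.events.

    Returns the new watermark (the highest id copied), or the old
    watermark unchanged if there was nothing new to load.
    """
    rows = src.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (watermark,),
    ).fetchall()
    dst.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", rows)
    dst.commit()
    return rows[-1][0] if rows else watermark
```

The caller is responsible for persisting the returned watermark between runs, so that a rerun after a failure picks up from the last successfully loaded id.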

Data profiling

  • How can a data source be examined?

Job recovery

  • How to recover from possible failures in jobs (such as lost database connection)?

Deploy jobs

  • How to move jobs from one repository to another (development to testing to production)?

Thursday 23 April 2009

Three domains of data

My MySQL Conference presentation on Tuesday discussed my practical findings on how Infobright's technology works in developing a MySQL-based data warehouse. I also touched on the more high-level question of how to select a technology for different kinds of data-related problem areas, and this article expands on that discussion.

Continue reading...

Wednesday 19 November 2008

Looking for an ETL engineer for our BI team

So, I mentioned earlier that I was looking at Infobright's Brighthouse technology as a storage backend for heaps and heaps of traffic and user data from Habbo. Turns out it works fine (now that it's in V3 and supports more of the SQL semantics), and we took it into use. Been pretty happy with that, and I expect to talk more about the challenge and our solution at the next MySQL Conference in April 2009.

However, our DWH team needs extra help. If you're interested in solving business analytics problems by processing lots of data and the idea of working in a company that leads the virtual worlds industry excites you, let us know by sending us an application. Thanks for reading!