Some scaling observations on Infobright
A couple of days ago, Baron Schwartz posted some simple load and select benchmarking of MyISAM, Infobright and MonetDB, which Vadim Tkachenko followed up with a more realistic dataset and interesting figures where MonetDB beat Infobright in most queries.
Used to the parallel IEE loader, I was surprised by the apparent slow loading speed of Baron's benchmark and decided to try and replicate it. I installed Infobright 3.2 on my laptop (see, this is very unscientific) and wrote a simple perl script to generate and load an arbitrarily large data set resembling Baron's description. I'm not going to post my exact numbers, because this installation is severely resource-constrained below Infobright's recommended smallest installation. However, you can reproduce the results yourself with the attached script, and I will note some observations.
First of all, this was run on a 1.8GHz Core 2 Duo 2GB RAM laptop running a 64-bit kernel and 64-bit ICE. I stopped most other programs for the duration of the test, but was still running Fedora 11's GNOME Desktop, and gave ICE only 400MB main heap, 200MB compressed heap and 300MB loader heap. What I found:
- Loading speed is almost a linear function of table width. Every time I doubled the number of (random integer) columns in the loaded table, loading speed approximately halved. At 200 columns, I was seeing approximately the same speed as Baron.
- On the other hand, loading speed is NOT affected by the number of rows loaded, from which one could assume that it won't be affected by the size of the pre-existing table either (though I did not test this).
- SELECT speed is not affected by the number of columns in the table, unless those columns are being selected or used in the where constraint. The same select executed in the same speed regardless of whether the source table had 10, 100 or 500 columns which were not being inspected.
- Loading order, column sorting and select constraints are strongly correlated. Limiting a query with a random-value column causes a query which reads 10%, 50% or 90% of the table to execute in approximately 1x, 2x and 3x time. On the other hand, replacing the random-value column in the where constraint with a sorted-value (in load order) column makes queries accessing the same 10%, 50% or 90% of the rows run in nearly constant time. This is the rough set "knowledge grid" in action, but only works if the aggregations done are sum(), min(), max() or other simple functions supported by the grid.
- With large dataset (I stopped at 50x heap space), the constant scale factor in the previous query starts to deteriorate, as the entire knowledge grid no longer fits in heap, and inspecting 90% of it vs 10% will require I/O. At this point the scale is beginning to resemble a realistic production setting, as few people are able to host even 5% of their ADBMS working set in-memory.
- So this is where things get really interesting, and performance characteristics shift around. Random-column constrained queries of 10%, 50% and 90% of rows now run at 1x, 1.1x and 1.3x time, while constant-column constraints execute at 1x, 1.3x and 3x their respective performance, but approximately 3000x faster than the random-constraints!
- The last point shows why it really matters to Infobright workloads that the most frequent queries are taken into account when deciding the load order for the data set. Even when designing an incremental ETL process, it can pay off immensely to pre-sort the incremental data sets by the most likely constraint or group by columns to allow blocks of data (64k rows each) to be included or excluded for particular query plans.
- Why is the random-column constrained queries executing in almost constant time regardless of the rows inspected? Because Infobright's columnar datapack engine accesses 64k rows at a time, it's nearly as expensive to access every tenth row as it is to access nine rows out of every ten, if the distribution of those rows is even across the data set. On the other hand, if the 10% of rows needed are clustered together, then the other 90% of the data set is skipped very early in the query optimization process. A traditional btree-indexed data set would still require a random sweep over most of the index, which would be much larger than Infobright's "knowledge grid" is.