Apart from the primary "here's how we ended up using Infobright for data
warehousing and how is that working out" topic I'm going to
discuss in my MySQL Conf presentation I'll touch on another application,
the use of Infobright's open-source
Community Edition server for collection and storage of event logs. This is
a system we've implemented in the past couple of months to solve a number of
data management problems that were gradually becoming problematic for our
infrastructure.
We've traditionally stored structured event log data in databases for ease
of management. Since Habbo uses MySQL for most everything else, putting the log
tables in the same databases was pretty natural. However, there are significant
problems to this approach:
- MyISAM tables suffer from concurrency issues and crash-recovery is very
slow due to table consistency check
- InnoDB tables suffer from I/O bottlenecks and crash-recovery is very slow
due to rollback segment processing
- Both scale badly to hundreds of millions of rows (especially if indexing is
required), and mixing them is not a recommended practice
- Storage becomes an issue over time, especially as indexes can easily
require many times as much disk than the data, and an event log is going to
have a LOT of rows
- Partitioning has only recently become available, and before that, managing
"archive" tables needed manual effort
- Perhaps worst of all (as it's very hard to measure), if any of this is
happening on the primary DB servers, it's competing for buffer pool memory with
transactional tables, thus slowing down everything due to cache misses
Over the years, we've tackled these issues in many ways. However, with our
initial experience of scaling an Infobright installation for data warehousing
needs, a pretty simple solution became apparent, and we rapidly implemented an
asynchronous, buffered mechanism to stream data into an ICE database. We're
early with this implementation, but it has turned out to be a satisfactory
high-performance solution. Even better, it's a very simple thing to implement,
even in a clustered service spanning many hosts, as long as log tables don't
need to be guaranteed 100% complete or up-to-date to the last second. Here's a
description of the simple solution; extending that to the complex solution
providing those guarantees is left as an exercise to the reader.
Rather than running single INSERTs to a log table or writing lines to a text
file log, each server buffers a small set of events, eg for the past second in
a memory buffer. These are then sent over a message bus or lightweight RPC call
to a log server, which writes them to a log file that is closed and switched to
a new file after every megarow or every few minutes, whichever is smaller. A
second process running on this log server wakes up periodically and loads each
of these files (minus the last one, which is still being written to) into the
database with LOAD DATA INFILE.
This has multiple general benefits:
- Buffered messaging requires much less time on the "client" servers compared
to managing database connections and executing thousands of small
transactions
- The asynchronous processing ensures database contention can not produce
random delays to the normal service operation
- Batch loading of text files is implemented by every DB server, so there's
little in this implementation that is proprietary or dependent on any
particular DB solution
Using the
Infobright ICE as the
backend database provides a number of additional specific benefits:
- Excellent data load performance
- No index management, yet capability to run queries on the data without
first extracting it to another DB
- No degradation of performance as deployment size grows, as would happen
even to a MyISAM table should it have any indexes
- Compressed storage, so less spinning iron required
- Columnar datapack organization should not require table partitioning even
over long periods
This works very well for structured events. For unstructured data, a different
solution is required, which I will discuss at some later date.
Update: Mark Callaghan asked in the comments for some
quantified details. We have not spent the time to produce repeatable
benchmarks, so all I can offer on that front is anecdotal data - it's very
conclusive for
us, given it's addressing our real concerns, but less
so for others. That said, ICE does not support inserts, only batch loads, so
the solution had to be engineered to use that, which added some complexity, but
brought orders of magnitude more performance. A simple benchmark run showed
that the end-to-end performance for this exceeded 100,000 events per second
when running all parts of the client-logserver-database chain on a single
desktop machine.
Query performance depends on the queries made. Summary data is 2-3 orders of
magnitude faster to access, the bigger the dataset, the bigger the performance
benefit - but expecting that for single row accesses would disappoint badly.
Storage compression varies wildly depending on the data in question -- we've
seen up to 15:1 compression on some real-world data sets, but others (such as
storing email addresses in a varchar column) actually expand on storage. This
is why I think of this as a solution for structured, quantified event logs, not
for general unstructured log file storage.