Three domains of data
By Osma on Thursday 23 April 2009, 18:15 - Permalink
My MySQL Conference presentation on Tuesday discussed my practical findings on how Infobright's technology works in developing a MySQL-based data warehouse. I also touched on a more high-level question of how to select a technology for a different kinds of data-related problem areas, and this article expands on that discussion.
Most models of building applications have not really taken these changes in the balance into account. In particular, database designs continue to focus on 3NF OLTP schemas to the exclusion of considering data classes and to the detriment of the overall application. For a large and growing class of web applications, the database has not been the only data layer for some time, but the division of labor has not really been clear in application design. Here is one way to look at that division.
The primary data pattern,
represented by the top of the illustrated triangle, is characterized by the
requirements high availability and rapid access, often measured in
sub-millisecond level. The first requirement demands redundant distribution of
data, while the latter excludes rotating media as the primary storage. On the
other hand, for many of these applications cross-session consistency
requirements can be relaxed slightly, making aggressive caching an attractive
solution. Since scaling application performance already requires increasing the
number of CPU cores and memory is relatively plentiful compared to the late-90s
designs, this has made clustered caches with local, network-synchronized copy
of data the design best practice. Memcached is a clear example of this model,
while application developers are waiting for more sophisticated mechanisms like
MySQL Cluster 7 to mature.
The second pattern, formerly considered the primary domain, is the ACID-compliant transactional database, at the left corner of my triangle. Compared to the former area, consistency, even across sessions and users is a critical requirement, while the demand of sub-millisecond access can be somewhat relaxed at least for the worst-case, if not on average. For example, management of transactions with direct monetary value attached to them continues to reside within this domain. It is never acceptable for the total amount of recorded value to decrease or increase without external audit trail, nor is it ok for the ownership of money to unexplainably change. Even the continuous availability of the data can be considered a secondary requirement to its consistency and durability.
The final corner of this model is the event record or data time-series for purposes ranging from audit to analytics and business intelligence. While again durability, integrity and consistency of data are of paramount interest here, the desired performance characteristics are radically different. Storage of new data can not slow down or disrupt other parts of the application while access to stored data is typically in summary form and can commonly tolerate seconds, sometimes even minutes (or more!) of latency. This bifurcated performance requirement makes the typical OLTP database an entirely impractical solution, while the extreme volumes of data (at rates of tens or hundreds of thousands, even millions of new events per second) exclude the memory-based model.
It's for this last class of applications I believe the column-store technologies, one of which I described as the log processing solution earlier are a great fit. I think it's high time for application frameworks, best practices and design guidelines to take this into account, and for us to retire the concept of a row-based transactional OLTP database, accessed via ORM libraries like Hibernate or ActiveRecord as the exclusive method of manipulating application data. This is not to say that the OLTP model doesn't continue to be a crucial part of the overall solution - it continues to be an equal partner in this triplet, but no longer to the exclusion of other technologies.