Fishpool

Wednesday 12 December 2012

A marriage of NoSQL, reporting and analytics

Earlier today, I mentioned querymongo.com in a tweet that fired off a chat covering various database-related topics, each worth a blog post of its own, some of which I've written about here before:

One response in particular stood out as something I want to cover in a bit more detail than will fit in a Tweet:

While it's fair to say I don't think MongoDB's query syntax is pretty even in the best of circumstances, I do agree that at times, given the right kind of other tools your dev team is used to (say, a JavaScript-heavy HTML5 + Node.js environment) and an application context where objects are only semi-structured, it can be a very good fit as the online database solution. However, as I was alluding to in the original tweet and expounded on in its follow-ups, it's an absolute nightmare to try to use MongoDB as the source for any kind of reporting, and most applications need to provide reporting at some point. When you get there, you will have three choices:

  1. Drive yourselves crazy by trying to report from MongoDB, using Mongo's own query tools (the sketch just after this list shows what even a simple report looks like).
  2. Push off reporting to a 3rd party service (which can be a very, very good idea, but difficult to retrofit to contain all of your original data, too).
  3. Replicate the structured part of your database to another DBMS where you can do SQL or something very SQL-like, including reasonably accessible aggregations and joins.
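
To make the contrast concrete, here is a minimal sketch of the same simple report - monthly revenue per country - written both ways. It assumes a hypothetical orders collection with country, total and createdAt fields, uses the Node.js MongoDB driver for the pipeline, and shows the SQL you would run against a replicated copy (option 3).

    import { MongoClient } from "mongodb";

    // Option 1: the report as a MongoDB aggregation pipeline, against a
    // hypothetical "orders" collection of { country, total, createdAt } documents.
    async function monthlyRevenueByCountry(client: MongoClient) {
      return client
        .db("shop")                      // hypothetical database name
        .collection("orders")
        .aggregate([
          { $match: { createdAt: { $gte: new Date("2012-01-01") } } },
          { $group: {
              _id: { country: "$country", month: { $month: "$createdAt" } },
              revenue: { $sum: "$total" },
              orders: { $sum: 1 } } },
          { $sort: { "_id.country": 1, "_id.month": 1 } },
        ])
        .toArray();
    }

    // Option 3: the same question against a replicated SQL copy of the data.
    const monthlyRevenueSql = `
      SELECT country,
             EXTRACT(MONTH FROM created_at) AS month,
             SUM(total)                     AS revenue,
             COUNT(*)                       AS orders
      FROM orders
      WHERE created_at >= DATE '2012-01-01'
      GROUP BY country, EXTRACT(MONTH FROM created_at)
      ORDER BY country, month;
    `;

Neither is hard on its own, but as soon as the questions involve joins across collections or ad-hoc slicing, the pipeline version grows much faster than the SQL one.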

The third option unfortunately comes with the cost of maintaining two systems and making sure that all data and changes are replicated. If you do decide to go that route, please do yourself a favor and pick a system designed for reporting, instead of an OLTP system that can merely be pushed into doing reporting. Yes, that latter category includes both Postgres and MySQL - both quite capable as OLTP systems, but you already decided to handle that part with MongoDB, didn't you?
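
For what it's worth, here is a rough sketch of what that replication tends to look like in practice. All the names are hypothetical, the target is anything that speaks the Postgres wire protocol (which several of the columnar engines mentioned below do), and a real pipeline would batch or bulk-load rows and handle updates, deletes and schema drift rather than insert row by row.

    import { MongoClient } from "mongodb";
    import { Client as PgClient } from "pg";

    // Copy the structured part of recently changed orders into a reporting table.
    // Assumes the documents carry an updatedAt field to key incremental loads on.
    async function replicateOrders(mongo: MongoClient, pg: PgClient, since: Date) {
      const changed = mongo
        .db("shop")
        .collection("orders")
        .find({ updatedAt: { $gte: since } });

      for await (const doc of changed) {
        // Only the structured fields travel; the rest of the document stays in MongoDB.
        await pg.query(
          `INSERT INTO orders_reporting (order_id, country, total, created_at)
           VALUES ($1, $2, $3, $4)`,
          [String(doc._id), doc.country, doc.total, doc.createdAt]
        );
      }
    }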

Most reporting tasks are much better handled by a columnar, analytics-oriented database engine optimized for aggregations. Many have emerged in the last half-decade or so: Vertica, Greenplum, Infobright, ParAccel, and so on. It used to be that choosing one was either complicated or expensive (though I'm on record saying Infobright's open source version is quite usable), but since last week's Amazon conference and its announcements, there's a new player on the field: Amazon Redshift, apparently built on top of ParAccel and priced at $1000/TB/year. Though I've yet to have a chance to participate in its beta program and put it through its paces, I think it's pretty safe to say it's a tectonic shift in the reporting database market, as big as or bigger than the original Elastic Compute Cloud was for hosting solutions. Frankly, you'd be crazy not to use it.

Now, reporting is reporting, and many analytical questions businesses need to answer today really can't be expressed in any database query language. My own start-up, Metrify.io, is working on a few of those problems, providing cloud-based predictive tools for deciding how to serve customers before there's hard data on what kind of customers they are. We back this with a wide array of in-memory and on-disk tools which I hope to describe in more detail at a later stage. From a practical "what should you do" point of view, though -- unless you're also working on an analytics solution, leave those questions to someone who's focused on that, turn to SaaS services, and spend your own time on your business instead.

Wednesday 30 May 2012

Evolution of database access - are we coming full circle?

10gen's massive new funding round ($42M, great job guys) for MongoDB development, monitoring and support services has again caused a cascade of database-related posts in my Twitter feed and elsewhere. Many of these seem a bit one-sided to me; I think it would be good to look at where we came from before we decide where the future will take us.

SQL is built from the concepts of set theory. It's great when you're accessing aggregates, intersections, subsets, and so forth. It was developed for a real need: earlier databases forced a key-by-key API on people who were not necessarily as interested in individual records as they were in collections of those records. SQL also supports manipulation of individual records, but that's not at all where its powerful features reside.
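
As a small illustration of the difference, here are the two styles side by side, with hypothetical table and column names. The first is the kind of set-level question SQL was built for; the second is plain record-by-key access, which SQL supports but adds little to.

    // Set thinking: "which customers bought in both 2011 and 2012?" - the
    // intersection of two subsets, expressed directly.
    const repeatBuyers = `
      SELECT customer_id FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2011
      INTERSECT
      SELECT customer_id FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2012;
    `;

    // Record thinking: fetch one customer by key. SQL can do it, but here the
    // language is just a thin wrapper around a lookup.
    const oneCustomer = `SELECT * FROM customers WHERE customer_id = 42;`;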

Early databases stuck an SQL interface on record-based storage systems, which then continued to evolve to support various kinds of indexing and partitioning strategies for higher performance. These turned out to be much more reliable and easier to manage than most contemporary alternatives (such as files, horror), so applications began to be built on top of a database accessed via SQL. Many such applications were really using the database for storage of individually managed records, thus using only the least-powerful features of SQL.

Along came ORM tools, designed to hide the boring SQL bits from the application developer, whose interest was in manipulating and displaying individual records. Sadly, since the ORM lives between the application and the database, changes on either side still need to be manually reflected on the other - which is why "SQL makes developers unhappy, and MongoDB does not". The lesson here is very simple: if your application manages individual records, such as users, products, orders and the like, develop it with technologies which make record manipulation easy and allow easy evolution of the record schema. MongoDB is great at this, as sketched below. It also scales well and is pretty easy to manage. It's not the only one of its kind, but it's good at this in a way most row-based SQL databases (MySQL, PostgreSQL, Oracle, MS SQL, etc.) will never be.
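
Here is a minimal sketch of that record-by-record style, with a hypothetical users collection: creating a record, then growing its shape later without a migration step standing between the application and the change.

    import { MongoClient } from "mongodb";

    async function userLifecycle(client: MongoClient) {
      const users = client.db("app").collection("users");

      // Create a record.
      await users.insertOne({ email: "a@example.com", name: "Alice" });

      // Later the product grows a notification setting: just start writing it.
      // Older documents simply don't have the field until they are touched.
      await users.updateOne(
        { email: "a@example.com" },
        { $set: { notifications: { email: true, digest: "weekly" } } }
      );

      // Reading a record back is a single key-style lookup, no joins involved.
      return users.findOne({ email: "a@example.com" });
    }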

But SQL is still great at set theory. Reporting, analytics and the like still need aggregates more than record-by-record access to data. MongoDB is dreadful as a back-end for complex analytics (a sub-standard MapReduce interface is not a solution). Its storage model is designed for real-time access and memory-resident objects, and is thus a poor fit for truly large-scale data storage. Any data whose primary use case is aggregate reporting or complex analytics, such as event metrics (like what we do at Metrify.io), needs something else in addition to records. Columnar engines with SQL query frontends are a much better fit: they compress massive data sets down to much smaller storage requirements, which improves aggregate performance over terabytes; they scale query execution over cores and nodes (thanks to dealing with much larger, and thus easier-to-split, data sets at once); and they retain a business-analyst interface that is much friendlier than an object API.
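
As a sketch of what that analyst-facing interface buys you, here is the kind of aggregate an event-metrics workload asks for all day long, with a hypothetical events table. A columnar engine only reads the few columns the query touches and keeps them heavily compressed, which is what makes such scans tolerable over terabytes.

    // Daily activity and revenue rollup over raw events.
    const dailyRollup = `
      SELECT event_date,
             COUNT(DISTINCT user_id) AS active_users,
             COUNT(*)                AS events,
             SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) AS revenue
      FROM events
      WHERE event_date >= DATE '2012-05-01'
      GROUP BY event_date
      ORDER BY event_date;
    `;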

I do agree on one part of the NoSQL idea: for record-by-record application models, SQL is an unnecessary middle-man. Just don't forget there are other models, too.