Evolution of database access - are we coming full circle?
10gen's massive new funding round ($42M - great job, guys) for MongoDB development, monitoring, and support services has again caused a cascade of database-related posts in my Twitter feed and elsewhere. Many of these seem a bit one-sided to me; I think it would be good to look at where we came from before we decide where the future will take us.
SQL is built from the concepts of set theory. It's great when you're accessing aggregates, intersections, subsets, and so forth. It was developed for a real need: earlier databases forced a key-by-key API on people who were not necessarily as interested in individual records as they were in collections of those records. SQL also supports manipulation of individual records, but that's not at all where its powerful features reside.
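The contrast between the two styles is easy to see in a few lines. A minimal sketch using Python's built-in sqlite3 module and a hypothetical `orders` table (the schema and data are mine, for illustration only):

```python
import sqlite3

# In-memory database with a tiny, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 20.0), ("alice", 50.0)],
)

# Set-oriented: one declarative statement operates on the whole collection.
rows = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 80.0), ('bob', 20.0)]

# Record-oriented: fetch a single row by key - the key-by-key style
# that pre-SQL databases forced on everyone.
row = conn.execute("SELECT customer, total FROM orders WHERE id = ?", (2,)).fetchone()
print(row)  # ('bob', 20.0)
```

The first query is where SQL earns its keep; the second works, but uses SQL as little more than a key-value lookup.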
Early databases stuck an SQL interface on record-based storage systems, which then continued to evolve to support various indexing and partitioning strategies for higher performance. These turned out to be much more reliable and easier to manage than most contemporary alternatives (such as flat files - the horror), so applications began to be built on top of a database accessed via SQL. Many such applications really used the database as storage for individually managed records, and thus exercised only the least powerful features of SQL.
Along came ORM tools, designed to hide the boring SQL bits from the application developer, whose interest was in manipulating and displaying individual records. Sadly, since the ORM lives between the application and the database, changes on either side still need to be manually reflected on the other - which is why "SQL makes developers unhappy, and MongoDB does not". The lesson here is simple: if your application manages individual records - users, products, orders, and so on - develop it with technologies that make record manipulation easy and allow the record schema to evolve easily. MongoDB is great at this. It also scales well and is fairly easy to manage. It's not the only one of its kind, but it's good at this in a way most row-based SQL databases (MySQL, PostgreSQL, Oracle, MS SQL, etc.) will never be.
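To make "easy evolution of the record schema" concrete, here is a minimal sketch with plain Python dicts standing in for MongoDB documents (no server, no pymongo - the field names and records are hypothetical). The point is that documents in the same collection can carry different fields, so adding one needs no ALTER TABLE migration step:

```python
# Documents written at different times have different shapes.
users = [
    {"_id": 1, "name": "alice"},                           # original shape
    {"_id": 2, "name": "bob", "email": "bob@example.com"}  # "email" added later
]

# Application code manipulates individual records directly...
by_id = {u["_id"]: u for u in users}
by_id[1]["email"] = "alice@example.com"  # backfill an old record in place

# ...and simply tolerates records that predate the new field.
emails = [u.get("email", "<none>") for u in users]
print(emails)  # ['alice@example.com', 'bob@example.com']
```

In a fixed-schema SQL table the same change means a migration plus coordinated updates to the ORM mapping - exactly the friction described above.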
But SQL is still great at set theory. Reporting, analytics, and the like still need aggregates more than record-by-record access to data. MongoDB is dreadful as a back-end for complex analytics (a sub-standard MapReduce interface is not a solution). Its storage model is designed for real-time access to memory-resident objects, and is thus far from optimal for truly large-scale data storage. Any data whose primary use case is aggregate reporting or complex analytics, such as event metrics (like what we do at Metrify.io), needs something beyond records. Columnar engines with SQL query frontends are a much better fit: they compress massive data sets down to far smaller storage requirements, improving aggregate performance over terabytes; they scale query execution across cores and nodes (since they deal with much larger, easier-to-split data sets at once); and they retain a business-analyst interface that is far friendlier than an object API.
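A toy illustration of why the columnar layout wins for aggregates - this is not a real columnar engine, just made-up event data showing the two ideas: an aggregate only has to scan the one compact column it needs, and low-cardinality columns compress very well (here with simple run-length encoding):

```python
# Row store: one tuple per event (type, value).
events = [
    ("click", 1), ("click", 1), ("click", 3),
    ("view", 2), ("view", 2), ("view", 2),
]

# Column store: one contiguous list per attribute.
event_type = [e[0] for e in events]
value      = [e[1] for e in events]

# Aggregation touches only the column it needs, never whole rows.
total = sum(value)
print(total)  # 11

def rle(col):
    """Run-length encode a column: [(value, run_length), ...]."""
    out = []
    for v in col:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

# Sorted, repetitive columns collapse dramatically.
print(rle(event_type))  # [('click', 3), ('view', 3)]
```

Real engines layer dictionary encoding, vectorized execution, and per-node partitioning on top of this, but the storage-layout intuition is the same.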
I do agree with one part of the NoSQL idea: for record-by-record application models, SQL is an unnecessary middleman. Just don't forget that there are other models, too.