To content | To menu | To search

Tag - Infobright

Entries feed - Comments feed

Wednesday 12 December 2012

A marriage of NoSQL, reporting and analytics

Earlier today, I mentioned in a tweet that fired off a chat covering various database-related topics each worth a blog post of their own, some of which I've written about here before:

One response in particular stood out as something I want to cover in a bit more detail that will fit in a Tweet:

While it's fair to say I don't think MongoDB's query syntax is pretty in the best of circumstances, I do agree that at times, given the right kind of other tools your dev team is used to (such as, when you're developing in a JavaScript-heavy HTML5 + Node.js environment) and the application's context is one where objects are only semi-structured, it can be a very good fit as the online database solution. However, as I was alluding to in the original tweet and expounded on in its follow-ups, it's an absolute nightmare to try to use MongoDB as the source for any kind of reporting, and most applications need to provide reporting at some point. When you get there, you will have three choices:

  1. Drive yourselves crazy by trying to report from MongoDB, using Mongo's own query tools.
  2. Push off reporting to a 3rd party service (which can be a very, very good idea, but difficult to retrofit to contain all of your original data, too).
  3. Replicate the structured part of your database to another DBMS where you can do SQL or something very SQL-like, including reasonably accessible aggregations and joins.

The third option will unfortunately come with the cost of having to maintain two systems and making sure that all data and changes are replicated. If you do decide to go that route, please do yourself a favor and pick a system designed for reporting, instead of an OLTP system that can simply do reporting, when pushed to do so. Yes, that latter category includes both Postgres and MySQL - both quite capable as OLTP systems, but you already decided to do that using MongoDB, didn't you?

Most reporting tasks are much better managed using a columnar, analytics-oriented database engine optimized for aggregations. Many have spawned in the last half a decade or so: Vertica, Greenplum, Infobright, ParAccel, and so on. It used to be that choosing to use one might be either complicated or expensive (though I'm on record saying Infobright's open source version is quite usable), but since last week's Amazon conference and its announcements, there's a new player on the field: Amazon Redshift, apparently built on top of ParAccel, priced at $1000/TB/year. Though I've yet to have a chance to participate in its beta program and put it through its paces, I think it's pretty safe to say it's a tectonic shift on the reporting databases market as big or bigger as the original Elastic Compute Cloud was to hosting solutions. Frankly, you'd be crazy not to use it.

Now, reporting is reporting, and many analytical questions businesses need to solve today really can't be expressed with any sort of database query language. My own start-up, is working on a few of those problems, providing cloud-based predictive tools to decide how to serve customers before there's hard data what kind of customers they are. We back this with a wide array of in-memory and on-disk tools which I hope to describe in more detail at a later stage. From a practical "what should you do" point of view though -- unless you're also working on an analytics solution, leave those questions to someone who's focused on that, turn to SaaS services and spend your own time on your business instead.

Saturday 3 October 2009

Some scaling observations on Infobright

A couple of days ago, Baron Schwartz posted some simple load and select benchmarking of MyISAM, Infobright and MonetDB, which Vadim Tkachenko followed up with a more realistic dataset and interesting figures where MonetDB beat Infobright in most queries.

Used to the parallel IEE loader, I was surprised by the apparent slow loading speed of Baron's benchmark and decided to try and replicate it. I installed Infobright 3.2 on my laptop (see, this is very unscientific) and wrote a simple perl script to generate and load an arbitrarily large data set resembling Baron's description. I'm not going to post my exact numbers, because this installation is severely resource-constrained below Infobright's recommended smallest installation. However, you can reproduce the results yourself with the attached script, and I will note some observations.

Continue reading...

Monday 21 September 2009

A peek under the hood in Infobright 3.2 storage engine

I've been meaning to post some real-world data on the performance of the Infobright 3.2 release which happened a few weeks ago after an extended release candidate period. We're just preparing our upgrades now, so I don't have any performance notes over significant data sets or complicated queries to post quite yet.

To make up for that, I decided to address a particular annoyance of mine in the community edition, first because it hadn't been addressed in the 3.2 release (and really, I'm hoping doing this would include it into 3.2.1), and second, simply because the engine being open source means I can. I feel being OSS is one of Infobright's biggest strengths, in addition to being a pretty amazing piece of performance for such a simple, undemanding package in general, and not making use of that would be shame. Read on for details.

Continue reading...

Wednesday 8 April 2009

Using the Infobright Community Edition for event log storage

Apart from the primary "here's how we ended up using Infobright for data warehousing and how is that working out" topic I'm going to discuss in my MySQL Conf presentation I'll touch on another application, the use of Infobright's open-source Community Edition server for collection and storage of event logs. This is a system we've implemented in the past couple of months to solve a number of data management problems that were gradually becoming problematic for our infrastructure.

We've traditionally stored structured event log data in databases for ease of management. Since Habbo uses MySQL for most everything else, putting the log tables in the same databases was pretty natural. However, there are significant problems to this approach:

  • MyISAM tables suffer from concurrency issues and crash-recovery is very slow due to table consistency check
  • InnoDB tables suffer from I/O bottlenecks and crash-recovery is very slow due to rollback segment processing
  • Both scale badly to hundreds of millions of rows (especially if indexing is required), and mixing them is not a recommended practice
  • Storage becomes an issue over time, especially as indexes can easily require many times as much disk than the data, and an event log is going to have a LOT of rows
  • Partitioning has only recently become available, and before that, managing "archive" tables needed manual effort
  • Perhaps worst of all (as it's very hard to measure), if any of this is happening on the primary DB servers, it's competing for buffer pool memory with transactional tables, thus slowing down everything due to cache misses

Over the years, we've tackled these issues in many ways. However, with our initial experience of scaling an Infobright installation for data warehousing needs, a pretty simple solution became apparent, and we rapidly implemented an asynchronous, buffered mechanism to stream data into an ICE database. We're early with this implementation, but it has turned out to be a satisfactory high-performance solution. Even better, it's a very simple thing to implement, even in a clustered service spanning many hosts, as long as log tables don't need to be guaranteed 100% complete or up-to-date to the last second. Here's a description of the simple solution; extending that to the complex solution providing those guarantees is left as an exercise to the reader.

Rather than running single INSERTs to a log table or writing lines to a text file log, each server buffers a small set of events, eg for the past second in a memory buffer. These are then sent over a message bus or lightweight RPC call to a log server, which writes them to a log file that is closed and switched to a new file after every megarow or every few minutes, whichever is smaller. A second process running on this log server wakes up periodically and loads each of these files (minus the last one, which is still being written to) into the database with LOAD DATA INFILE.

This has multiple general benefits:

  • Buffered messaging requires much less time on the "client" servers compared to managing database connections and executing thousands of small transactions
  • The asynchronous processing ensures database contention can not produce random delays to the normal service operation
  • Batch loading of text files is implemented by every DB server, so there's little in this implementation that is proprietary or dependent on any particular DB solution
Using the Infobright ICE as the backend database provides a number of additional specific benefits:
  • Excellent data load performance
  • No index management, yet capability to run queries on the data without first extracting it to another DB
  • No degradation of performance as deployment size grows, as would happen even to a MyISAM table should it have any indexes
  • Compressed storage, so less spinning iron required
  • Columnar datapack organization should not require table partitioning even over long periods
This works very well for structured events. For unstructured data, a different solution is required, which I will discuss at some later date.

Update: Mark Callaghan asked in the comments for some quantified details. We have not spent the time to produce repeatable benchmarks, so all I can offer on that front is anecdotal data - it's very conclusive for us, given it's addressing our real concerns, but less so for others. That said, ICE does not support inserts, only batch loads, so the solution had to be engineered to use that, which added some complexity, but brought orders of magnitude more performance. A simple benchmark run showed that the end-to-end performance for this exceeded 100,000 events per second when running all parts of the client-logserver-database chain on a single desktop machine.

Query performance depends on the queries made. Summary data is 2-3 orders of magnitude faster to access, the bigger the dataset, the bigger the performance benefit - but expecting that for single row accesses would disappoint badly. Storage compression varies wildly depending on the data in question -- we've seen up to 15:1 compression on some real-world data sets, but others (such as storing email addresses in a varchar column) actually expand on storage. This is why I think of this as a solution for structured, quantified event logs, not for general unstructured log file storage.

Wednesday 19 November 2008

Looking for a ETL engineer for our BI team

So, I mentioned earlier that I was looking at Infobright's Brighthouse technology as a storage backend for heaps and heaps of traffic and user data from Habbo. Turns out it works fine (now that it's in V3 and supports more of the SQL semantics), and we took it into use. Been pretty happy with that, and I expect to talk more about the challenge and our solution at the next MySQL Conference in April 2009.

However, our DWH team needs extra help. If you're interested in solving business analytics problems by processing lots of data and the idea of working in a company that leads the virtual worlds industry excites you, let us know by sending us an application. Thanks for reading!

Monday 15 September 2008

Infobright BI tools go open source

I've mentioned Infobright before as an interesting solution to getting more performance to BI analytics solutions. Today's news are interesting: Sun invests in the company, and the baseline product is open sourced. Too busy to write more about it today, but I'm certainly watching this one closely.

Tuesday 22 April 2008

MySQL Users Conference followup and MySQL's business model

Last week saw MySQL User Conference 2008 in Santa Clara, but I was not able to make time for it this year either. However, in the wake of Sun's acquisition of MySQL, it was very interesting to follow what was going on. A few things that caught my attention:

MySQL 5.1 is nearing General Availability and an interesting storage engine plugin ecosystem starts to emerge. It's this latter, but related event that I see as the first real sign of validation for MySQL's long-ago chosen path of pluggable storage systems instead of focused effort on making one good general-use engine.

Oracle/Innobase announced InnoDB Plugin for MySQL 5.1, which much-awaited features which promise a great deal of help for daily management headaches. More than that, InnoDB Plugin's release under GPL lifts quite a lot of the concern I'm sure many users like us have had about the future viability of InnoDB as MySQL storage engine.

A couple of data warehousing solutions are launched, also based on MySQL 5.1 -- Infobright is one I've already researched somewhat (looks very interesting, as soon as a few current limitations are lifted), Kickfire I know nothing about right now but would love to learn more of.

There's a huge amount of coverage graciously provided by Baron Schwartz that I have yet to fully browse through.

A few remarks by Mårten Mickos regarding MySQL's business model seem to have kicked up a bit of a sandstorm. I don't really understand why; I read these to just verify that the direction MySQL took last year is to continue this year as well. I don't see any major changes here regarding the licensing structure, software availability, or support models. Frankly, it seems like yet another case of Slashdot readers not reading, let alone understanding, what they're protesting against, and press following up on the noise.

I do understand the critique made against MySQL's chosen model, though. In fact, I went on record last September to say that I understand that critique. I still see the same issues here. I believe we represent a fairly common profile of a MySQL Enterprise customer in that what we want from it is not the bleeding-edge functionality but a stable, well-tested product that we can expect to get help for if something does go wrong. We don't see great value in having access to a version of software that isn't generally available to "less advanced" or more adventurous users for free in a community version. In fact, we see it as a negative that such functionality exists, because it hasn't received the community testing, feedback and improvements that makes great open source software as good as it is. While new functionality is interesting, and we're trying to spend time getting familiar with new stuff in order to use it in production later, it simply isn't prudent to put business-critical data in a system that hasn't received real-world testing by as large a community as possible (unless you have no other alternative, and then you takes your chances).

Yet it seems to me that this is essentially what Sun/MySQL continue to propose for the Enterprise customers by delivering "value add" functionality in a special version of the server or plugins to it, possibly in a closed-source form that further reduces transparency and introduces risk. Mårten, I'd prefer it to be otherwise. How can I help you change your mind about this?