Fishpool


Wednesday 12 December 2012

A marriage of NoSQL, reporting and analytics

Earlier today, I mentioned querymongo.com in a tweet that fired off a chat covering various database-related topics, each worth a blog post of their own, some of which I've written about here before.

One response in particular stood out as something I want to cover in a bit more detail than will fit in a tweet.

While it's fair to say I don't think MongoDB's query syntax is pretty in the best of circumstances, I do agree that it can be a very good fit as the online database solution when it matches the rest of your team's toolchain (say, a JavaScript-heavy HTML5 + Node.js environment) and the application's objects are only semi-structured. However, as I was alluding to in the original tweet and expounded on in its follow-ups, it's an absolute nightmare to try to use MongoDB as the source for any kind of reporting, and most applications need to provide reporting at some point. When you get there, you will have three choices:

  1. Drive yourselves crazy by trying to report from MongoDB, using Mongo's own query tools.
  2. Push reporting off to a third-party service (which can be a very, very good idea, but difficult to retrofit so that it also covers all of your original data).
  3. Replicate the structured part of your database to another DBMS where you can do SQL or something very SQL-like, including reasonably accessible aggregations and joins.
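To make the syntax contrast concrete, here's a small illustrative sketch: the same "revenue per day" report expressed both as a MongoDB aggregation pipeline and as SQL. The `orders` collection/table and its fields are invented for this example; sqlite3 merely stands in for a SQL engine.

```python
# Illustrative comparison: revenue per day from a hypothetical "orders"
# collection/table. Schema and names are made up for the example.
import sqlite3

# The report as a MongoDB aggregation pipeline (shown as plain data;
# you'd pass this to db.orders.aggregate(...)).
mongo_pipeline = [
    {"$group": {"_id": "$day", "revenue": {"$sum": "$amount"}}},
    {"$sort": {"_id": 1}},
]

# The same question in SQL.
sql = "SELECT day, SUM(amount) AS revenue FROM orders GROUP BY day ORDER BY day"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("2012-12-10", 5.0), ("2012-12-10", 7.5), ("2012-12-11", 3.0)])
print(conn.execute(sql).fetchall())  # → [('2012-12-10', 12.5), ('2012-12-11', 3.0)]
```

Both produce the same answer; the difference is how far the notation scales once the ad-hoc questions pile up, which is exactly where reporting work lives.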

The third option unfortunately comes with the cost of maintaining two systems and making sure that all data and changes are replicated. If you do decide to go that route, please do yourself a favor and pick a system designed for reporting, instead of an OLTP system that can do reporting when pushed to. Yes, that latter category includes both Postgres and MySQL - both quite capable as OLTP systems, but you already decided to handle OLTP with MongoDB, didn't you?
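The replication in option 3 doesn't have to be fancy; the core of it is an incremental sync that flattens the structured fields of each changed document into a SQL row. A minimal sketch, with an invented document shape, a made-up `users_report` table, and sqlite3 standing in for the reporting DBMS:

```python
# Sketch of option 3: periodically copy the structured part of each
# document into a SQL reporting table. sqlite3 stands in for the real
# reporting DBMS; document shape and field names are invented.
import sqlite3

def sync(docs, conn, last_sync):
    """Upsert the structured fields of documents changed since last_sync."""
    changed = [d for d in docs if d["updated_at"] > last_sync]
    conn.executemany(
        "INSERT OR REPLACE INTO users_report (id, plan, updated_at) VALUES (?, ?, ?)",
        [(d["_id"], d.get("plan"), d["updated_at"]) for d in changed],
    )
    # Return the new high-water mark for the next run.
    return max((d["updated_at"] for d in changed), default=last_sync)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_report (id TEXT PRIMARY KEY, plan TEXT, updated_at INTEGER)")
docs = [
    {"_id": "u1", "plan": "free", "updated_at": 100, "misc": {"nested": "ignored"}},
    {"_id": "u2", "plan": "pro", "updated_at": 200},
]
last = sync(docs, conn, last_sync=0)  # first run copies both documents
```

The unstructured leftovers (the `misc` blob above) simply stay behind in Mongo; the reporting side only needs the fields you actually aggregate on.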

Most reporting tasks are much better managed using a columnar, analytics-oriented database engine optimized for aggregations. Many have appeared in the last half-decade: Vertica, Greenplum, Infobright, ParAccel, and so on. It used to be that choosing one might be either complicated or expensive (though I'm on record saying Infobright's open source version is quite usable), but since last week's Amazon conference and its announcements, there's a new player on the field: Amazon Redshift, apparently built on top of ParAccel and priced at $1000/TB/year. Though I've yet to have a chance to participate in its beta program and put it through its paces, I think it's pretty safe to say it's as big a tectonic shift in the reporting-database market as the original Elastic Compute Cloud was for hosting, or bigger. Frankly, you'd be crazy not to use it.
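Why columnar engines win at aggregation can be shown with a toy sketch: store the same rows column-by-column, and a sum touches only the one column it needs instead of dragging every field of every row through memory. Real engines add compression and vectorized execution on top of this layout; the data below is invented.

```python
# Toy illustration of row vs columnar layout. Same three records, two
# physical arrangements.
rows = [("u1", "2012-12-10", 5.0), ("u2", "2012-12-10", 7.5), ("u3", "2012-12-11", 3.0)]

# Column-oriented: one contiguous list per column.
columns = {
    "user":   [r[0] for r in rows],
    "day":    [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

# The aggregation scans a single list - the "user" and "day" values are
# never read at all.
total = sum(columns["amount"])
print(total)  # → 15.5
```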

Now, reporting is reporting, and many analytical questions businesses need to solve today really can't be expressed with any sort of database query language. My own start-up, Metrify.io, is working on a few of those problems, providing cloud-based predictive tools to decide how to serve customers before there's hard data on what kind of customers they are. We back this with a wide array of in-memory and on-disk tools which I hope to describe in more detail at a later stage. From a practical "what should you do" point of view though -- unless you're also working on an analytics solution, leave those questions to someone who's focused on that, turn to SaaS services, and spend your own time on your business instead.

Saturday 1 September 2012

Make Amazon EC2 control go faster

Do you run enough EC2 systems to care about the time it takes to start one or check its status, but not enough to justify an account at Scalr or RightScale to manage them? Do you care about automating instance management? Are you working in a team where several people need to provision or access servers? If so, consider this quickstart to a better way of setting up your cluster. Apologies for the acronym soup - blame Amazon Web Services for that.

  1. Turn on IAM for your AWS account, so that you can create an account for every team member separately. While you're there, I'd also recommend you turn on Multi Factor Authentication (MFA) for each account. You can use Google Authenticator (Android or iPhone) to provide the MFA tokens, even if you're not securing your Google account with it. Thanks, Tuomo, for pointing that out; I had thought MFA required keyfob tokens.
  2. Don't leave IAM yet. Go to the Roles tab and create a new Role (I call mine InstanceManager) with a Power User Access policy template.
  3. Move on to the EC2 management console and create a new instance. It has to be a new one; you can't attach this awesome power to an instance you already have. For practice, use the Quick Launch Wizard -- I'll go through this step by step.
  4. Name your instance. I call mine Master. Let's assume you already have a Key Pair you know how to use with EC2. Choose that.
  5. Choose the Amazon Linux AMI 2012.03 as your launch config. Hey, it's a decent OS, and if you like Ubuntu better, you can repeat this with your favorite AMI later once you know how it works.
  6. Choose Edit details on the next page of the wizard. We'll do changes in several places.
  7. Make the Type t1.micro; this instance won't do much more than manage other instances, so it doesn't need a lot of oomph. I would recommend turning on Termination Protection to avoid a silly mistake later on.
  8. Tag it as you wish.
  9. If you're using Security Groups (warmly recommended!), choose one which has access to your other servers.
  10. Here's the important bit: on the Advanced Details tab, choose the IAM role you created earlier (InstanceManager, if you followed my naming).
  11. No need to change anything in the Storage config. Click Save details, then Launch.

The instance will come up like any other; you'll probably know how that works. If you're used to something other than Amazon Linux, this one expects you to log in as ec2-user, and you can sudo from there to root. Set up your own account and secure the box to your best effort, since this one holds the keys to everything you're running on EC2.

Now, why did we do all this?

  1. Log in. With a regular account, no root, no keys copied from anywhere else.
  2. Type ec2-describe-instances to the shell.
  3. Witness a) fast response b) with all your instances listed. a) comes from running inside the AWS data center, and b) is the IAM Role magic.
  4. Rejoice that your teammates will not need to manage their own access secrets. You did secure the master account and SSH to this box, right?
  5. Try to launch something else. Yup, it all works.
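The "IAM Role magic" in step 3 works because tools on the instance fetch short-lived credentials from the instance metadata service, so no keys ever need to be copied to the box. A sketch of what that looks like, using the role name from earlier; the actual fetch only succeeds when run on an EC2 instance with that role attached:

```python
# Sketch: where role credentials come from on an EC2 instance. The
# metadata endpoint returns JSON with AccessKeyId, SecretAccessKey,
# Token, and Expiration for the attached role. Fetching only works
# from inside EC2, so here we just build and show the URL.
import json
import urllib.request

METADATA_BASE = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

def credentials_url(role_name):
    return METADATA_BASE + role_name

def fetch_credentials(role_name):
    # Works only on an EC2 instance with the role attached.
    with urllib.request.urlopen(credentials_url(role_name), timeout=2) as resp:
        return json.loads(resp.read())

print(credentials_url("InstanceManager"))
# → http://169.254.169.254/latest/meta-data/iam/security-credentials/InstanceManager
```

The AWS command-line tools and SDKs do this lookup for you automatically, which is why `ec2-describe-instances` just works with no configured keys.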

Setting up the IAM Role and associating one with an instance through the command line is somewhat more involved, so this is much easier to do from the web console as above. The IAM docs do explain how, but I wasted an hour or two getting my head wrapped around why the console talked of roles while the API and command line needed profiles (answer: the console hides the profiles beneath defaults). If you wish to have your own applications manage pieces of the AWS infrastructure, and hate the hoops you have to jump through to pass the required access keys around, IAM Roles are what you're looking for, and you'll want to read up on the API in a bit more detail. Now you've been introduced.