Behind the Scenes at QueryTree: The Anatomy of a Web Based Data Tool

QueryTree is a web-based drag-and-drop tool for doing simple data analysis. Users drag and connect tools in a chain to first load in data, then manipulate that data (filter, group, sort, etc.) and then visualise it on a chart or map. I started building QueryTree by myself in between freelance projects about a year ago. More recently I’ve been able to devote more time to it and hire another developer to help me. In this post I talk about some of the tools and technologies we used to build QueryTree, and why.

The User Interface

QueryTree’s drag-and-drop interface was the first part of the tool to be developed. There was a time when I would do demos from a single static HTML file with some JavaScript in it – which involved a certain amount of smoke and mirrors! The tool is all about the interface and I wanted to see how people reacted to it. Suffice it to say that people liked it, and that encouraged me to continue working towards an MVP.

The UI is built using an HTML page and a few JavaScript frameworks:

  • RequireJS – this handles the modularisation of all the JavaScript and makes it easy to load new scripts asynchronously. Having modules for our own code has been great, but getting other JavaScript frameworks integrated into Require has been a real pain. I’ve used a variety of tweaks to the require shim to set up dependencies and exported symbols, and in some cases I’ve just wrapped third party libraries in require modules to keep them happy.
  • jQuery – naturally. I’m also using some jQuery UI elements such as dialog, draggable and button.
  • Knockout – this lets us template the HTML markup that we want and then bind those templates to JavaScript object models. There are “purer” frameworks such as Backbone or Angular, but Knockout had a great tutorial, which is what first got me into it, and the framework has a minimal feel which suited my incremental development process. I started with some very simple HTML, I knockout-ified it, I added some more HTML, and somewhere along the way the QueryTree UI got built. I really like it and have never found it holding me back.
  • Fabric.js – this framework wraps the standard canvas API and gives you some extras like an object model for your shapes and SVG rendering. It also offers event handling, but I’m not currently using that as the jQuery events were more familiar.
  • Flot – this is a JavaScript charting library. It uses the canvas element and I’m using it primarily because it’s fast. One of the things about QueryTree is that the user has complete flexibility to define whatever charts they like, which means they could do something stupid like try to plot 100,000 points. There are charting libraries I tried which have plenty of features, but which ground to a complete halt if the user did something bad.
  • Less – I’ve worked on so many projects where the CSS became a real mess and nobody dared refactor it for fear of never being able to get things looking right again. So for this project I took the decision to be a bit more organised from the start. Using Less gave us a couple of weapons with which to beat back the CSS spaghetti; the nesting was the most useful, as it enabled us to be specific about which elements we wanted to style without adding lots of code.
  • Handsontable – The table tool uses this plugin to display an editable table to the user. Crucially, it has support for copying and pasting in from spreadsheets and HTML tables, which is fantastic.
  • Some free SVG icons
  • This reset.css script

The Middle Tier

Once the UI was mostly built and demoing well to interested parties, I started on the middle tier. My language of choice these days is Python and, as I’d been taking advantage of Google App Engine's free usage allowance for hosting my static HTML and JS, it was totally natural (almost accidental) to start adding bits in Python. As is standard on App Engine, we’re using Jinja for templating and webapp2 for HTTP handlers.

First off, I started working on a JSON API that the UI could use to save the current state of the worksheet and to request the data for a particular tool. From the UI side I built objects (subsets of the Knockout model objects) then just stringified them before using jQuery to POST them to a URL. At the server end I started out using json.loads and json.dumps and have never looked back. It just works and it’s simple so I never felt the need to use a framework for my JSON API.
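
To give a flavour of how thin that layer is, here’s a minimal sketch of what one of those handlers could look like on webapp2 (the handler name, URL and fields are invented for illustration; this isn’t QueryTree’s actual API):

import json

import webapp2

class SaveWorksheetHandler(webapp2.RequestHandler):
    """Hypothetical handler: accepts a stringified worksheet posted by the UI."""
    def post(self):
        # The UI POSTs the JSON string it built from its Knockout models
        worksheet = json.loads(self.request.body)

        # ...persist the worksheet, kick off the tool, etc...
        result = {"status": "ok", "tool_count": len(worksheet.get("tools", []))}

        # Reply with plain JSON; no API framework required
        self.response.headers["Content-Type"] = "application/json"
        self.response.write(json.dumps(result))

app = webapp2.WSGIApplication([("/api/worksheet/save", SaveWorksheetHandler)])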

For user settings and worksheet setup I just used the Google App Engine Datastore. I felt nervous about locking us into App Engine but, again, it’s so easy to get hooked. If we had to, though, I’m sure we could replace it with something like MongoDB: it’s basically just a framework for persisting Python objects, so the GQL queries and the .put() methods are the only places where you really interact with it.
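
For illustration, a worksheet model on the old db API might look something like this (the model and property names are invented, not QueryTree’s real schema):

from google.appengine.api import users
from google.appengine.ext import db

class Worksheet(db.Model):
    """Hypothetical model: one record per saved worksheet."""
    owner = db.UserProperty()
    title = db.StringProperty()
    config_json = db.TextProperty()   # the stringified tool chain from the UI
    created = db.DateTimeProperty(auto_now_add=True)

# Saving is just .put() ...
user = users.get_current_user()
sheet = Worksheet(owner=user, title="Sales analysis", config_json='{"tools": []}')
sheet.put()

# ... and reading back is a GQL query
my_sheets = db.GqlQuery("SELECT * FROM Worksheet WHERE owner = :1", user)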

When it comes to data processing, well, we cheated. Each worksheet in QueryTree is allocated to a MySQL database and any data that you upload to a worksheet is stored in a special table for that tool. When you add chains of data manipulation tools to that worksheet and connect them to a data source, the Python code “compiles” that into SQL and runs it on the MySQL database. So, in effect, QueryTree is just an SQL development environment. When you click on a tool, the query is run and your data is returned as JSON objects over HTTP, which the UI then renders in the results area at the bottom of the screen.
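
The “compiling” step is easier to picture with a toy example. This isn’t QueryTree’s actual code – the tool settings, table name and SQL shapes are invented – but the general idea is to walk the chain from the data source outwards and wrap each tool’s contribution around the query built so far:

def compile_chain(source_table, tools):
    """Illustrative only: fold a chain of tool settings into one SQL string.

    Column names would need whitelisting, and filter values are left as %s
    placeholders to be passed to the MySQL driver separately."""
    sql = "SELECT * FROM `{0}`".format(source_table)
    params = []
    for tool in tools:
        if tool["type"] == "filter":
            sql = "SELECT * FROM ({0}) AS t WHERE `{1}` {2} %s".format(
                sql, tool["column"], tool["operator"])
            params.append(tool["value"])
        elif tool["type"] == "sort":
            sql = "SELECT * FROM ({0}) AS t ORDER BY `{1}` {2}".format(
                sql, tool["column"], tool["direction"])
    return sql, params

# A worksheet with a filter tool feeding into a sort tool...
chain = [
    {"type": "filter", "column": "amount", "operator": ">", "value": 100},
    {"type": "sort", "column": "amount", "direction": "DESC"},
]
sql, params = compile_chain("upload_1234", chain)
# sql nests one SELECT per tool:
#   SELECT * FROM (SELECT * FROM (SELECT * FROM `upload_1234`) AS t
#                  WHERE `amount` > %s) AS t ORDER BY `amount` DESC
# params -> [100]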

The Back End

As I said above, all the data ends up in a MySQL database somewhere. We have a system for allocating worksheets to a database and can grow the pool of databases as required. We use cloud hosted MySQL but could manage our own if we wanted; the web app just needs to know how to connect to the databases. We also have workers which clear down tables that are no longer being used, to free up space.
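
As a rough sketch of the MySQL side of such a clean-up worker (the function, the config dict and the idea of being handed a list of orphaned table names are all invented for illustration, not our actual worker code):

import MySQLdb

def clear_down_tables(db_config, orphaned_tables):
    """Hypothetical worker: drop upload tables whose worksheets no longer exist.

    orphaned_tables would come from the app's own records of which worksheets
    are still live; this sketch only does the MySQL side."""
    conn = MySQLdb.connect(**db_config)
    try:
        cursor = conn.cursor()
        for table_name in orphaned_tables:
            cursor.execute("DROP TABLE IF EXISTS `{0}`".format(table_name))
        conn.commit()
    finally:
        conn.close()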

Keeping the data in this way does cap the amount of data that QueryTree can handle on one worksheet at however much a single MySQL database can hold. In practice, though, that data has to come in over HTTP, either as a file upload or a web fetch, so the databases are not the limiting factor.

Other Bits

In no particular order, we’re using the following additional tools:

  • Joyride – Our product tour is built using Joyride, a JavaScript framework for scripting tour popups.
  • Paymill – We take payments using Paymill.
  • Google Maps – The map tool renders a Google map in the results area.
  • Testing Framework – We do automated testing of the full stack using a testing framework that I wrote myself (along the lines of QUnit). So much of QueryTree’s functionality is spread across the UI, the middle tier and the SQL that gets generated, or lives in how those layers interact, that testing any one layer in isolation didn’t add enough value. I could have written thousands of tests for the UI alone or for the Python middle tier alone, but they would not have given me any comfort about deploying a new version to live. So I built a framework which drives the application from the UI and exercises the whole stack; there’s a rough sketch of the idea after this list. If those tests pass, I can have some confidence that the whole system is basically functional.
  • Bash – Whenever we want to deploy to live we type ./deploy.sh.
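
For illustration only – the real tests drive the browser UI rather than hitting the HTTP API directly, and the URL and payload below are invented – a stripped-down full-stack check has roughly this shape:

import json
import urllib2

BASE_URL = "http://localhost:8080"   # a locally running instance (assumed)

def post_json(path, payload):
    """POST a JSON body and decode the JSON response."""
    request = urllib2.Request(
        BASE_URL + path,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"})
    return json.loads(urllib2.urlopen(request).read())

def test_filter_tool_round_trip():
    """Send a tiny worksheet through the stack and check the rows that come back."""
    worksheet = {
        "tools": [
            {"type": "data", "rows": [{"amount": 50}, {"amount": 150}]},
            {"type": "filter", "column": "amount", "operator": ">", "value": 100},
        ]
    }
    result = post_json("/api/worksheet/run", worksheet)
    assert len(result["rows"]) == 1, "expected one row to survive the filter"

if __name__ == "__main__":
    test_filter_tool_round_trip()
    print "OK"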

The Future

Aside from adding more tools and more options on the existing tools, there are three areas of the overall architecture that I’d like to improve:

  • Performance
    There’s a lot we can do to improve the performance of queries. We can automatically index tables (especially tables from uploaded files, because that data isn’t changing) based on the columns that users’ queries scan most often. We can also tune the queries that we generate; so far we’ve just made them work and haven’t really thought about performance at all.
  • Interfaces
    We have the web data tool for pulling data in from URLs but for non-technical users I’d like to add tools that fetch data from specific services such as Twitter, ScraperWiki or Nimble.com. These will be built on the same underlying tools as the web data tool but will be hard coded around the particular structure of the API in question. This will make it easier for users to pull meaningful data into their worksheets from third party sources.
  • Tool SDK
    Longer term I’d like to give people the ability to build their own tools and load them into QueryTree. The system is quite modular, so coding up a new tool really just involves some JavaScript to define the tool at the client side, an HTML options dialog and a Python object to define how the back end turns the settings into a SQL query (roughly the shape of the sketch below). The only real challenge is productising it all: locking down the code that outsiders can upload and building the necessary development tools.
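
To make that concrete, the back-end half of a hypothetical third-party tool might be little more than a small class like this (the class, its settings and the to_sql convention are all invented for illustration):

class RoundTool(object):
    """Hypothetical plug-in tool: rounds a numeric column to N decimal places."""

    # Settings supplied by the tool's HTML options dialog
    def __init__(self, column, decimal_places=0):
        self.column = column
        self.decimal_places = decimal_places

    def to_sql(self, upstream_sql):
        """Wrap the SQL produced by the tool immediately upstream in the chain."""
        return (
            "SELECT *, ROUND(`{0}`, {1}) AS `{0}_rounded` "
            "FROM ({2}) AS upstream"
        ).format(self.column, int(self.decimal_places), upstream_sql)

# The back end would call to_sql() with whatever the previous tool produced:
tool = RoundTool(column="amount", decimal_places=2)
print tool.to_sql("SELECT * FROM `upload_1234`")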

Looking back over this list, it struck me just how much I’m standing on the shoulders of giants here. QueryTree is a thin layer on top of a lot of third party components, and my job has been simply to glue them all together in a way that is hopefully useful to others.

 



Exporting all Sheets on a Spreadsheet to a Single CSV

If you have a spreadsheet with multiple sheets/tabs containing similar tables of data, and you want to export the whole lot to a single CSV, then this VBA macro should help:

Sub ExportAllSheetsToSingleCSV()
    'The file to write to
    Dim outputFile As String
    outputFile = "C:\Users\dan\output.csv"

    Dim f As Integer
    f = FreeFile()

    Dim headerLine As String
    Open outputFile For Output As f

    For Each Sheet In Worksheets
        'Only walk the rows that are actually in use on this sheet
        For Each Row In Sheet.UsedRange.Rows
            Dim line As String
            line = ""
            Dim sep As String
            sep = ""
            Dim lineIsEmpty As Boolean
            lineIsEmpty = True

            'Work through all cells on this row
            For Each cell In Row.Cells
                If cell.Value <> "" Then
                    line = line & sep & cell.Value
                    sep = ","
                    lineIsEmpty = False
                End If
            Next

            'Did we find anything?
            If lineIsEmpty = False Then
                'Don't write the header line out multiple times
                If headerLine <> line Then
                    Print #f, line
                End If

                'Set the header line to the first non-empty line we find
                If headerLine = "" Then
                    headerLine = line
                End If
            End If
        Next
    Next
    Close #f
End Sub


Parsing Large CSV Blobs on Google App Engine

When parsing a blob on Google App Engine using the Python CSV library, the simplest approach is to pass the BlobReader straight into the CSV reader. However, unlike when opening a normal file, there is no option to handle universal newline characters. In order to handle all the different kinds of newline characters, the string’s splitlines method can be used. However, doing that without loading the entire file into memory can be tricky. Google recommends blobs should be read 1MB at a time, so ideally, you could load 1MB into a buffer, split the lines and then feed the CSV reader one line at a time. That’s what this class does:

class BlobIterator:
    """Because the python csv module doesn't like strange newline chars and
    the google blob reader cannot be told to open in universal mode, then
    we need to read blocks of the blob and 'fix' the newlines as we go"""

    def __init__(self, blob_reader):
        self.blob_reader = blob_reader
        self.last_line = ""
        self.line_num = 0
        self.lines = []
        self.buffer = None

    def __iter__(self):
        return self

    def next(self):
        if not self.buffer or len(self.lines) == self.line_num + 1:
            self.buffer = self.blob_reader.read(1048576)  # 1MB buffer
            self.lines = self.buffer.splitlines()
            self.line_num = 0

            # Handle special case where our block just happens to end on a new line
            if self.buffer[-1:] == "\n" or self.buffer[-1:] == "\r":
                self.lines.append("")

        if not self.buffer:
            raise StopIteration

        if self.line_num == 0 and len(self.last_line) > 0:
            result = self.last_line + self.lines[self.line_num] + "\n"
        else:
            result = self.lines[self.line_num] + "\n"

        self.last_line = self.lines[self.line_num + 1]
        self.line_num += 1

        return result

Having defined this class, you can call it like this:

    blob_reader = blobstore.BlobReader(blob_key)
    blob_iterator = BlobIterator(blob_reader)
    reader = csv.reader(blob_iterator)

The BlobIterator supports the __iter__ convention but behind the scenes loads 1MB of the blob into memory, splits the lines and then keeps track of the last partial line so it can combine it with the first partial line of the next 1MB block.
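
From there, reading the rows is just the usual csv loop, continuing from the snippet above:

    row_count = 0
    for row in reader:
        # each row is a plain list of the column values for one line of the blob
        row_count += 1
    print "Parsed %d rows" % row_count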

UPDATE: 2014-08-14: Many thanks to Javier Carrascal for his help in spotting an issue with the first version of the BlobIterator. The code above has been updated with a fix. His post explains the process he went through.



BigData: Size isn’t Important, it’s What You Do With it That Counts

Let’s say you run a medium-sized business. You employ 1000 people and have 5000 customers. Heard about “BigData”? Think you should probably be doing some of that too? Well, don’t start building that Hadoop cluster just yet, because you probably don’t need it…

To illustrate, let’s imagine all 1000 of your employees spend 40 hours per week typing in data at 30 words per minute. Some back-of-an-envelope calculation says that your 1000 employees will be generating 67 GB per year[1]. Let’s allow for inefficient file formats and double it to 134 GB. With that amount of data we’re not even troubling the instance of MySQL that I’ve installed locally on my laptop.

OK, so maybe you’re collecting data from your customers and you think that means you must need to “do BigData”. Let’s add all 5000 of your customers into the equation. Assuming your customers all join in with the data generation effort and do nothing but generate data for you, you now have 402 GB per year. We’ll double it again and say that you have 804 GB of data per year. Do we need that BigData infrastructure yet?!
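
For what it’s worth, those figures are just the footnote[1] numbers multiplied out, with the same per-person rate applied to all 6000 people in the second case:

BYTES_PER_PERSON_PER_YEAR = 40 * 200 * 60 * 30 * 5
# 40 hours x 200 working days x 60 minutes x 30 words/minute x 5 bytes/word,
# as per footnote [1]

employees = 1000 * BYTES_PER_PERSON_PER_YEAR
everyone = 6000 * BYTES_PER_PERSON_PER_YEAR   # employees plus customers

print employees / 1024.0 ** 3   # ~67 GB per year
print everyone / 1024.0 ** 3    # ~402 GB per year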

Well, no, not really. At 804 GB of data growth per year, your developers and DBAs are going to have to think seriously about storage, archiving and query performance, but we’re well within the sort of data volume that a traditional relational database can cope with. It isn’t time to break out the map-reduce libraries just yet.

BigData is cool right now. Some would say overhyped. And developers are always pushing to work with the latest cool gadgets. But the vast majority of businesses out there are not even close to genuinely needing all these high end, high performance data crunching platforms, and the headaches that come with them.

There are some exceptions of course:

  • You have literally millions of customers
    If you’re Tesco and you have millions of people each buying hundreds of items every day, then you really do have a lot of data.
  • Modelling or experimental data
    Are you running experiments in a wind tunnel, or running computer models of physical phenomena? If so then, yes, BigData tools are for you.
  • Crunching of third party data-sets
    Did you buy a 300 TB dataset off someone? If so then you’re going to need some serious infrastructure to handle it.

But for the rest of us: the focus should not be on building an impressive 3000 node cluster on EC2, it should be on what you might call “Data Driven Business”. It’s A/B testing, it’s profiling of real customer behaviour, it’s making decisions based on scientifically run experiments rather than anecdotal evidence. If you’re doing that then you’re on the right track. The datasets might not be terabytes in size, but that doesn’t matter. Size isn’t important, it’s what you do with it that counts.

 

[1] 1000 employees * 40 hours per week * 200 working days a year * 60 minutes per hour * 30 words per minute * 5 bytes per average word



Free Speech, the Press and #Leveson

It’s supposed to work like this:

  • You can say or write anything, no matter who it hurts
  • If what you say is wrong, someone else is free to say so and expose you
  • If you’re exposed as being wrong, people stop listening to you

This is the basic idea of free speech. Many people have tried in the past to prevent people saying things that are not “true”; this usually fails and almost always involves the persecution of innocent people.

If the Roman Catholic Church could not keep a lid on the idea that the Earth is not the centre of the universe, how much harder will it be, in this internet age, to prevent people from getting a message out there?

But it occurs to me that much of what’s wrong with the press is not about truth. Almost everyone agrees that what’s written in the tabloid press is not true, or at least rarely true. Yet people buy them anyway. Why? Because they’re entertaining.

Say what you like about the morals of people who buy tabloids and fund this system. Or the people who willingly offer themselves up to this soap opera as celebrities for celebrity’s sake; we have to assume there’s nothing we can do to stop them. Instead we should focus on how to prevent people from being harassed by the press when they don’t want to take part in this ecosystem. How do we protect the actor, musician or victim of horrific crimes from becoming the raw material on which tabloid fiction is based?

If Leveson simply recommends regulation to prevent the publication of things that aren’t true then, for me, that will be a failure. The ability to decide on matters of truth is above the pay grade of any public official or regulator, and it will do nothing to stop the press harassing people in the pursuit of that “truth”.

A cleverer system would be one in which people who’ve been the subject of a press story can invoice the papers for the use of their image and identity. A formula would be needed which factored in column inches used and newspaper circulation, with a premium applied if approval was not granted and a rebate if you’re later found guilty of a crime that the press were accusing you of. An ombudsman could enforce the collection of these image usage fees and rebates.

With such a system in place, if the press really, really want to talk about you in a story, they can. They just have to pay you every time they mention your name or include a picture of you. If, for example, a paper is full of stories about the parents of a murdered child who do not want to be in the press, that paper would simply be unprofitable. There would be no point printing that paper. Whereas, if a celebrity glamour model wants to get coverage, she could choose to give consent and thus reduce the paper’s costs as an incentive. If the press uncover corruption or criminal activity they will have to pay you initially, but will get the money back if you are later convicted.

Put simply, let’s make it unprofitable to fill newspapers with stories about people who haven’t done anything wrong and don’t want to be covered.



Vim Tips

To search for the word under the cursor, type “*”.

To search for some text you’ve already yanked, type “/”, then “Ctrl+R”, then “0”.

To search for some text you’ve already yanked into register “a”, type “/”, then “Ctrl+R”, then “a”.

To search for some text you’ve copied into the clipboard, type “/”, then “Ctrl+R”, then “+”

Continue reading Vim Tips

Handling Method Parameter Default Values Using Moq

We at Contigo often use a pattern where methods with multiple nullable parameters retrieve collections of entities from a data store a bit like this:

IEnumerable GetThingsFromDatabase(int? intFilter = null, bool? boolFilter = null, string stringFilter = null)
{
...
}

These methods treat a null value as “no filtering required” so they can be called with only those filters that are relevant to the calling code. In this

Continue reading Handling Method Parameter Default Values Using Moq

What Are You Optimising For?

Very few software designs/architectures are completely and objectively bad. Most designs optimise for something. Here’s a [probably incomplete] list of things that you could optimise for in software architecture:

  • Developer Time Now
    When we say we’re doing something “quick and dirty”, or that we are taking out “Technical Debt”, then we are optimising for developer time now.
  • Developer Time Later
    This is what people traditionally mean by “good” code. A

Continue reading What Are You Optimising For?

Follow People on Hacker News with hn_comment_follow

I often find the comments on Hacker News are fantastic and there are certain users whose opinion I always value. To that end I’ve created a Python script to help follow what my favourite people are saying on Hacker News: hn_comment_follow. It’s on GitHub in case others would like to fork it.

To call it, invoke the script like this:   
    python hn_comment_follow.py pg patio11 d4nt

Continue reading Follow People on Hacker News with hn_comment_follow

Five Software Architect Antipatterns

I believe that all software design comes down to trade-offs, and the only way for software architects to get to the right decisions on those trade-offs is for them to have a very broad and senior role within their organisation. I’ve observed some people in the industry who are called architects but, for one reason or another, are not in a position to design software well. Here are five types

Continue reading Five Software Architect Antipatterns