Behind the Scenes at QueryTree: The Anatomy of a Web Based Data Tool

QueryTree is a web based drag and drop tool for doing simple data analysis. Users drag and connect tools in a chain to first load in data, then manipulate that data (filter, group, sort etc.) and then visualise that data on a chart or map. I started building QueryTree by myself in between freelance projects about a year ago. More recently I’ve been able to devote more time to it and hire in another developer to help me. In this post I talk about some of the tools and technologies we used to build QueryTree and why.

The User Interface

QueryTree’s drag and drop interface was the first part of the tool to be developed, there was a time when I would do demos just off a single static html file with some JavaScript in – which involved a certain amount of smoke and mirrors! The tool is all about the interface and I wanted to see what people’s reaction to it was. Suffice to say that people like it and that encouraged me to continue working towards an MVP.

The UI is built using a HTML page and a few of JavaScript frameworks, these are:

  • RequireJS – this handles the modularisation of all the JavaScript and makes it easy to load new scripts asynchronously. Having modules for our stuff has been great, but getting other JavaScript frameworks integrated into Require has been a real pain. I’ve used a variety of tweaks to the require shim to set up dependencies, exported symbols and in some cases just wrapping third party libraries in require modules to keep them happy.
  • jQuery – naturally. I’m also using some jQuery UI elements such as dialog, draggable and button.
  • Knockout – this lets us template the HTML markup that we want and then bind those templates to JavaScript object models. There are “purer” frameworks such as Backbone or Angular but Knockout had a great tutorial which is what first got me into it and the framework has a minimal feel which suited my incremental development process. I started with some very simple HTML, I knockout-ified it, I added some more HTML, and somewhere along the way the UI QueryTree got built. I really like it and have never found it was holding me back.
  • Fabriq.js – this framework wraps the standard canvas API and gives you some extras like an object model for your shapes and SVG rendering. It does some extras like event handling but I’m not currently using those as the jQuery events were more familiar.
  • Flot – this is a JavaScript charting library. It uses the canvas element and I’m using it primarily because its fast. One of the things about QueryTree is that the user has complete flexibility to define whatever charts they like, which means they could do something stupid like try and plot 100,000 points. There are some charting libraries that I tried which have plenty of features, but which ground to a complete halt if the user did something bad.
  • Less – I’ve worked on so many projects were the CSS became a real mess and nobody dared refactor it for fear of never being able to get things looking right again. So for this project I took the decision to be a bit more organised from the start. Using less gave us a couple of weapons with which to beat back the CSS spaghetti: the nesting was most useful though, it enabled us to be specific about which elements we wanted to style, without adding lots of code.
  • Handsontable – The table tool uses this plugin to display an editable table to the user. Crucially, it has support for copying and pasting in from spreadsheets and HTML tables, which is fantastic.
  • Some free SVG icons
  • This reset.css script

The Middle Tier

Once the UI was mostly built and demoing well to interested parties I started on the middle tier. My language of choice these days is Python and as I’d been taking advantage of Google App Engine‘s free usage allowance for hosting my static HTML and JS it was totally natural (almost accidental) to start adding bits in python. As is standard on App Engine, we’re using Jinja for templating and WebApp2 for HTTP handlers.

First off, I started working on a JSON API that the UI could use to save the current state of the worksheet and to request the data for a particular tool. From the UI side I built objects (subsets of the Knockout model objects) then just stringified them before using jQuery to POST them to a URL. At the server end I started out using json.loads and json.dumps and have never looked back. It just works and it’s simple so I never felt the need to use a framework for my JSON API.

For user settings and worksheet setup I just used the Google App Engine Data Store. I felt nervous about locking us into App Engine but again, it’s so easy to get hooked. Although, if we had to, I’m sure we could replace it with something like MongoDB, it’s basically just a framework for persisting python objects so the GQL queries and the .put() methods are the only places where you really interact with it.

When it comes to data processing, well, we cheated. Each worksheet in QueryTree is allocated to a MySQL database and any data that you upload to a worksheet is stored in a special table for that tool. When you add chains of data manipulation tools to that worksheet and connect them to a data source the python code “compiles” that into SQL and runs it on the MySQL database. So, in effect, QueryTree is just an SQL development environment. When you click on a tool, the query is run and your data returned as JSON objects over HTTP, which the UI then renders in the results area at the bottom of the screen.

The Back End

As I said above, all the data ends up in a MySQL database somewhere. We have a system for allocating worksheets to a database and can grow the pool of databases as required. We use cloud hosted MySQL but could manage our own if we wanted, the web app just needs to know how to connect to the databases. We have workers which can clear down tables that are no longer being used to free up space too.

Keeping the data in this way does place an upper limit on the amount of data that QueryTree can handle on one worksheet to however much data a single MySQL database can hold. In practice though, that data has to come over HTTP to get into QueryTree, either as a file upload or web fetch, so the databases are not the limiting factor.

Other Bits

In no particular order, we’re using the following additional tools:

  • Joyride – Our product tour is built using Joyride, a JavaScript framework for scripting tour popups
  • Paymill – We take payments using Paymill.
  • Google Maps – The map tool renders a Google map in the results area.
  • Testing Framework – We do automated testing of the full stack using a testing framework that I wrote myself (along the lines of qUnit). So much of the functionality of QueryTree is spread out across the UI, middle tier and the SQL that is generated, or in how these layers interact, that simply testing one layer didn’t add enough value. I could have written 1000s of tests for the UI alone or for the python middle tier alone, but they would not have given me any comfort about deploying a new version to live. So I built a framework which drives the application from the UI and which exercises the whole stack. If those tests pass, I can have some confidence in the whole system being basically functional.
  • Bash – Whenever we want to deploy to live we type ./deploy.sh.

The Future

Aside from adding more tools and  more options on the existing tools, there are three areas of the overall architecture that I’d like to improve:

  • Performance
    There’s a lot we can do to improve the performance of queries. We can automatically index tables (especially tables from uploaded files because the data isn’t changing) based on the columns that users’ queries are scanning a lot. We can also tune the queries that we generate, we’ve just made it work so far, we haven’t really thought about performance yet at all.
  • Interfaces
    We have the web data tool for pulling data in from URLs but for non-technical users I’d like to add tools that fetch data from specific services such as Twitter, ScraperWiki or Nimble.com. These will be built on the same underlying tools as the web data tool but will be hard coded around the particular structure of the API in question. This will make it easier for users to pull meaningful data into their worksheets from third party sources.
  • Tool SDK
    Longer term I’d like to give people the ability to build their own tools and load them into QueryTree. The system is quite modular so coding up a new tool really just involves some JavaScript to define the tool at the client side, a HTML options dialog and a python object to define how the back end turns the settings into a SQL query. The only real challenge is productising it all, locking down the code that outsiders can upload and building the necessary development tools.

Looking back over this list, it struck me just how much I’m standing on the shoulders of giants here. QueryTree is a thin layer on top of the lot of third party components and my job has been simply to glue them all together in a way that is hopefully useful to others.

 


I'd love to meet you on twitter here.