QueryTree is a web based drag and drop tool for doing simple data analysis. Users drag and connect tools in a chain to first load in data, then manipulate that data (filter, group, sort etc.) and then visualise that data on a chart or map. I started building QueryTree by myself in between freelance projects about a year ago. More recently I’ve been able to devote more time to it and hire in another developer to help me. In this post I talk about some of the tools and technologies we used to build QueryTree and why.
The User Interface
- jQuery – naturally. I’m also using some jQuery UI elements such as dialog, draggable and button.
- Fabriq.js – this framework wraps the standard canvas API and gives you some extras like an object model for your shapes and SVG rendering. It does some extras like event handling but I’m not currently using those as the jQuery events were more familiar.
- Less – I’ve worked on so many projects were the CSS became a real mess and nobody dared refactor it for fear of never being able to get things looking right again. So for this project I took the decision to be a bit more organised from the start. Using less gave us a couple of weapons with which to beat back the CSS spaghetti: the nesting was most useful though, it enabled us to be specific about which elements we wanted to style, without adding lots of code.
- Handsontable – The table tool uses this plugin to display an editable table to the user. Crucially, it has support for copying and pasting in from spreadsheets and HTML tables, which is fantastic.
- Some free SVG icons
- This reset.css script
The Middle Tier
Once the UI was mostly built and demoing well to interested parties I started on the middle tier. My language of choice these days is Python and as I’d been taking advantage of Google App Engine‘s free usage allowance for hosting my static HTML and JS it was totally natural (almost accidental) to start adding bits in python. As is standard on App Engine, we’re using Jinja for templating and WebApp2 for HTTP handlers.
First off, I started working on a JSON API that the UI could use to save the current state of the worksheet and to request the data for a particular tool. From the UI side I built objects (subsets of the Knockout model objects) then just stringified them before using jQuery to POST them to a URL. At the server end I started out using json.loads and json.dumps and have never looked back. It just works and it’s simple so I never felt the need to use a framework for my JSON API.
For user settings and worksheet setup I just used the Google App Engine Data Store. I felt nervous about locking us into App Engine but again, it’s so easy to get hooked. Although, if we had to, I’m sure we could replace it with something like MongoDB, it’s basically just a framework for persisting python objects so the GQL queries and the .put() methods are the only places where you really interact with it.
When it comes to data processing, well, we cheated. Each worksheet in QueryTree is allocated to a MySQL database and any data that you upload to a worksheet is stored in a special table for that tool. When you add chains of data manipulation tools to that worksheet and connect them to a data source the python code “compiles” that into SQL and runs it on the MySQL database. So, in effect, QueryTree is just an SQL development environment. When you click on a tool, the query is run and your data returned as JSON objects over HTTP, which the UI then renders in the results area at the bottom of the screen.
The Back End
As I said above, all the data ends up in a MySQL database somewhere. We have a system for allocating worksheets to a database and can grow the pool of databases as required. We use cloud hosted MySQL but could manage our own if we wanted, the web app just needs to know how to connect to the databases. We have workers which can clear down tables that are no longer being used to free up space too.
Keeping the data in this way does place an upper limit on the amount of data that QueryTree can handle on one worksheet to however much data a single MySQL database can hold. In practice though, that data has to come over HTTP to get into QueryTree, either as a file upload or web fetch, so the databases are not the limiting factor.
In no particular order, we’re using the following additional tools:
- Paymill – We take payments using Paymill.
- Google Maps – The map tool renders a Google map in the results area.
- Testing Framework – We do automated testing of the full stack using a testing framework that I wrote myself (along the lines of qUnit). So much of the functionality of QueryTree is spread out across the UI, middle tier and the SQL that is generated, or in how these layers interact, that simply testing one layer didn’t add enough value. I could have written 1000s of tests for the UI alone or for the python middle tier alone, but they would not have given me any comfort about deploying a new version to live. So I built a framework which drives the application from the UI and which exercises the whole stack. If those tests pass, I can have some confidence in the whole system being basically functional.
- Bash – Whenever we want to deploy to live we type ./deploy.sh.
Aside from adding more tools and more options on the existing tools, there are three areas of the overall architecture that I’d like to improve:
There’s a lot we can do to improve the performance of queries. We can automatically index tables (especially tables from uploaded files because the data isn’t changing) based on the columns that users’ queries are scanning a lot. We can also tune the queries that we generate, we’ve just made it work so far, we haven’t really thought about performance yet at all.
We have the web data tool for pulling data in from URLs but for non-technical users I’d like to add tools that fetch data from specific services such as Twitter, ScraperWiki or Nimble.com. These will be built on the same underlying tools as the web data tool but will be hard coded around the particular structure of the API in question. This will make it easier for users to pull meaningful data into their worksheets from third party sources.
- Tool SDK
Looking back over this list, it struck me just how much I’m standing on the shoulders of giants here. QueryTree is a thin layer on top of the lot of third party components and my job has been simply to glue them all together in a way that is hopefully useful to others.
I'd love to meet you on twitter here.