BigData: Size Isn’t Important, It’s What You Do With It That Counts

Let’s say you run a medium-sized business. You employ 1000 people and have 5000 customers. Heard about “BigData”? Think you should probably be doing some of that too? Well, don’t start building that Hadoop cluster just yet, because you probably don’t need it…

To illustrate, let’s imagine all 1000 of your employees spend 40 hours per week typing in data at 30 words per minute. A back-of-the-envelope calculation says that your 1000 employees will be generating about 18 GB per year[1]. Let’s allow for inefficient file formats and double it to 36 GB. With that amount of data we’re not even troubling the instance of MySQL that I’ve installed locally on my laptop.
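If you want to poke at these assumptions yourself, here’s the same back-of-the-envelope sum as a few lines of Python. It’s just a sketch: the constants are the figures assumed in this post, nothing more, and the 6000 is the employees-plus-customers total used in the next paragraph.

    # Back-of-the-envelope estimate of how much data people can type in a year.
    # All constants are this post's assumptions, not measurements.
    HOURS_PER_WEEK = 40      # time each person spends typing
    WEEKS_PER_YEAR = 50      # working weeks in a year
    WORDS_PER_MINUTE = 30    # typing speed
    BYTES_PER_WORD = 5       # average word length in bytes

    def bytes_typed_per_year(people, format_overhead=2):
        # Doubled by default to allow for inefficient file formats.
        minutes = people * HOURS_PER_WEEK * WEEKS_PER_YEAR * 60
        return minutes * WORDS_PER_MINUTE * BYTES_PER_WORD * format_overhead

    GB = 10 ** 9
    print(bytes_typed_per_year(1000) / GB)   # 1000 employees -> 36 GB
    print(bytes_typed_per_year(6000) / GB)   # plus 5000 customers -> 216 GB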

OK, so maybe you’re collecting data from your customers too, and you think that means you need to “do BigData”. Let’s add all 5000 of your customers into the equation. Assuming every customer joins in with the data generation effort and does nothing but generate data for you, you now have 108 GB per year. We’ll double it again and say that you have 216 GB of data per year. Do we need that BigData infrastructure yet?!

Well, no, not really. At 216 GB of data growth per year, your developers and DBAs are going to have to think seriously about storage, archiving and query performance, but we’re well within the sort of data volume that a traditional relational database can cope with. It isn’t time to break out the MapReduce libraries just yet.

BigData is cool right now. Some would say overhyped. And developers are always pushing to work with the latest cool gadgets. But the vast majority of businesses out there are nowhere near genuinely needing these high-end, high-performance data-crunching platforms, or the headaches that come with them.

There are some exceptions of course:

  • You have literally millions of customers
    If you’re Tesco and you have millions of people each buying hundreds of items every day, then you really do have a lot of data.
  • Modelling or experimental data
    Are you running experiments in a wind tunnel, or computer models of physical phenomena? If so, then yes, BigData tools are for you.
  • Crunching third-party datasets
    Did you buy a 300 TB dataset off someone? If so, then you’re going to need some serious infrastructure to handle it.

But for the rest of us: the focus should not be on building an impressive 3000-node cluster on EC2; it should be on what you might call “Data-Driven Business”. It’s A/B testing, it’s profiling real customer behaviour, it’s making decisions based on scientifically run experiments rather than anecdotal evidence. If you’re doing that, then you’re on the right track. The datasets might not be terabytes in size, but that doesn’t matter. Size isn’t important, it’s what you do with it that counts.


[1] 1000 employees * 40 hours per week * 50 working weeks a year * 60 minutes per hour * 30 words per minute * 5 bytes per average word = 18 GB


I’d love to meet you on Twitter.