BigData: Size isn’t Important, it’s What You Do With it That Counts

Let’s say you run a medium-sized business. You employ 1000 people and have 5000 customers. Heard about “BigData”? Think you should probably be doing some of that too? Well, don’t start building that Hadoop cluster just yet, because you probably don’t need it…

To illustrate, let’s imagine all 1000 of your employees spend 40 hours per week typing in data at 30 words per minute. A back-of-the-envelope calculation says that your 1000 employees will be generating 67 GB per year[1]. Let’s allow for inefficient file formats and double it to 134 GB. With that amount of data we’re not even troubling the instance of MySQL that I’ve installed locally on my laptop.

OK, so maybe you’re collecting data from your customers and you think that means you must need to “do BigData”. Let’s add all 5000 of your customers into that equation. Assuming all your customers join in with the data generation effort and do nothing but generate data for you, you now have 402 GB per year. We’ll double it again and say that you have 804 GB of data per year. Do we need that BigData infrastructure yet?

Well, no, not really. At 804 GB data growth per year, your developers and DBAs are going to have to think seriously about storage, archiving and query performance, but we’re well within the sort of data volume that a traditional relational database can cope with. It isn’t time to break out the MapReduce libraries just yet.

BigData is cool right now. Some would say overhyped. And developers are always pushing to work with the latest cool gadgets. But the vast majority of businesses out there are not even close to genuinely needing all these high-end, high-performance data crunching platforms, and the headaches that come with them.

There are some exceptions of course:

  • You have literally millions of customers
    If you’re Tesco and you have millions of people each buying hundreds of items every day, then you really do have a lot of data.
  • Modelling or experimental data
    Are you running experiments in a wind tunnel or running computer models of physical phenomena? If so then, yes, BigData tools are for you.
  • Crunching of third party data-sets
    Did you buy a 300 TB dataset off someone? If so then you’re going to need some serious infrastructure to handle it.

But for the rest of us: the focus should not be on building an impressive 3000 node cluster on EC2, it should be on what you might call “Data Driven Business”. It’s A/B testing, it’s profiling of real customer behaviour, it’s making decisions based on scientifically run experiments rather than anecdotal evidence. If you’re doing that then you’re on the right track. The datasets might not be terabytes in size, but that doesn’t matter. Size isn’t important, it’s what you do with it that counts.

 

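[1] 1000 employees * 40 hours per week * 200 working days a year * 60 minutes per hour * 30 words per minute * 5 bytes per average word = 72,000,000,000 bytes, which is about 67 GB counting a GB as 2^30 bytes. Note that this multiplies weekly hours by working days, so it overstates the total roughly fivefold; at 8 hours per working day the figure is nearer 13 GB, which only strengthens the point.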
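
The footnote’s arithmetic is easy to check with a few lines of Python. This is only a sketch: the function name and the choice of 2^30 bytes per GB are assumptions of mine, the latter being the convention that reproduces the 67 GB figure.

```python
GIB = 2 ** 30  # assumption: a "GB" here means 2**30 bytes


def yearly_bytes(people, hours=40, days=200, minutes_per_hour=60,
                 words_per_minute=30, bytes_per_word=5):
    """Multiply out the footnote's factors exactly as written."""
    return (people * hours * days * minutes_per_hour
            * words_per_minute * bytes_per_word)


# 1000 employees on their own, then employees plus 5000 customers
employees_gb = yearly_bytes(1000) / GIB
everyone_gb = yearly_bytes(1000 + 5000) / GIB

print(f"employees only: {employees_gb:.0f} GB per year")
print(f"employees and customers: {everyone_gb:.0f} GB per year")
```

Doubling either figure for inefficient file formats still leaves you comfortably under a terabyte a year.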

Vim Tips

To search for the word under the cursor, type “*”.

To search for some text you’ve already yanked, type “/”, then “Ctrl+R”, then “0”.

To search for some text you’ve already yanked into register “a”, type “/”, then “Ctrl+R”, then “a”.

To search for some text you’ve copied into the clipboard, type “/”, then “Ctrl+R”, then “+”.

Handling Method Parameter Default Values Using Moq

We at Contigo often use a pattern where methods with multiple nullable parameters retrieve collections of entities from a data store, a bit like this:

IEnumerable<Thing> GetThingsFromDatabase(int? intFilter = null, bool? boolFilter = null, string stringFilter = null)
{
...
}

These methods treat a null value as “no filtering required” so they can be called with only those filters that are relevant to the calling code. In this way we can keep database querying and object instantiation logic in a single place while our business logic specifies only those filters that it cares about. A typical line of business logic code might look like this:

var things = datamapper.GetThingsFromDatabase(boolFilter: true);

When it comes to unit testing our business logic code, we sometimes have an issue though. We use Moq to mock out our data mappers, and many of our tests look like this:

var mock = new Mock<IDataMapper>();
mock.Setup(mapper => mapper.GetThingsFromDatabase(It.IsAny<int?>(), true, It.IsAny<string>()));
...

While optional parameters help de-couple our business logic from the filter parameters that it doesn’t care about*, the same cannot be said for our unit tests. Adding a new parameter to our data mappers could break a lot of tests, all because the following is not allowed in C# expression trees:

var mock = new Mock<IDataMapper>();
mock.Setup(mapper => mapper.GetThingsFromDatabase(boolFilter: true));
...

As a way around this, we have implemented an extension method for Moq as follows:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;
using Moq;
using Moq.Language.Flow;

static class MoqExtensions
{
    public static ISetup<T, TResult> Setup<T, TResult>(this Mock<T> mock, string methodName, Dictionary<string, object> specifiedParameters) where T : class
    {
        // Find the methods on the mocked type that match the name of the method we're mocking
        var matchingMethods = typeof(T).GetMethods().Where(m => m.Name == methodName);
        foreach (var matchingMethod in matchingMethods)
        {
            var matchingMethodParameters = matchingMethod.GetParameters();
            // Skip this overload unless every specified parameter exists on it
            if (specifiedParameters.Keys.Any(key => matchingMethodParameters.All(p => p.Name != key)))
                continue;

            var arguments = new List<Expression>();
            // For each method parameter, use the caller's constant value if one was specified, otherwise It.IsAny<type>()
            foreach (ParameterInfo methodParam in matchingMethodParameters)
            {
                if (specifiedParameters.ContainsKey(methodParam.Name))
                    arguments.Add(Expression.Constant(specifiedParameters[methodParam.Name], methodParam.ParameterType));
                else
                    arguments.Add(Expression.Call(typeof(It), "IsAny", new Type[] { methodParam.ParameterType }, null));
            }

            // The lambda body and its parameter list must share the same ParameterExpression instance
            var mockParameter = Expression.Parameter(typeof(T), "mock");
            var methodCall = Expression.Call(mockParameter, matchingMethod, arguments);
            var expr = Expression.Lambda<Func<T, TResult>>(methodCall, mockParameter);
            return mock.Setup(expr);
        }
        // No overload with the given name accepts all of the specified parameters
        return null;
    }
}

This can be called like this:

var mock = new Mock<IDataMapper>();
mock.Setup<IDataMapper, IEnumerable<Thing>>("GetThingsFromDatabase", new Dictionary<string, object> { { "boolFilter", true } });
...

Which is about as close as I think it’s possible to get to what we’re really after.
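
The underlying trick (use reflection to list a method’s parameters, then fill every parameter the test doesn’t care about with a wildcard matcher) is not specific to C#. Here is a rough sketch of the same idea in Python using the standard library’s unittest.mock, whose ANY object plays the role of It.IsAny. The names get_things_from_database and fill_with_any are made up for the illustration, and this is not a port of the Moq extension above.

```python
import inspect
from unittest import mock


def get_things_from_database(int_filter=None, bool_filter=None, string_filter=None):
    """Hypothetical stand-in for the data mapper method."""
    raise NotImplementedError


def fill_with_any(func, **specified):
    """Build keyword arguments for func, using mock.ANY for every
    parameter the caller did not specify."""
    params = inspect.signature(func).parameters
    unknown = set(specified) - set(params)
    if unknown:
        raise TypeError(f"unknown parameters: {sorted(unknown)}")
    return {name: specified.get(name, mock.ANY) for name in params}


# Only the filter the test cares about is spelled out
expected = fill_with_any(get_things_from_database, bool_filter=True)

mapper = mock.Mock()
mapper.get_things_from_database(int_filter=None, bool_filter=True, string_filter="x")
# Passes: ANY compares equal to whatever was actually passed
mapper.get_things_from_database.assert_called_with(**expected)
```

As with the C# version, adding a new parameter to the method means the tests that use the helper don’t need touching.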

 

*Technically, the default parameter values are implemented on the caller’s side, not the callee’s side, so any code that calls our data mappers must be recompiled if we add a parameter. But as the code that calls them is within the same solution this is not an issue for us.

What Are You Optimising For?

Very few software designs/architectures are completely and objectively bad. Most designs optimise for something. Here’s a [probably incomplete] list of things that you could optimise for in software architecture:

  • Developer Time Now
    When we say we’re doing something “quick and dirty”, or that we are taking out “Technical Debt”, then we are optimising the development time now.
  • Developer Time Later
    This is what people traditionally mean by “good” code. A developer wants to write code that will be easy to work with several years from now. This is based on the perception that most of the hours a developer spends working, they are maintaining an existing system.
  • Product Iterations
    In simple terms this means building something that is highly flexible and is likely to meet a broad range of future (as yet unknown) requirements. By making the system flexible we reduce the number of iterations we are likely to need before the customer is happy with the product.
  • Testing Time
    While optimising for Product Iterations means making a flexible product, optimising for testing time is almost the exact opposite. The ideal system to test is one with a single button labelled “Do {Something}” which, when clicked, does exactly what it says it will. The more options, configuration, flexibility and programmability you introduce to a system the more time it will take to test to a reasonable level of satisfaction.
  • Backwards Compatibility
    This means designing a system to accommodate existing users who potentially don’t want to have to think about the new features, or who perhaps have lots of existing data in a structure that is not conducive to the new features you want to introduce. By optimising for backwards compatibility, we’re building something in a way we wouldn’t normally like, but that will make it easier to roll out.
  • Deployment Risk
    In these days of hosted services and sandboxed apps this is becoming less of an issue. But it will be familiar to anyone who’s had to think about rolling out an upgrade to a user community where you don’t know exactly what data they have or what the environment it’s running in looks like. Depending on the context you might make users go through long-winded upgrade procedures or pass up on framework upgrades and new technologies just to be sure that your next version will not fall over when it hits real world user machines.
  • Deployment Effort
    Sometimes the solution you want to build will be a nightmare to setup in production. It may require lots of servers, or perhaps a combination of services for caching, queuing, persistence, messaging and load balancing. In some cases, you may want to spend extra development effort building automated installation tools or even take a performance hit so that rolling it out is easier.
  • Support Time
    Sometimes it’s preferable to take extra developer and testing time to introduce automated installations, lots of logging and diagnostics, remote crash reporting, user friendly failure messages with troubleshooting guides, extra documentation, lots of context sensitive sensible defaults or reduced flexibility. All so that it’s less work to support once it goes out the door.
  • End User Learning Curve
    There’s often a market for entry-level versions of apps that maybe don’t have all the power of competitor products but which are easy to pick up and use. Alternatively, you could trade development and testing time and build an easy-to-use product that has a lot of flexibility behind the scenes for the power users.
  • End User Usage Effort
    If you’ve ever used Vim, you’ll understand what I’m getting at. Vim sacrifices End User Learning Curve in order to optimise for End User Usage Effort. Sometimes developer and tester time is taken to build hundreds of special business rules that will make the system do exactly what the user needs exactly when they need it.
  • Performance
    This often means sacrificing Developer Time Now/Later in order to make the system run faster. For example, using an off the shelf ORM tool will reduce the code you need to write and maintain, but to get that extra performance people often resort to hand crafting the database queries they need.
  • Scale
    The ability to handle millions of users or terabytes of data often comes at the expense of developer time, deployment effort or single user performance. If you really need to handle that scale then you’ll have to sacrifice something else on the list to do it.
  • Budget
    If you have very deep pockets then most other problems can be made to go away. If you’re willing to build your own data centre with 10,000 servers in it and an army of people to keep your site running then you can defer having to think about writing scalable code or deployment effort. This is sometimes a logical approach. It doesn’t always work though: a baby takes 9 months no matter how many women you have.

I find that conversations between engineers about the best architecture to use are always easier and less emotive when people are clear about what they’re optimising for, why, and what they’re willing to sacrifice in the process. It’s helpful to keep asking: what are you optimising for?

 

Follow People on Hacker News with hn_comment_follow

I often find the comments on Hacker News are fantastic and there are certain users whose opinions I always value. To that end I’ve created a Python script to help follow what my favourite people are saying on Hacker News: hn_comment_follow. It’s on GitHub in case others would like to fork it.

To call it, invoke the script like this:

python hn_comment_follow.py pg patio11 d4nt

The list of users should be space separated.
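
For anyone curious how a script might take a space separated list of users like that, here is a minimal sketch. It is not the code from hn_comment_follow: the function names are hypothetical, and the threads URL is my assumption about where a user’s public comments page lives.

```python
import argparse
from urllib.parse import urlencode


def parse_users(argv):
    """Read one or more usernames from the command line arguments."""
    parser = argparse.ArgumentParser(
        description="Follow Hacker News users' comments")
    parser.add_argument("users", nargs="+",
                        help="space separated Hacker News usernames")
    return parser.parse_args(argv).users


def threads_url(user):
    # Assumption: a user's recent comments are listed at /threads?id=<user>
    return "https://news.ycombinator.com/threads?" + urlencode({"id": user})


users = parse_users(["pg", "patio11", "d4nt"])
urls = [threads_url(user) for user in users]
for url in urls:
    print(url)
```

In a real script you would pass sys.argv[1:] to parse_users rather than a hard-coded list.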

The output will look a bit like this:

2 days ago

pg

http://news.ycombinator.com/item?id=3435252

This sounds like a crazy plan for a startup, I realize, but this is the right sort of crazy. In fact, the way the Hackruiters think about Hacker School is a lot like the way we initially thought about YC: if it doesn’t make money, it will at least have been a benevolent thing to do.

patio11

http://news.ycombinator.com/item?id=3434547

Props to Microsoft, but this is actually pretty routine. (It was literally written policy at a previous employer of mine.) Manual exception handling at the warehouse is crazily expensive. It is much, much easier to write it off (as shrinkage, not charity) than to get the item back into active inventory (all the fun of chasing invoices, except the amount payable is “one XBox”, and the person doing the chasing sees their general productivity go to pot), particularly as it may have been opened. The charity suggestion removes many customer objections and ends the ongoing CS expense almost immediately.

patio11

http://news.ycombinator.com/item?id=3434967

Just staying, our literally written policy was “Offer DDD”: donate, destroy, or “dispose of” (a polite euphemism for “You keep it”) the misshipped item. I would have added the Christmas flourish if I were saying it in December, too, but the options would have been the same in July. (n.b. The business does not care what you do. We want to convey, in the politest possible way, that we both don’t want it and don’t want to talk to you about it.)

Five Software Architect Antipatterns

I believe that all software design comes down to trade-offs, and the only way for software architects to get to the right decisions on those trade-offs is for them to have a very broad and senior role within their organisation. I’ve observed some people in the industry who are called architects but, for one reason or another, are not in a position to design software well. Here are five types of software architects that are ill placed to actually design software.

1. “Well Paid Programmer” Architects

Sometimes, giving people the job title of Architect is just a way of giving them a pay rise without their job materially changing. Some architects are really just programmers who reached the top of their pay scale and didn’t want to manage people. In theory they’re supposed to be setting the technical direction but in reality they’re no more influential than an opinionated developer. Quite often they’re locked in conflict with other developers about the direction of the system being built and lack the necessary clout or managerial backing to enforce architectural decisions.

Lesson: Software requires tough choices to be taken. Once you’ve found the right person for the job, they need management backing and they need the strength of their convictions to own that role.

2. “COBOL Programmer” Architects

Some architects are programmers who got stuck on an old technology that their company doesn’t use any more, but making them an architect is a way of keeping them around in case those old systems fall over. These architects spend a lot of time getting involved in non-work related bits and bobs. You’ll find them on the fire safety committee, they’ll be one of the nominated first aiders in the office, a regular organiser of corporate charity fundraising events, they always have plenty of time to help the support team with the aforementioned COBOL system and they will always have an apt anecdote from the days of 80 column card interfaces or EBCDIC encoding. They don’t really know how to code in the languages that are used by the developers these days and they have an annoying habit of referring to IE as “Mosaic”, but they can just about cost up a new server, given a few days’ notice.

Lesson: Software Architecture is a real job that people need doing, it’s not a way to put old techies out to pasture. If you want to keep those people around then fine, but call them “Consultants” or something and hire a proper architect.

3. One Trick Architects

Some architects know one thing, like Networking Configuration or Relational Schema Design, really well but little else. Designs get written which explain in fine detail how the subnets of the servers’ network cards are going to be set up, but skip casually over whole areas of detail with statements like “A suitable object relational mapping technology (Linq2SQL/EF/NHibernate) should be used”.

Lesson: Your skills in one discipline may have been enough to get you the architect job, but the value of the architect is in their cross disciplinary skills. To do your new job well you need to start learning new skills, and fast.

4. Back Room Architects

Some architects lack the basic personal skills to be allowed out in front of customers or users and consequently come to view their role as a purely technical one. Unaware of the political and budgetary landscape their projects are operating in, their designs are masterpieces of resilient, scalable, flexible and buzzword-filled architecture which, if actually followed, would cost millions to complete. These architects are perpetually locked in battles with project managers over how much time it is reasonable to spend building each system. Look out for phrases like “one day we’ll actually build a system right from day one”, which show that they see software architecture as a binary good vs. bad issue as opposed to a series of trade-offs.

Lesson: All decisions in software design are compromises between development time, hardware cost, supportability, usability, maintainability, tooling costs, recruitment considerations, staff retention considerations, the aspirations of the people paying for it, the actual needs of the users and many other things. You can’t understand all those competing influences by sitting behind a desk running Emacs; you have to get out there and talk to people.

5. Non-Technical Architects

These architects are really just business analysts. They are great at talking to customers, they spend lots of time capturing the non-functional requirements of the system and writing documents that list those non-functional requirements for the benefit of the developers and server engineers. But as far as solutions go they haven’t a clue. They are full of handy suggestions like “we should do some performance testing” but when asked what to do with the results, they fall back on sledgehammer solutions like “buy a better server” or “get the programmers to add some caching”.

Lesson: If you’ve somehow made it into an architect role and you’re NOT a C Hacker/SQL Guru/Bash Ninja then you have a problem. The techies will soon start to figure you out and your authority over them is dependent on you reaching some minimal level of technical ability. Find a way to start adding value to what they do before they give up on you.

Ansl – A .NET Search Library

I’ve written a simple search library in C#. It’s called Ansl (A .NET Search Library) and it implements the TF-IDF algorithm for indexing text documents.

By default it stores its index in memory, but its storage engine is pluggable and it comes with an implementation of a folder storage class too, so indexes can be built up and stored in the file system for later searching.

In time I’d like to add a Vector Space Model implementation and then have a FindSimilar(string documentId) method for finding similar documents. I’m not getting much free time at the moment though so that may take a while.
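
For anyone who hasn’t met TF-IDF before, here is a minimal sketch of the idea in Python. It isn’t Ansl’s code or API: the function names, the whitespace tokeniser and the plain log(N / document frequency) weighting are all simplifying assumptions for the sake of a short example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lower-cased, whitespace separated terms are good enough here
    return text.lower().split()


def build_index(docs):
    """Map each document id to a {term: tf-idf weight} dictionary."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    # Document frequency: how many documents contain each term
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    index = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        index[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return index


def search(index, query):
    """Return ids of matching documents, best match first."""
    scores = {}
    for doc_id, weights in index.items():
        score = sum(weights.get(term, 0.0) for term in tokenize(query))
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "the quick brown fox",
    "b": "the lazy dog",
    "c": "the quick dog barks",
}
index = build_index(docs)
results = search(index, "quick fox")
print(results)
```

A term that appears in every document (like “the” here) gets a weight of zero, which is why common words don’t dominate the results.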

Fixing MacVim .gvimrc Encoding Errors

For weeks I’ve been going mad trying to figure out why most of the settings in my .gvimrc file were being ignored on MacVim.

At first I thought it must be some strange quirk of MacVim that it didn’t support the same settings as GVim (on Windows) but everything I read online suggested that wasn’t it.

I have my .gvimrc and .vim folders sync’d using Dropbox. I’ve set up symbolic links on my Mac, like this


ln -s ~/Dropbox/vim/.gvimrc ~/.gvimrc
ln -s ~/Dropbox/vim/.vim ~/.vim

And on my Windows machine, like this


cd C:\Users\me
mklink /H .gvimrc Dropbox\vim\.gvimrc
mklink /D vimfiles Dropbox\vim\.vim

This keeps my Windows and Mac Vim things in sync. But it wasn’t working on the Mac.

By chance, I decided to use the mvim script that comes with MacVim, and I noticed that when running mvim from the terminal I was seeing lots of errors like this being output


E474: Invalid argument: modelines=0^M
line   10:
E474: Invalid argument: tabstop=4^M
line   11:
E474: Invalid argument: shiftwidth=4^M
line   12:
E474: Invalid argument: softtabstop=4^M
 

This was the clue. The ^M on the end of each setting is a carriage return: the file had been saved in the DOS file format, with Windows-style \r\n line endings, because it had first been created on the Windows machine. To fix it I opened my .gvimrc file, went to the Edit menu, File Settings, and selected File Format… I chose Unix, and now both my MacVim and Windows GVim installs are working fine.
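
If you’d rather fix this sort of thing from inside Vim, :set fileformat=unix followed by :w does the same job. And if you have a pile of files to check, a few lines of Python will spot and strip the carriage returns. This is just a sketch with made-up function names, working on bytes so that nothing else in the file gets touched.

```python
def has_dos_line_endings(data: bytes) -> bool:
    """True if the content contains any Windows-style \\r\\n line endings."""
    return b"\r\n" in data


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace every \\r\\n pair with a bare \\n."""
    return data.replace(b"\r\n", b"\n")


# A couple of lines as they would look when saved in DOS format
sample = b"set modelines=0\r\nset tabstop=4\r\n"
converted = to_unix_line_endings(sample)
print(has_dos_line_endings(sample), has_dos_line_endings(converted))
```

To use it on a real file, read it with open(path, "rb"), convert, and write the result back with open(path, "wb").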

On Developers and Managers

I recently got drawn into a bit of a debate on a LinkedIn discussion group about whether a Development Manager should be spending time writing code or not. This is something I feel quite strongly about and the following is a heavily reworked version of my response.

If I asked a group of sales managers whether a head of sales should still be involved in some selling, I think most of them would say yes. Not full time of course, managing things does take time, but nobody thinks it’s strange for a head of sales to still do some sales meetings and prepare a few of the most important proposals. Same goes for Law, Medicine, Architecture, Project Management, Civil Engineering, Advertising, Fashion Design, Cooking and Teaching. But a Development Manager who writes the odd bit of code? Surely he/she needs to learn to let go!

Why is this? My theory is that it stems from a polarisation of the development and non-development IT cultures. 

Most new developers undervalue the non-coding parts of the development life cycle like analysis, testing and support. Probably like many freshly trained young professionals they view their supporting functions as (a) optional and (b) something they could do themselves if they felt so inclined. Unlike other industries though, this never really gets beaten out of them. Which leads to a strong developer sub-culture and removes incentives to participate in and ultimately master the usual political/interpersonal shenanigans that exist in any workplace. They simply choose not to participate, instead aiming to exercise influence by ignoring that world and just building what they know to be the right thing anyway. 

For their part, the rest of the IT industry often seems happy to let the developers run their own affairs without much of a challenge. It is very rare to see a non-developer attempt to argue with a developer on their own turf. Very few non-developers make any attempt to learn the basics of programming, read up on the fundamental principles or apply critical thinking to interrogate a programmer’s assertions. The rest of the industry gives developers an easy ride and it shouldn’t. Ultimately both sides are equally culpable for allowing these two separate IT cultures to exist.

I don’t believe for a second that developers are not capable of the kinds of communication, persuasion and manipulation that it takes to progress into management. Very few of the developers I’ve worked with are truly autistic or at all confused about how to operate in a political work environment; they just look down on it. Similarly, I hate the almost religious aversion that many non-developers have towards writing anything remotely like code. I’ve seen intelligent people who can design processes, think logically and make Excel do amazing things be presented with the most straightforward of SQL queries and insist point blank they aren’t cut out for coding and therefore can’t (i.e. won’t) do it. It makes me want to cry sometimes.

It’s this polarisation of cultures that leads to so few developers being asked to manage their teams and so few of those that are asked accepting the job long term. With so many non-developers managing developers it’s little wonder they seek to define the role as predominantly non-technical and purely “managerial”. 

This is a shame, because: 

  1. Many developers would benefit from some actual hands on management (they might not enjoy it at first though). They would be forced to learn. Their performance would be more accurately assessed and their design choices would be regularly challenged.
  2. Many development teams would benefit if someone with a bit of clout in the business spent some time in the trenches and saw what was happening. 
  3. Many managers would benefit from having coding skills. In fact most professional people would benefit from having some coding skills. Coding is not magic, it’s like selling or design or being able to chair a meeting really well: if you’ve got it in your armoury you’ll be more productive.

My current role combines some technical architecture and customer consulting, but even just looking at the managerial parts, I use coding to help me get everything done.

Recently I’ve: 

  • Written a Greasemonkey script to make our time keeping and reporting app easier for me to use.
  • Learnt how to use the TFS power tools command line app and PowerShell’s XML commands to analyse and report on our progress.
  • Used lots of regexes to search for things in documents.
  • Used lots of Excel functions to build planning ‘apps’ in Excel.

In conclusion, if you’re a developer I recommend you don’t opt out of departmental politics. People are only a bit less predictable/knowable than complex systems but they are much more rewarding to hack. If you’re a manager in charge of developers I think you owe it to your team to get stuck in and really manage what they’re up to. Finally, if you’re just someone who uses computers for a living, learn to code. You’re sitting in front of one of the most powerful devices humanity has ever constructed, you’ll save yourself so much time!

Copying Work Item Templates Across Projects in TFS

To copy a work item template definition from one TFS project to another, install the TFS Power Tools and then type the following commands in the Visual Studio 2010 Command Prompt:

 


witadmin exportwitd /collection:http://tfsserver:8080 /p:"SourceProject" /n:"Work Item Type Name" /f:"C:\Path\To\Local\File.xml"
witadmin importwitd /collection:http://tfsserver:8080 /p:"DestinationProject" /f:"C:\Path\To\Local\File.xml"