How We Work On Big Data Matters

Earlier posts on Big Data and Analytics for HR produced some interesting responses, among them a number of vendors who have put together pre-packaged offerings specifically for HR. These may or may not be a place to start for HR departments with the budget to afford dedicated off-the-shelf solutions; I'm just not sure. What's clear is that you can find quite an array of consultants and products all presenting themselves as the best solution, but if you don't know what you're trying to achieve, watch out.

In an effort to gain broader clarity on my own thinking, I went back and read more thoroughly a book on prediction that was popular about 18 months ago: The Signal and the Noise by Nate Silver. Its popularity grew considerably when he used his algorithms to call all 50 states correctly in the 2012 US Presidential election. Here is a man who understands how to pull solid information from Big Data… perhaps.


If there is a central lesson, one I agree with, it is that we often misuse or misconstrue statistics. He does a fairly good job of distinguishing between the types of statistics and, in particular, focuses on Bayes' Theorem approaches, describing them simply and usefully, I thought, and arguing they offer a better method than 'standard' stats for trying to dig conclusions out of Big Data. I think it's an achievement to make this as simple and readable as he does. A longer summary with history appears in Wikipedia and, of course, in the online reviews, where, as usual, I focus on the '3-star' reviews because they tend to give both pros and cons. (The 5-star reviews often come from people bowled over at first look, and the 1-star from sour types who don't like much of anything. 3-star writers, on the other hand, often know the subject well enough to offer logical evaluations and tend to spend more time justifying both good and bad points.)
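For readers who want to see what a Bayesian update actually looks like, here is a minimal sketch in Python. The scenario and every number in it are hypothetical, invented purely for illustration: suppose an employee trips some "flight risk" signal on an engagement survey, and we want the probability they actually leave within a year.

```python
# Hypothetical illustration of a Bayes' Theorem update.
# All numbers below are assumptions made up for the example.

prior_leave = 0.10           # P(leave): assumed base attrition rate
p_signal_given_leave = 0.60  # P(signal | leave): assumed sensitivity
p_signal_given_stay = 0.20   # P(signal | stay): assumed false-positive rate

# P(signal) via the law of total probability
p_signal = (p_signal_given_leave * prior_leave
            + p_signal_given_stay * (1 - prior_leave))

# Bayes' Theorem: P(leave | signal)
posterior = p_signal_given_leave * prior_leave / p_signal
print(round(posterior, 2))  # 0.25
```

Note the moral: even a signal three times more common among leavers than stayers only lifts a 10% prior to a 25% posterior, which is exactly the kind of tempered, probabilistic conclusion Silver advocates.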

Among comments that stand out from the book: "Companies that really 'get' Big Data, like Google, aren't spending a lot of time in model land. They are running thousands of experiments every year and testing their ideas on real customers." As usual, we need to take such flat statements with a grain of salt. Earlier he points out that you need to temper the use of pure numbers with theories about what makes the results happen, so you test hypotheses you think make sense rather than just crunching numbers at random. Moreover, 'thousands' of experiments is most certainly an exaggeration even for Google, though they recently announced they are conducting a 100-year longitudinal study of work-life balance. Not many of us have the resources, nor think we have the time, for such undertakings, though we can hope lessons learned there will eventually have wider application.

Often the 'experiments' referred to simply involve looking at different slices of a pile of data that has been assembled on a wide variety of topics in an organization, finding connections, and then seeing whether those connections hold up over time. The old idea of the data warehouse led in that direction: if you could put every piece of data you could collect from anywhere into one big database, then you could extract and compare just about any two (or two dozen) parameters to look for connections. However, as Silver argues, many of the apparent connections will not bear out when you look further to see whether they continue to hold. His point: a great many apparent correlations turn out not to be real; another few samples will show the correlation breaks down, and that's even before suggesting there are causes linking the factors. For instance, I recently read about (and promptly forgot) a company that found their best programmer recruits all happened to know how to draw Manga comics, and so made that a criterion for hiring. How big a sample was that, do we think?
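Silver's warning is easy to demonstrate. The sketch below, using nothing but made-up random noise, searches 20 meaningless "HR metrics" for 8 "employees" and finds a pair that looks strongly correlated in-sample; a fresh sample of the same two metrics shows the pattern was an artifact of searching many pairs in a tiny dataset.

```python
# Demonstration (with purely random, assumed data) that searching enough
# column pairs in a small sample will surface spurious correlations.
import random

random.seed(42)

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# 20 columns of pure noise, only 8 "employees" each (a tiny sample)
cols = [[random.random() for _ in range(8)] for _ in range(20)]

# Pick the pair of columns that looks most correlated in-sample
i, j = max(((a, b) for a in range(20) for b in range(a + 1, 20)),
           key=lambda p: abs(corr(cols[p[0]], cols[p[1]])))
print("in-sample |r|:", round(abs(corr(cols[i], cols[j])), 2))

# "Collect more data": fresh noise for the same two metrics
fresh_a = [random.random() for _ in range(8)]
fresh_b = [random.random() for _ in range(8)]
print("fresh-sample |r|:", round(abs(corr(fresh_a, fresh_b)), 2))
```

The in-sample correlation looks impressive only because we tried 190 pairs; the Manga-drawing hiring criterion above is the same mistake with a straight face.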

He notes the best pattern-recognition computer by far is the human brain, which he reports scientists now believe handles about 3 terabytes of data. Massive as that is, and terrific as our power is to apply much of it flexibly at once to recognize possible patterns, it is only about the amount of data being recorded worldwide each day now. Seeing patterns can seem easy, but the brain samples only a tiny fraction of the material we could potentially examine.

There's no doubt the 3-star reviewers are right to question Silver's conviction that Bayes' Theorem is completely superior to other sorts of statistics, but the book does suggest some very important ways to think about uses of Big Data. Some reviewers suggested books they like better, which could be useful as well (see the reviews mentioned above). Like them, I found some parts of the book likely just plain wrong, though generally interesting.

What's definitely of most use is his emphasis on learning to think in terms of the probabilities that something will or won't happen, or that various alternatives will, rather than trying to come up with ironclad predictions. His advice: bet the odds, bet the most probable outcomes. This definitely works in HR, where we deal with the most unpredictable commodity of all: people. It's not whether your employees will like a particular benefit, but what percentage will value it highly, moderately, or not at all, and who will find it downright insulting (paternalistic, intrusive, or whatever).
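The shift from a yes/no prediction to a distribution of outcomes can be sketched in a few lines. The survey responses below are entirely made-up sample data; the point is only the shape of the output, a percentage breakdown rather than a single verdict.

```python
# Hypothetical sketch of "thinking in probabilities": summarize benefit-survey
# responses as a distribution of outcomes instead of a yes/no prediction.
# The responses are invented sample data (100 employees).
from collections import Counter

responses = (["high"] * 34 + ["moderate"] * 41
             + ["low"] * 19 + ["insulting"] * 6)

counts = Counter(responses)
total = len(responses)
distribution = {k: round(100 * v / total) for k, v in counts.items()}
print(distribution)  # {'high': 34, 'moderate': 41, 'low': 19, 'insulting': 6}
```

A report in this form answers the question the paragraph above poses: not "will they like it?" but "what percentage will value it highly, moderately, not at all, or resent it?"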

All this reinforces the idea that the best approach to Big Data is to think about a number of questions you hope it might answer and start trying. Pull slices of data and see whether potentially useful relationships seem to exist, but don't assume the first crack at it will answer exactly what you hoped, or even be terribly accurate. Continue to look, test, question, refine. Don't regard it as a one-time system implementation that will settle the issue once and for all. Big Data offers tools to do many things, and those tools will evolve as you try, test, and learn, and as business needs evolve in response.