Everyone knows data science is important. Most people have heard stories about companies that used data science to predict the future, revolutionise their business, or even disrupt an industry. Many people know that data scientists ‘build models’, ‘write algorithms’ and ‘make predictions’. Few people know what those buzzwords actually mean, let alone why we do those things. The answer, in a nutshell, is simple: the purpose of data science is to find patterns.
Understanding patterns means understanding the world. Whether it is a mechanic fixing a car, a farmer planting crops, or a scientist making a research breakthrough, identifying a pattern is the first step towards progress.
An ancient farmer notices his crops are always more productive in areas where cattle have previously been, and so discovers fertiliser. A mechanic discovers why a car’s headlights randomly stop working by performing tests in order to establish a pattern. Do the lights always work when the car is stationary? What about when braking or accelerating, or over a certain speed? What about when the heating or radio is on? With each test the pattern gets stronger and the mechanic is one step closer to a solution. In short, pattern finding means understanding, problem solving and prediction.
As you might imagine, the human brain is fantastic at finding patterns; after all, we got by for thousands of years before data science came along. The trouble is, it’s a little too good. It is prone to false positives, which makes sense when you consider evolution. If a caveman sees a funny-shaped shadow and runs off convinced they saw a lion, they’ve not really lost anything. On the other hand, if the caveman sees a lion and passes it off as a funny-shaped shadow, it could very well end up being the last mistake they ever make. As a result, human beings evolved to see lions everywhere.
The most common pattern finding trap in day-to-day business is the simple act of eyeballing graphs for trends. We humans are great at looking at graphs and seeing trends which aren’t really there.
Take a look at these two graphs below. Which do you think shows a pattern and which is random?
Instinctively we feel like A is random thanks to the even spread of dots, and that B shows a pattern due to what appears to be a diagonal line of increased density running from the top left to the bottom right of the graph.
In actuality the opposite is true. B was generated entirely at random. The uneven, clumpy nature of the data is exactly what we should expect from a random dataset. A, on the other hand, is less random, because its generation process included steps to ensure the data was evenly spread.
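If you would like to try this yourself, here is a minimal Python sketch of the idea. It is not the code behind the graphs above: I am assuming a ‘jittered grid’ as one plausible way of forcing an even spread, and the function names, the seed and the point count of 100 are all just illustrative choices. It generates one truly random set of points and one evenly spread set, then measures the smallest gap between any two points in each.

```python
import math
import random


def uniform_points(n, rng):
    """n points drawn independently and uniformly from the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def jittered_grid_points(side, rng, jitter=0.5):
    """One point per cell of a side x side grid, nudged within its cell.

    Assumption for illustration only: this is one way to get an 'evenly
    spread' scatter, not necessarily how graph A was produced.
    """
    cell = 1.0 / side
    points = []
    for i in range(side):
        for j in range(side):
            # Cell centre plus a random offset of at most jitter/2 cells.
            x = (i + 0.5 + jitter * (rng.random() - 0.5)) * cell
            y = (j + 0.5 + jitter * (rng.random() - 0.5)) * cell
            points.append((x, y))
    return points


def smallest_gap(points):
    """Distance between the two closest points (brute force, O(n^2))."""
    return min(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
random_pts = uniform_points(100, rng)
even_pts = jittered_grid_points(10, rng)

print("smallest gap, truly random:", smallest_gap(random_pts))
print("smallest gap, evenly spread:", smallest_gap(even_pts))
```

With these settings the evenly spread points can never sit closer together than half a grid cell, whereas the truly random points will almost always include a few near-collisions. Those near-collisions are the clumps our eyes mistake for a pattern.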
But even when we aren’t seeing lions, a human being’s ‘modelling’ capability is limited. We are great at developing a powerful instinct based on past experiences. It’s pretty much the definition of an expert: someone so familiar with their art that they develop a gut feel for how to deal with any given situation, often without being able to verbalise exactly why they feel that way. What’s going on here is a bit of subconscious pattern finding. Obviously this sort of instinctive model building can be very powerful, but it is ultimately limited by the fact that it is a black box to us. We don’t know what our subconscious is modelling or why it’s coming up with the predictions that it does. That means it can’t easily be extrapolated, shared with others or applied to new situations.
I came across a great example of this recently when investigating an NHS dataset. My task was to model A&E attendances in an attempt to find predictive correlates in public data sets. One of the strongest predictors I identified was simply ambient temperature: the warmer it is, the more people go out and hurt themselves. When I mentioned this to my doctor friend¹ his eyes went wide and he told me I’d just solved a long-running mystery for him. Apparently, when he used to work in A&E, he would sometimes get a ‘feeling’ for when it was going to be a particularly bad (or good) day, and that feeling often seemed to be eerily accurate². He said it almost felt like a premonition. He’d be walking to the hospital before a shift and find he had a distinct feeling for what kind of day it was going to be. He even admitted to wondering if he was psychic after one particularly gruelling shift. In light of my analysis, this suddenly made sense. Even though he never consciously made the connection, his subconscious had put two and two together and noticed that whenever the walk to work was unusually warm, a bad day was coming. But despite the power of this instinct, he didn’t really understand it, so couldn’t properly exploit it.
Unlike my friend’s wannabe-psychic abilities, an insight derived from data science can be shared and applied to new situations. Perhaps other hospitals without their own resident psychic doctor could benefit from it? Not to mention other public services and local government.
At this point you might be wondering why I haven’t mentioned ‘big data’. For a few years everyone was very excited about big data and how data science could make sense of it. The big data hype has started to die down, but it is still an unfortunately common view that the purpose of data science is to make sense of data that is ‘big’ or unwieldy. While data science can do that, this view completely misses the point. Data science is revolutionary because it provides flexible, computer-aided pattern finding, able to bring scientific rigour and objectivity to problems of every size, from ‘human scale’ all the way up to true ‘big data’. It doesn’t matter whether you have one hundred or one billion rows.
Stay tuned for part 2: ‘How Do We Do It?’
1 I know, I know, it’s a sample of one, but I’m trying to tell a story here.
2 Now I’m resorting to anecdote. This blog post is going downhill fast! Not to worry, I promise there will be more science in part two.