I feel that opening with a dictionary definition of the word ‘pattern’ will lend a bit of artistic flair to this post:
- A repeated decorative design.
- A regular and intelligible form or sequence.
In part one we established that a data scientist wants to find patterns, and a pattern is just something that repeats. So how do data scientists go about finding these “intelligible form(s) or sequence(s)”?
The data scientist’s tool set is diverse, including statistics, maths, NLP, complex systems, and much more. Most of these tools fall under a branch of computer science known as machine learning. Unsurprisingly, machine learning is often defined as “the recognition of patterns and regularities in data”. The most commonly used techniques in machine learning can be chunked into a few broad types, which fall under two categories, ‘supervised’ and ‘unsupervised’.
Supervised – Supervised techniques require training data from which a pattern can be learned, like pictures of cats provided to a neural network so that it can learn what a cat looks like. Common examples include:
- Regression, the prediction of a continuous variable, such as sales volume.
- Classification, the prediction of a categorical variable, such as ‘male’ vs ‘female’ or ‘positive’ vs ‘negative’.
Unsupervised – Unsupervised techniques will automatically identify patterns without any prior training, such as a k-means algorithm finding natural groupings within a data set. Common examples include:
- Clustering, the identification of natural groups and clusters within a data set, like different types of consumer.
‘Optimisation’ is more loosely defined, and usually encompasses more ad hoc problems, like optimising for some KPI. A classic example of this is the travelling salesman problem. If a salesman has to visit ten locations around the country, which order of visits will give the shortest route?
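To make the travelling salesman problem concrete, here is a minimal Python sketch that simply tries every possible order of visits and keeps the shortest. It assumes the salesman returns to his starting point and that distances are straight lines between made-up (x, y) coordinates; the function name `shortest_round_trip` is hypothetical. Brute force like this only works for a handful of locations, which is exactly why the problem is a classic.

```python
from itertools import permutations
from math import dist  # straight-line distance, Python 3.8+


def shortest_round_trip(locations):
    """Try every visiting order and return the shortest round trip.

    The trip starts and ends at locations[0] (an assumption for this sketch).
    """
    start, *rest = locations
    best_route, best_length = None, float("inf")
    for order in permutations(rest):
        route = (start, *order, start)
        # Add up the length of each leg of the journey.
        length = sum(dist(a, b) for a, b in zip(route, route[1:]))
        if length < best_length:
            best_route, best_length = route, length
    return best_route, best_length


# Hypothetical locations: the four corners of a 3 x 4 rectangle.
stops = [(0, 0), (3, 4), (3, 0), (0, 4)]
route, length = shortest_round_trip(stops)
print(route, length)
```

With ten locations there are already 362,880 possible orders to check from a fixed starting point, and the count grows factorially from there, so real solvers rely on smarter search strategies.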
Fortunately, the core principles of most machine learning problems are actually fairly similar. A simple example should be enough to give you a feel for this. Let’s say we want to predict how much a film is going to make in its opening weekend based on how many cinemas are planning to show the film. We also have a historical training data set containing a list of films, how much each made at the box office, and how many cinemas each film was shown in around the country.
If we plot that data on a graph it might look something like this:
As you can see, it looks like there is a relationship (i.e. pattern) between screen count and box office. The more cinemas (and hence screens) a film is shown on, the more money the film is likely to make.
But we want to be able to make a prediction, so we need to find a way to quantify that pattern. Based on the graph, it looks like a straight line would be a reasonable place to start. Like so:
If you cast your mind back to high school maths, you might remember that the equation for a straight line looks like this:
y = mx + c
Changing the y intercept (c) and the gradient (m) of the line will change its position and angle on the graph. Therefore our machine learning problem is to identify what values m and c (known as coefficients) should take. In order to do this for a straight line we need to solve something called the least squares equation. A bit like magic, solving this equation will give us the coefficients for the straight line which best fits the data. I won’t bore you with the details, but essentially the equation adds up the squared lengths of all the green arrows in the graph below, then spits out the coefficients which result in the smallest total.
If you’re curious, this is what the magic least squares equation looks like:
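For a straight line, the least squares solution can be written in closed form: the gradient is the sum of (x − mean of x) × (y − mean of y) over all films, divided by the sum of (x − mean of x)², and the intercept is whatever makes the line pass through the point of averages. Here is a minimal Python sketch of that calculation. The function name `fit_line` and the screen counts and takings are made up for illustration; they are not real box office figures.

```python
def fit_line(xs, ys):
    """Return (gradient, intercept) of the least squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


# Hypothetical training data: screens per film and takings in millions.
screens = [100, 200, 300, 400, 500]
box_office = [1.2, 1.9, 3.2, 3.8, 5.1]

gradient, intercept = fit_line(screens, box_office)

# Use the fitted line to predict takings for a new film on 350 screens.
prediction = gradient * 350 + intercept
print(gradient, intercept, prediction)
```

In practice you would normally reach for a library routine rather than writing this by hand, but the handful of sums above is all that the "magic" amounts to for a single input variable.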
Remember I said ‘the core principles of most machine learning problems are actually fairly similar’? That’s because most machine learning techniques can be boiled down to ‘fitting’ some sort of shape to a data set and then finding the coefficients which will make that shape look as much like the data as possible.
In the above example we were fitting a straight line to the data set. In a classification problem, such as Twitter sentiment analysis, we would be fitting a multidimensional sheet to the data set. That sounds complicated, but if you imagine the tweets are floating in a cloud in front of you, where their location in the cloud is related to the content of the tweet, then fitting a multidimensional sheet simply means sticking a sheet of paper through the cloud so that all the positive tweets are on one side of the paper and all the negative tweets are on the other.
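One of the simplest ways to find such a sheet is the perceptron algorithm, which nudges the sheet every time a point lands on the wrong side. The sketch below is an illustration of the idea rather than how a production sentiment model is built. It assumes each tweet has already been reduced to two made-up numeric features, and the names `train_perceptron` and `predict` are hypothetical.

```python
def train_perceptron(points, labels, epochs=100):
    """Find weights and a bias so that the sign of (w . x + b) matches each label.

    Labels are +1 (positive) or -1 (negative).
    """
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for point, label in zip(points, labels):
            score = sum(w * x for w, x in zip(weights, point)) + bias
            if label * score <= 0:  # wrong side of the sheet (or exactly on it)
                # Nudge the sheet towards classifying this point correctly.
                weights = [w + label * x for w, x in zip(weights, point)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side, so stop
            break
    return weights, bias


def predict(weights, bias, point):
    """Return +1 or -1 depending on which side of the sheet the point falls."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else -1


# Hypothetical "tweet cloud": two invented features per tweet.
tweets = [(2.0, 1.0), (3.0, 2.5), (2.5, 3.0), (-1.0, -2.0), (-2.5, -1.0), (-3.0, -2.5)]
sentiment = [1, 1, 1, -1, -1, -1]

weights, bias = train_perceptron(tweets, sentiment)
print(predict(weights, bias, (1.5, 1.5)), predict(weights, bias, (-2.0, -2.0)))
```

The weights and bias play the same role here as the gradient and intercept did for the straight line: they are the coefficients that pin down where the shape sits.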
Based on what I’ve said so far, you might now be thinking that data scientists spend most of their time fitting models and calculating coefficients. In an ideal world you’d be right, but in reality, problems tend to be messier than that.
Data science requires data. The clue is in the name. With poor quality data, even the fanciest algorithms in the world are useless. To be able to apply machine learning, data has to be clean, sufficiently detailed, with good coverage, and in the right format, which is rarely the case. As a result, a data scientist often spends at least half of their time on data manipulation. That means acquiring the data, possibly from an API or a web scraper; making sure the data makes sense and isn’t filled with holes; and transforming the data into a format which the machine learning algorithms can best take advantage of. This process is affectionately known as data munging.
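As a small taste of what munging involves, here is a Python sketch that tidies a tiny, invented CSV of films. It normalises the titles, throws away rows with missing or non-numeric values, and removes duplicates. The data, the column names and the `clean_rows` function are all hypothetical, and dropping bad rows is only one possible policy; in a real project you might fill the gaps in instead.

```python
import csv
import io

# Invented raw data with the usual problems: gaps, junk values, duplicates.
RAW = """title,screens,box_office
Film A,350,4.2
Film B,,1.1
film a,350,4.2
Film C,n/a,2.7
Film D,120,
Film E,410,5.0
"""


def clean_rows(raw_text):
    """Parse CSV text, skip incomplete or malformed rows, and drop duplicates."""
    cleaned, seen = [], set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        title = row["title"].strip().title()  # "film a" becomes "Film A"
        try:
            screens = int(row["screens"])
            box_office = float(row["box_office"])
        except ValueError:
            continue  # missing or non-numeric value, so skip the row
        if title in seen:
            continue  # this film has already been recorded
        seen.add(title)
        cleaned.append({"title": title, "screens": screens, "box_office": box_office})
    return cleaned


films = clean_rows(RAW)
print(films)
```

Only two of the six rows survive, which is a fair picture of how much raw data can fall away before any modelling starts.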
This is why a unified data platform is so important. If all your data is in one organised, easy-to-access location, a wealth of comparatively easy-to-leverage data science opportunities will suddenly open up for you.