In part one we established that a data scientist wants to find patterns, and a pattern is just something that repeats. So how do data scientists go about finding these intelligible forms or sequences?
The data scientist’s tool set is diverse, including statistics, math, Natural Language Processing (NLP), complex systems, and more. Most of these tools fall under a branch of computer science known as machine learning.
Unsurprisingly, machine learning is defined as “the recognition of patterns and regularities in data”. The most commonly used techniques in machine learning can be grouped into a few broad types, which fall under two categories: ‘supervised’ and ‘unsupervised’.
Supervised techniques require training data from which a pattern can be learned, like pictures of cats provided to a neural network so that it can learn what a cat looks like.
Common examples include:
Regression, the prediction of a continuous variable, such as sales volume.
Classification, the prediction of a categorical variable, such as ‘male’ vs ‘female’ or ‘positive’ vs ‘negative’.
Unsupervised techniques will automatically identify patterns without any prior training, such as a k-means algorithm finding natural groupings within a data set.
Common examples include:
Clustering, the identification of natural groups and clusters within a data set, like different types of consumer.
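To make the clustering idea concrete, here is a minimal k-means sketch in one dimension. The “annual spend” figures are invented for illustration; they are not from any real data set.

```python
# A minimal 1-D k-means sketch on made-up "annual spend" data.
def kmeans_1d(points, k=2, iterations=10):
    # Start with the first k points as the initial cluster centres.
    centres = points[:k]
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of the points assigned to it.
        centres = [sum(c) / len(c) for c in clusters]
    return sorted(centres)

spend = [120, 130, 125, 900, 950, 910]   # two natural groups of consumers
print(kmeans_1d(spend, k=2))
```

The algorithm alternates between assigning points to their nearest centre and moving each centre to the middle of its assigned points, settling on the two natural groups without ever being told where they are.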
‘Optimisation’ is more loosely defined, and usually encompasses more ad-hoc problems, like optimising for some KPI. A classic example of this is the travelling salesman problem. If a salesman has to visit ten locations around the country, which path will provide the shortest route?
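For a small number of locations the travelling salesman problem can simply be brute-forced. The coordinates below are hypothetical, purely to show the shape of the problem:

```python
from itertools import permutations

# Hypothetical coordinates for four locations (invented for illustration).
locations = {"A": (0, 0), "B": (3, 4), "C": (6, 0), "D": (3, 1)}

def route_length(route):
    # Sum of straight-line distances between consecutive stops.
    total = 0.0
    for a, b in zip(route, route[1:]):
        (x1, y1), (x2, y2) = locations[a], locations[b]
        total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total

# Fix the start point and try every ordering of the remaining stops.
best = min((("A",) + p for p in permutations("BCD")), key=route_length)
print(best, round(route_length(best), 2))
```

Trying every ordering works for a handful of stops, but the number of routes grows factorially, which is exactly why real optimisation problems need cleverer techniques.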
Fortunately, the core principles of most machine learning problems are actually fairly similar.
Let’s say we want to predict how much a film is going to make in its opening weekend based on how many cinemas are planning to show the film. We also have a historical training data set containing a list of films, how much they each made at the box office and how many cinemas each film was shown in around the country.
If we plot that data on a graph it might look something like this:
As you can see, it looks like there is a relationship (i.e. pattern) between screen count and box office. The more cinemas (and hence screens) a film is shown on, the more money the film is likely to make.
But we want to be able to make a prediction, so we need to find a way to quantify that pattern. Based on the graph, it looks like a straight line would be a reasonable place to start. Like so:
If you cast your mind back to high school maths, you might remember that the equation for a straight line looks like this: y = mx + c, where m is the gradient of the line and c is the point where it crosses the y-axis.
I won’t go into the details, but essentially the fitting process adds up the lengths of all the green arrows in the graph below, then spits out the coefficients which will result in the shortest average length of green arrows.
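The fitting step above can be sketched in a few lines using the closed-form least-squares formulas. The screen counts and box-office figures here are invented for illustration, not real film data:

```python
# Fitting a straight line (y = m*x + c) by ordinary least squares.
# All the numbers below are made up for illustration.
screens    = [100, 250, 400, 550, 700]   # x: screens showing the film
box_office = [0.9, 2.1, 3.2, 4.4, 5.6]   # y: opening weekend takings, $m

n = len(screens)
mean_x = sum(screens) / n
mean_y = sum(box_office) / n

# Closed-form least-squares coefficients: these minimise the (squared)
# lengths of the "green arrows" between the line and the data points.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(screens, box_office)) \
    / sum((x - mean_x) ** 2 for x in screens)
c = mean_y - m * mean_x

print(f"box_office ~ {m:.4f} * screens + {c:.2f}")
print(f"prediction for 500 screens: {m * 500 + c:.2f}")
```

Once the two coefficients have been found, predicting the opening weekend for a new film is just a matter of plugging its screen count into the equation.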
Most machine learning techniques can be boiled down to ‘fitting’ some sort of shape to a data set and then finding the coefficients which will make that shape look as much like the data as possible.
In the above example we were fitting a straight line to the data set. In a classification problem, such as Twitter sentiment analysis, we would be fitting a multidimensional sheet to the data set. That sounds complicated, but if you imagine the tweets are floating in a cloud in front of you, where their location in the cloud is related to the content of the tweet, then fitting a multidimensional sheet simply means sticking a sheet of paper through the cloud so that all the positive tweets are on one side of the paper and all the negative tweets are on the other.
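The “sheet of paper” idea can be sketched in two dimensions, where the sheet is just a line and a tweet is labelled by which side of the line it falls on. The coordinates and the separating line here are made up for illustration:

```python
# A 2-D sketch of a linear classifier: the "sheet of paper" is a line,
# and a point's label depends on which side of the line it falls.
# The line's coefficients and the points are invented for illustration.
w = (1.0, 1.0)   # normal vector of the separating line
b = -5.0         # offset of the line from the origin

def sentiment(point):
    # Points with a positive score sit on one side of the line,
    # points with a negative score sit on the other.
    score = w[0] * point[0] + w[1] * point[1] + b
    return "positive" if score > 0 else "negative"

print(sentiment((4, 3)))
print(sentiment((1, 1)))
```

In a real sentiment model the points would live in many more dimensions and the coefficients of the sheet would be learned from labelled training tweets, but the side-of-the-sheet decision is the same.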
Based on what I’ve said so far, you might now be thinking that data scientists spend most of their time fitting models and calculating coefficients. In an ideal world you’d be right, but in reality, problems tend to be messier than that.
Data science requires data. The clue is in the name. With poor-quality data, even the fanciest algorithms in the world are useless. To apply machine learning, data has to be clean, sufficiently detailed, with good coverage, and in the right format, which is rarely the case. Therefore, a data scientist actually spends at least half of their time on data manipulation.
That means acquiring the data, possibly from an API or a web scraper; making sure the data makes sense and isn’t filled with holes; and transforming it into a format which the machine learning algorithms will best be able to take advantage of. This process is known as data munging.
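A tiny munging sketch shows what this looks like in practice. The field names and values are invented for illustration; real pipelines face the same holes and format problems at much larger scale:

```python
# Raw records with a hole and inconsistent formats, invented for
# illustration, being cleaned into rows an algorithm can actually use.
raw = [
    {"film": "Film A", "screens": "1,200", "box_office": "3.4"},
    {"film": "Film B", "screens": None,    "box_office": "2.1"},  # hole
    {"film": "Film C", "screens": "850",   "box_office": "1.9"},
]

clean = []
for row in raw:
    if row["screens"] is None:       # drop records with missing data
        continue
    clean.append({
        "film": row["film"],
        "screens": int(row["screens"].replace(",", "")),  # "1,200" -> 1200
        "box_office": float(row["box_office"]),
    })

print(clean)
```

Only after this kind of cleaning can the data be handed to a fitting procedure like the ones above.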
This is why a unified data platform is so important. If all your data is in one organised, easy-to-access location, a wealth of comparatively easy-to-leverage data science opportunities will suddenly open up for you.