With a background in theoretical physics, I naturally have a great fondness for theoretically interesting, but not necessarily practically useful, tasks. When I first started as a data scientist this was a compulsion I worked to avoid. It is a common problem for new data scientists, especially those with an academic background adapting to a commercial environment.
We must be careful, however, not to entirely expunge the habit. Instead, where appropriate, we should leave time for such indulgences. As is so often the case, moderation is the key.
With that in mind, here is a short anecdote to illustrate a situation in which time invested in a solid theoretical understanding pays dividends. While I am at it, the story also emphasises the importance of good-quality data, which I can never stress enough!
Recently we were provided with an anonymised dataset representing the users of a typical Facebook-based game. The data included metrics such as activity levels, money spent, friend counts and, most interestingly of all, ‘share’ information.
Initially we applied our standardised suite of statistical tests to the data set. But this wasn’t a standard data set: it was both complete and networked, two rare gems in the life of a data scientist.
Naturally, we couldn’t resist running a few network structure tests on the users. As expected, we discovered the users were related to each other in what is known as a “scale free” network – a type of network topology theorised to arise spontaneously in human social groups.
Or, in graph-speak: A scale free network is a network whose degree distribution follows a power law. That is, if you were to plot frequency against the number of connections per node the resulting curve would be described by a power law equation.
Which means: the number of relationships each person has can be described by an equation of this form:

P(k) = α k^(-γ)

Here P(k) is the fraction of people who have k relationships, and α (alpha) and γ (gamma) are constants varying from network to network. Plotting such an equation gives a curve that drops steeply at first and then trails off into a long, thin tail.
A power law is the technical term for what marketers know as ‘the long tail’. In our example this means that out of the whole group of players there is a relatively small number of people, known as ‘hubs’ or ‘top influencers’, with a very large number of connections.
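As a rough illustration of how the constants alpha and gamma might be estimated, the sketch below fits a straight line to the logarithm of the counts against the logarithm of the degree. The function name fit_power_law and the synthetic data are assumptions made for this example and are not taken from the game dataset. A log-log least-squares fit is only a quick sanity check; real degree data is noisy in the tail, so a more careful estimator would normally be preferred.

```python
import math


def fit_power_law(degree_counts):
    """Estimate (alpha, gamma) for count = alpha * k ** -gamma.

    degree_counts maps a degree k to the number of nodes with that degree.
    The fit is ordinary least squares on log(count) against log(k).
    """
    points = [
        (math.log(k), math.log(n))
        for k, n in degree_counts.items()
        if k > 0 and n > 0  # logarithms need strictly positive values
    ]
    if len(points) < 2:
        raise ValueError("need at least two positive (degree, count) pairs")

    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # On log-log axes the slope is -gamma and the intercept is log(alpha).
    return math.exp(intercept), -slope


# Synthetic, noise-free data that follows a power law exactly (made up).
synthetic_counts = {k: 1000.0 * k ** -2.5 for k in range(1, 51)}
alpha, gamma = fit_power_law(synthetic_counts)
print(f"alpha ~ {alpha:.1f}, gamma ~ {gamma:.2f}")
```

On real data the points would scatter around the line rather than sit on it, and how far they scatter is itself a useful hint about whether the network really is scale free.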
The concept of ‘top influencers’ is not a new one in business and marketing. What is new, however, and should not be underestimated, is a solid theoretical footing for this knowledge. Such a footing has a knock-on effect, often vastly increasing the amount of insight that can be gained from a dataset through any number of methods, including the definition of brand new metrics, new statistical tests which would not have been useful otherwise, or even the ability to make predictions.
In our example, knowing ‘top influencers’ exist is only of limited utility. On the other hand, to know exactly how many influencers should be contacted and exactly what sort of result to expect is much more powerful.
This sort of network structure can result in behaviour that intuitively appears extreme:
In the aforementioned game we found that 26% of the total viral game growth was due to only 0.2% of the population and a full 90% of viral growth was due to 22% of the population.
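Figures like these come from ranking users by how much growth each one generated and asking what share of the total the top slice accounts for. The sketch below shows one way that calculation could look. The function name share_from_top and the list of invite counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def share_from_top(contributions, top_fraction):
    """Return the share of the total contributed by the top users.

    contributions is one number per user (for example, new players
    attributed to that user); top_fraction is between 0 and 1.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    if not contributions:
        raise ValueError("contributions must not be empty")

    ranked = sorted(contributions, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0

    # Always include at least one user, even for a very small fraction.
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / total


# Hypothetical data: new players attributed to each of ten users.
invites = [50, 20, 10, 5, 5, 4, 3, 2, 1, 0]
print(share_from_top(invites, 0.1))  # share due to the single top user
print(share_from_top(invites, 0.3))  # share due to the top three users
```

Running the same function over a range of fractions gives a cumulative curve, which makes it easy to read off where the share of growth starts to level out.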
Imagine you would like to grow an existing customer base of 1000 people. Now imagine you decide to send out messages encouraging each customer to tell all their friends about your fantastic service, and that each message costs 1 penny to send.
One option would be to spend £10, contact every single person, and perhaps grow your customer base by 100 people. Another option would be to use your knowledge of scale free networks to spend £2.20 on contacting only the most influential people in the network, and then see a growth of 90 new customers.
All in all, that means a saving of 78% on the cost of the campaign while attaining 90% of the potential growth!
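The arithmetic behind this comparison can be written out as a short sketch. The function name compare_campaigns and its parameters are made up for illustration. The 22% and 90% figures are the ones observed in the game, and the sketch simply assumes they carry over to this imaginary customer base. Costs are kept in whole pence to avoid rounding surprises.

```python
def compare_campaigns(customers, pence_per_message, full_growth,
                      targeted_fraction, targeted_growth_share):
    """Compare messaging everyone with messaging only the top influencers."""
    targeted_customers = round(customers * targeted_fraction)
    full_cost = customers * pence_per_message
    targeted_cost = targeted_customers * pence_per_message
    return {
        "full_cost_pence": full_cost,
        "targeted_cost_pence": targeted_cost,
        # Growth expected from the targeted campaign, in new customers.
        "targeted_growth": full_growth * targeted_growth_share,
        # Fraction of the full campaign cost that is saved.
        "saving_fraction": 1 - targeted_cost / full_cost,
    }


# Numbers from the example: 1000 customers, 1p per message, 100 new
# customers if everyone is contacted, and 22% of users driving 90% of growth.
result = compare_campaigns(1000, 1, 100, 0.22, 0.90)
print(f"Full campaign: £{result['full_cost_pence'] / 100:.2f}")
print(f"Targeted campaign: £{result['targeted_cost_pence'] / 100:.2f}")
print(f"Expected growth: {result['targeted_growth']:.0f} new customers")
print(f"Saving: {result['saving_fraction']:.0%}")
```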