There have been a few terms from maths and science that have entered into daily usage to become buzzwords in this past decade: e.g. Chaos, Complexity and Sustainable. We see them branded on buses in adverts; we hear them in business talks and sales pitches; we hear them a lot when a politician is talking but not really saying very much.

These words have precise mathematical meanings and their overuse as adjectives in day-to-day speech, whilst irritating to mathematicians everywhere, is also a good reflection of how we think about our world.

Let’s consider ‘complexity’. In day-to-day usage, ‘complex’ is a term given to something when we struggle to understand it, when there are too many parts or ‘knock-on effects’ to keep track of. For example, changing one thing affects many other things, or your connection to things that you know about also means you are connected to things you do not know about. This can lead to the idea of ‘unknown unknowns’ and ‘Black Swan moments’. In this blog post I will introduce one of the methods used to help understand connections and dynamics (things that happen over time) across networks. I want to show how much we can suddenly know when we consider that things are connected in a certain way: that a group of things, people or tweets has a structure, and that this structure matters.

Stephen Hawking is noted for calling the 21st century the “century of complexity” (read more about that here), and a quote from Gell-Mann (winner of the 1969 Nobel Prize in Physics) explains:

“Today the network of relationships linking the human race to itself and to the rest of the biosphere is so complex that all aspects affect all others to an extraordinary degree. Someone should be studying the whole system, however crudely that has to be done, because no gluing together of partial studies of a complex nonlinear system can give a good idea of the behaviour of the whole.”

Scientists started looking at the ‘whole’ and realised how incredibly connected it is. For centuries, scientists had been collecting incremental knowledge about everything, reducing something to a process and then parts of that to a smaller process (we call this reductionism). Over the last half century there has been a push to stop looking at all these ‘facts’ in isolation. We discovered we cannot simply ‘glue’ all this knowledge together to tell us about the whole. We need to look at the whole.

Disclaimer 1: There are no equations here and I have tried to use plain language instead of mathematically precise language (sorry mathematicians, but there are ‘correctly’ worded papers on Google scholar!)

Disclaimer 2: This blog post is a bit cut and dried in the middle. In summary, it’s trying to say that measuring and understanding the structure of connections can be as important as, if not more important than, measuring each part.

Definitions

A network is a representation we create when we want to link things together. If we are considering Twitter, then the things we want to link (nodes) can be users, tweets, or parts of tweets, e.g. hashtags, ngrams (specific words, or combinations of specific words). The links are made based on a relationship between the nodes; e.g. users can be linked because they follow each other or mention each other; hashtags can be linked because the same user mentions them both.
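As a minimal sketch of these definitions, the snippet below builds two small networks from a handful of tweets: hashtags linked because the same user tweeted them, and users linked because one mentions the other. The tweets, the user names and the helper `hashtags` are invented for illustration and are not taken from real data.

```python
import re
from itertools import combinations

# Hypothetical tweets as (user, text) pairs, invented purely for illustration.
tweets = [
    ("alice", "Loved the new film #starwars #scifi"),
    ("alice", "Re-reading the books #harrypotter"),
    ("bob", "Thanks @alice for the tip #starwars"),
]

def hashtags(text):
    """Return the lower-cased hashtags found in a tweet's text."""
    return {tag.lower() for tag in re.findall(r"#(\w+)", text)}

# Collect the set of hashtags each user has tweeted.
tags_by_user = {}
for user, text in tweets:
    tags_by_user.setdefault(user, set()).update(hashtags(text))

# Nodes are hashtags: link two hashtags when the same user has tweeted both.
hashtag_links = set()
for tags in tags_by_user.values():
    for first, second in combinations(sorted(tags), 2):
        hashtag_links.add((first, second))

# Nodes are users: link user-a to user-b when a mentions b.
mention_links = set()
for user, text in tweets:
    for mentioned in re.findall(r"@(\w+)", text):
        mention_links.add((user, mentioned.lower()))

print(sorted(hashtag_links))
print(sorted(mention_links))
```

The same tweets give two different networks depending on what we choose as nodes and which relationship we use for the links.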

Descriptive statistics are measures that describe the group of things (the ensemble). When the measure is of the network (things plus links) rather than the ensemble (just things), we are describing in some way the structure of the group. The network has node properties and edge properties, and we describe both of these plus the structure that is created.

User Influence

Right now our proxy measure for user influence is ‘tweet volume multiplied by the number of followers’. This is not a great metric: it measures the potential for engagement rather than engagement itself, and most Twitter users do not engage with every tweet from the accounts they follow. We need to place more emphasis on measuring engagement.

This pulls us towards considering the more ‘active networks’ such as mention-networks (where a directed link forms from user-a to user-b when a mentions b) and hashtag-topic-networks (where a non-directed link is formed between two hashtags tweeted by the same user).

Network Types: Characteristics.

Before we continue I want to talk about network types:

A network is either directed or non-directed. In a directed network the links are like arrows (e.g. user-a follows user-b); in a non-directed network the links have no direction (e.g. #starwars and #harrypotter are tweeted by the same user).

A network is either weighted or non-weighted. In a non-weighted user network, users are either connected or not connected (e.g. a follows b), but in a weighted user network we assign a number to each link, e.g. if user-a mentions user-b 5 times then the link between user-a and user-b would have a weight of 5. This means there are ‘4 types’ of network: the combinations of directed or non-directed with weighted or non-weighted. The way we measure each one depends on the type.
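To make the ‘4 types’ concrete, here is a small sketch that derives all four from one list of mention events. The events are made up, and adding the two directions together to get a non-directed weight is just one possible choice, assumed here for illustration.

```python
from collections import Counter

# Hypothetical mention events (who mentioned whom), invented for illustration.
mentions = [("a", "b")] * 5 + [("b", "a")] * 2 + [("a", "c")]

# Directed, weighted: count the mentions for each ordered pair of users.
directed_weighted = Counter(mentions)

# Directed, non-weighted: keep only whether an arrow exists.
directed_unweighted = set(directed_weighted)

# Non-directed, weighted: ignore direction and add the counts together
# (an assumption; other ways of combining the two directions are possible).
undirected_weighted = Counter()
for (source, target), weight in directed_weighted.items():
    undirected_weighted[frozenset((source, target))] += weight

# Non-directed, non-weighted: two users are connected or they are not.
undirected_unweighted = set(undirected_weighted)

print(directed_weighted[("a", "b")])                # 5
print(undirected_weighted[frozenset(("a", "b"))])   # 7
```

User-a mentioning user-b 5 times gives the directed link a weight of 5, as in the text; once direction is dropped, the 2 mentions going the other way are counted on the same link.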

What do we mean by ‘Measure’?

A network measure works in the same way as an ‘average’: an average is a measure of the distribution of an ensemble and lets us infer something about the group. For example, if the mode (most commonly occurring) age at a 16th birthday party is 15, then maybe the birthday person is one of the ‘older’ of their friends, and if the mean age is 18, then there are probably some adults there. Let’s look at this in a diagram:

Graph1

In Dick Fear’s previous blog post, it was mentioned that a lot of data science is about fitting curves to things. Here I could measure certain things about the shape of the red curve and use it to infer things about the party.

We are mapping the measurements (metrics) to some real-world assumptions. There are of course many other explanations for these statistics: we could add a 28-year-old and an 8-year-old to the mix and the statistics would not change. However, if our question was how many party bags to provide, when we only want to give party bags to the teenagers, then we can use the metrics we have to make a good estimate.
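The party example can be checked with a few lines of arithmetic. The guest list below is invented so that it matches the figures in the text (a mode of 15 and a mean of 18), and counting ages 13 to 19 as ‘teenagers’ is an assumption.

```python
from statistics import mean, mode

# Hypothetical guest ages chosen to match the text:
# most guests are 15, and one adult pulls the mean up to 18.
ages = [14, 15, 15, 15, 15, 15, 16, 16, 41]
print(mode(ages), mean(ages))

# Adding a 28-year-old and an 8-year-old leaves both statistics unchanged,
# because the two new ages average to 18 and neither is the most common age.
ages_plus = ages + [28, 8]
print(mode(ages_plus), mean(ages_plus))

# Party bags go to teenagers only (assumed here to mean ages 13 to 19).
teen_count = sum(13 <= age <= 19 for age in ages_plus)
print(teen_count)
```

The two summary statistics cannot tell these guest lists apart, which is why they are only good for questions, like the party bags, that do not depend on the difference.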

Network Dynamics

Above, I used the example “A network can be weighted e.g. user a mentions user b 5 times”.

There is a time frame here: if we say user-a mentions user-b 5 times in 4 days, then that link is valid for that ‘snapshot’ of the network. In the next 4 days there may be 10 mentions and in the following 4 days 0. Hence that link would have weights [5,10,0]. Following networks over time is network dynamics.
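A minimal sketch of how such snapshot weights could be computed, assuming we have a log of the days on which user-a mentioned user-b. The log and the helper name `link_weights` are hypothetical.

```python
from collections import Counter

WINDOW_DAYS = 4  # snapshot length used in the text

# Hypothetical log of the days (counted from 0) on which user-a mentioned user-b.
mention_days = [
    0, 1, 1, 2, 3,                  # 5 mentions in days 0 to 3
    4, 4, 5, 5, 5, 6, 6, 7, 7, 7,   # 10 mentions in days 4 to 7
]                                   # no mentions in days 8 to 11

def link_weights(days, window, n_windows):
    """Count mentions in each consecutive window, keeping empty windows as 0."""
    counts = Counter(day // window for day in days)
    return [counts[i] for i in range(n_windows)]

weights = link_weights(mention_days, WINDOW_DAYS, 3)
print(weights)  # [5, 10, 0]
```

Choosing a different window length would give a different sequence of weights for the same log, so the snapshot size is part of the model.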

That may seem like an obvious sentence, but let’s think a bit about it. ‘Network dynamics’ means how the network changes over time.

This can be thought of as two types of change: the dynamics of the network, i.e. how the network structure changes over time (new user-nodes, new links, removed links, changes to link weights), and the dynamics on the network.

This second type covers the processes that happen without structural change. The two obviously depend on each other. This second part ventures into the world of non-linear dynamics and feedback loops, and edges us towards tipping points and chaos theory. I will stop here, as the metrics we are talking about are to do with network structure. The field of opinion dynamics explores how the processes governing these changes can be modelled. We will touch on this later, but it’s a bit beyond the scope of the ideas I want to introduce in this blog post.

One way to think about centrality is as a kind of average. If you think about the mean, median and mode as different ‘average’ measures or different ‘statistical properties’ of a group of things (an ensemble), as we described with the birthday party example, then you can similarly think about centrality as a property of a network.

There are many types of centrality. The simplest is degree centrality. For each node in the network we count the number of links it has. If the network is weighted then we sum the weights.

If the network were of users linked as followers (a directed, unweighted network), then the degree centrality of each node would be equal to the number of users who follow them.
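As a rough sketch, degree centrality needs nothing more than counting. The follower links and mention weights below are invented, and only incoming links are counted, which matches the ‘number of users who follow them’ reading for a directed network.

```python
from collections import Counter

# Hypothetical follower links as (follower, followed) pairs.
follows = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b"), ("c", "b")]

# Directed, unweighted: count incoming links to get each user's follower count.
in_degree = Counter(followed for _, followed in follows)

# Hypothetical weighted mention links: (source, target) -> number of mentions.
mention_weights = {("a", "b"): 5, ("c", "b"): 1, ("b", "a"): 2}

# Weighted: sum the weights of the incoming links instead of counting them.
weighted_in_degree = Counter()
for (_, target), weight in mention_weights.items():
    weighted_in_degree[target] += weight

print(in_degree.most_common())           # [('a', 3), ('b', 2)]
print(weighted_in_degree.most_common())  # [('b', 6), ('a', 2)]
```

In this made-up data user a ranks first by followers but user b ranks first by mentions received, so the choice of network changes who looks most central.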

Using volume of tweets multiplied by followers is the same as taking the node property of tweet volume and multiplying it by the degree centrality of the follower network. As a statistical measure this intuitively makes sense, but in practice we know we want to look at engagement networks. Hence we can use the centrality of different networks as a proxy for influence. We can also rank the users based on this, or even divide them into groups of the more important and the less important (k-cores is a simple algorithm for doing this).
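For readers curious about k-cores, here is a minimal sketch of one common way to compute them on a non-directed, non-weighted network: repeatedly peel away the least-connected nodes. The graph and the function name `core_numbers` are hypothetical, and this is a simple illustration rather than an optimised implementation.

```python
def core_numbers(adjacency):
    """Return each node's core number by repeatedly peeling low-degree nodes.

    Assumes a non-directed network: every link appears in both nodes'
    neighbour sets, and every neighbour is also a key of `adjacency`.
    """
    remaining = {node: set(neighbours) for node, neighbours in adjacency.items()}
    core = {}
    k = 0
    while remaining:
        # Nodes with at most k remaining links cannot belong to the (k+1)-core.
        peel = [node for node, neighbours in remaining.items() if len(neighbours) <= k]
        if not peel:
            k += 1
            continue
        for node in peel:
            core[node] = k
            # Remove the node and drop it from its neighbours' link sets.
            for neighbour in remaining.pop(node):
                remaining[neighbour].discard(node)
    return core

# Hypothetical network: a tightly linked triangle (a, b, c) with a tail (d, e).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
cores = core_numbers(graph)
print(sorted(cores.items()))
```

Here the triangle ends up in the 2-core while the tail nodes only reach the 1-core, which gives a simple split into a more connected group and a less connected one.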

When Centrality is not so Useful

Whilst ranking users and grouping them based on counts of their different properties can be useful, centrality can be deceptive in telling us how important structurally the user is to passing information.

Look at the pictures below. The degree centrality of node b is much lower than that of a or c (it has fewer connections), but it is obviously much more important, as it allows the red and blue nodes to be connected to each other.

Picture2
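The same point can be sketched in code. The layout below is a hypothetical one, not a reproduction of the picture: a red cluster around a, a blue cluster around c, and a node b whose only two links join the clusters. We compare each node's degree with what happens to the network when that node is removed.

```python
from collections import deque

def build(edges):
    """Build a non-directed adjacency map from a list of links."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def is_connected(adjacency, removed=None):
    """Check whether every remaining node is reachable, ignoring `removed`."""
    nodes = [node for node in adjacency if node != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)

# Hypothetical layout: each cluster is a hub linked to a ring of four nodes.
red = [("a", "r1"), ("a", "r2"), ("a", "r3"), ("a", "r4"),
       ("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r1")]
blue = [("c", "s1"), ("c", "s2"), ("c", "s3"), ("c", "s4"),
        ("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s1")]
bridge = [("b", "r1"), ("b", "s1")]
network = build(red + blue + bridge)

for node in ("a", "b", "c"):
    degree = len(network[node])
    print(node, degree, is_connected(network, removed=node))
```

In this layout a and c each have 4 links and b has only 2, yet removing a or c leaves everything connected, while removing b cuts the red nodes off from the blue ones.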

Getting Rid of Unknown Unknowns

Networks can be used to model a whole load of things, from the spread of diseases (epidemiology) and opinion dynamics, as we started to mention here, to bank or financial defaults, where each bank is connected to the others through loans or other obligations. Many fields make use of them.

If we look at a network it’s easy to imagine that the structure would change over time: new nodes and new links appear, and some may disappear. Each of these changes may depend on the others and on some ‘system’ state. So we end up with a network changing over time, where the next change is a result of what it is like as a ‘whole’ and also what each part is like. Trying to understand how these will evolve over time, and according to what rules, is done with networks and also with non-linear dynamics and agent-based models.

All these tools help to project what will happen next and what could happen in different scenarios. Essentially they help us move things from the ‘unknown unknown’ list to the ‘known unknown’ list and then to figure out how probable and under what conditions these things may happen.

In opinion dynamics, using the examples above, we could try to engineer the system such that we spread a certain idea, or stop a negative one. This requires us to understand how the idea will spread over time and to identify the key players to target to get the effect we want.

In banking, quantitative easing looks at the same questions: who is important systemically, who can you not afford to lose, who will cause the worst domino effect? The idea that the largest, most connected bank can actually pose the most risk was ‘new’ to economists in 2007. Now we have research journals and policy on ‘systemic risk’, contagion and networks.

Networks are a mathematical tool that can help us understand our world. They have had a huge impact in science, and could have a big impact in marketing.

The aim here is to make network modelling of social networks the norm in marketing. It sounds strange to say out loud, but there is an opportunity for marketers here. Customer segmentation methods rely on descriptive statistics: you find consumers who match a set of characteristics to other consumers and you target them. Moving past descriptive statistics and starting to engineer social media offers a different opportunity.

Charlotte is a Data Scientist Researcher with a PhD in Engineering Maths and two Masters degrees: one in Complex Systems and one in Earth Sciences. After thinking a lot about systemic risk in economics and finance, she now focuses on finding the right mathematical tools for our algorithms.