Black Swan’s Data Science Lead Architect Henrietta Eyre on the complexities of Sentiment Analysis, and how to build a sentiment classifier.

Everyone in social media analytics is interested in sentiment analysis.

Sentiment analysis refers to the area of natural language processing where the polarity of a text document is automatically determined. In a sentiment classifier, the algorithm takes the document as an input, and as an output provides a sentiment rating as either positive, negative or neutral.

Accurate sentiment rankings are incredibly useful for analysing your social media content. It takes you beyond the realms of searching for keywords human intuition tell you may indicate sentiment, and allows you to tag all text with a sentiment rating, regardless of whether you have seen the likes of its content before or not.

So how can we build a sentiment classifier?  A widely-used algorithm is the Naïve Bayes Classifier. It has a theoretical backbone of Bayes’ Theorem (recently covered in Jude Bowyer’s blog post), and if we pass a set of numbers to the algorithm then it will look for the boundary within a vector space for the different sentiment classes, and will label our input accordingly. Now we have a text input, and so the numerical input is determined by counting up the frequency of the words in the input text.

So “We love Black Swan and we love data science” would be turned into the following form:

{“we”: 2, “love”:2, “black”: 1, “swan”:1, “and”:1, “data”:1, “science”:1 }

and the numbers would be used to calculate our sentiment label.

This method has been shown to work very well for lots of text forms. When we have longer text expressing a given sentiment, the counts of words which the algorithm has learned to associate more strongly with a positive or negative sentiment allow it to give unseen text a sentiment label with good accuracy. However in the space of social media we find that this model does not work so well.

Let’s take Twitter as an example data source. 140 characters is not very much space for our algorithms to work with and the Naïve Bayes algorithm would struggle to find a sentiment class for each message with a high accuracy.

Instead we have to look a bit deeper into the way that people engage with social media, and pick apart the way in which they use language differently. Here are some of the insights we found within Twitter data:

sky

 

weighed

bones

 

We found that users elongate their words in order to emphasise their sentiment. So whereas ‘I love the sky’ would indicate a positive sentiment, the user elongating ‘love’ to ‘loooove’ shows a heightened positive sentiment.

clint2

We found that users also use extension of punctuation to accentuate a message’s sentiment. In the above example, the user has used three exclamation marks. This can be paired with ‘5-STAR RATING’ which is expressing a positive sentiment, to indicate a heightened positive sentiment.

Emoticons are extremely important in representing sentiment. But we have found to be careful, we cannot simply map a happy emoticon to positive sentiment and an unhappy emoticon to negative sentiment. Sometimes emoticons when paired with an opposing sentiment in the text can be an indicator of sarcasm.

We can also find opposing sentiments in the same message:

twitter1

Here the user is expressing negative sentiment towards being ill, but positive sentiment to a new user following them. How do we deal with assigning a sentiment tag to the whole message?

These are all factors which need to be taken into account when building a sentiment algorithm, and the golden rule is that there is no ‘one size fits’ all algorithm to handle sentiment. Every data channel will require its own model, and your models will need to evolve as social media habits evolve.

Perhaps this is why sentiment is a problem we love to work on, we love a challenge!

Ettie is a trained mathematician with a PhD in artificial intelligence. After a stint in academia she found her way to industry and now works on crunching natural language processing algorithms at Black Swan.