January 1, 2022

What is Sentiment Analysis?

By John Emery

"This is the worst!" cried Reginald. Cynthia, his long-time business partner, hurried into the room. "Is everything alright?" she asked. Reginald, usually an exuberant man, sighed and said "we lost the Smithfield account. I don't know if we can pay holiday bonuses this year." Cynthia hung her head and said "we've been through worse, and times always get better. We can make it."

In the dramatic scene above, our hero Reginald is disheartened after losing a valuable account, meaning his employees may miss out on a generous holiday bonus. As English readers, we can read the paragraph and understand the overall sentiment: that of sadness, frustration, and despondency. Our brains are able to understand the context and we can relate to the emotions the characters feel. This is the human version of sentiment analysis: reading a piece of text and understanding the emotions that it imparts to the reader.

We can define sentiment analysis as the systematic identification, extraction, quantification, and analysis of subjective text. A tremendous amount of research currently goes toward developing more advanced sentiment analysis techniques. Companies rely on sentiment analysis to quantify free-form survey results, report on the sentiment of articles written about the company, and analyze customer feedback, to name a few areas of interest.

What makes sentiment analysis so hard?

While we can read a piece of text and understand the underlying sentiment, this is an extremely difficult problem for computers to solve. Looking word-by-word through the text above, we can see some positive and some negative words:

Positive:

  • Alright
  • Exuberant
  • Holiday
  • Bonuses
  • Better

Negative:

  • Worst
  • Cried
  • Lost
  • Hung
  • Worse

Although the overall sentiment is clearly negative to us, a naïve sentiment analysis algorithm will see a fairly balanced number of generally positive and negative words and say that the text is neutral in tone. Let’s look at a few examples of the vagaries of human language that can make sentiment analysis difficult:

  • I do not hate their team. (Negation)
  • I did stupendously well on my exam! (Adverb modifier)
  • This is just what I needed today. (Sarcasm)
  • John Carpenter’s The Thing has a creepy atmosphere perfect for a classic horror film. (The word creepy in this context is used as a positive term, but in other contexts would be considered negative: That guy is so creepy!)
  • Bruh, that bull yeeted the cowboy ten feet into the air! I can’t even! (Using new or slang terms whose sentiments are not well-defined or are highly variable)
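To make the word-counting problem concrete, here is a minimal Python sketch of a naive counter applied to the opening scene. The word lists are the positive and negative words listed earlier; the tokenizer and the function name `naive_sentiment` are illustrative assumptions, not a standard algorithm.

```python
import re

# Word lists taken from the positive and negative examples listed earlier.
POSITIVE = {"alright", "exuberant", "holiday", "bonuses", "better"}
NEGATIVE = {"worst", "cried", "lost", "hung", "worse"}

# The opening scene of this article.
SCENE = '''
"This is the worst!" cried Reginald. Cynthia, his long-time business partner,
hurried into the room. "Is everything alright?" she asked. Reginald, usually an
exuberant man, sighed and said "we lost the Smithfield account. I don't know if
we can pay holiday bonuses this year." Cynthia hung her head and said "we've
been through worse, and times always get better. We can make it."
'''


def naive_sentiment(text):
    """Count positive and negative words; return (positives, negatives, net)."""
    # Lowercase the text and keep only runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives, negatives, positives - negatives


if __name__ == "__main__":
    print(naive_sentiment(SCENE))  # (5, 5, 0)
```

The net score of zero is why a simple counter would call the scene neutral, even though a human reader finds it clearly negative.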

Until recently, sentiment analysis was firmly in the realm of academics and scientific researchers. Advanced knowledge of mathematics, linguistics, and computer programming was required to build any sentiment model.

As the popularity and the range of use cases of sentiment analysis have grown, the tooling has become ever more user-friendly. Drag-and-drop tools now allow non-experts to build sentiment analysis models quickly.

Sentiment Analysis: A Use Case

A major US-based health care provider asked us to build a predictive model that leveraged a sentiment analysis algorithm.

The client had a service that gave them a data set of articles written about the company from publishers such as the New York Times, USA Today, local news and national news stations, and even international vendors. Previously, they had a member of their staff read each article and assign it a grade based on a rubric.

Unfortunately, over time they found that this scoring method was highly inconsistent and subjective. The same article given to three different graders would likely receive three different scores. They turned to Tessellation to build a model that would offer consistent and objective results.

Our process consisted of three steps:

  1. Clean and prepare the data: We transformed the raw text data by splitting each sentence and word into separate records. This allowed us to more easily clean any messiness in the data prior to passing it into the sentiment analyzer. Using additional fields provided to us, we created dozens of predictor variables based on information such as the publisher, whether the article included key information the client had identified, and more.
  2. Score the sentiment of each article, title, and headline: Applying well-known sentiment analysis algorithms, we ran the text through them to obtain quantified sentiment data for each article. A strongly negative article may score a -10, whereas a highly positive article may score a +10. Scores near 0 indicate neutrality in the text.
  3. Pass all predictor variables into a model: Using the dozens of predictor variables (including the sentiment scores), we trained a model on the client’s past scoring. They provided us with nearly 10,000 scores, so we were able to build a reasonably robust model from their historical data. The model then classifies each new article as negative, positive, or neutral.
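The first step, splitting text into sentence and word records, might look roughly like the Python sketch below. The actual project was built with drag-and-drop tooling, so this is only an illustration; the function name `to_records`, the record fields, and the simple punctuation-based sentence splitter are all assumptions.

```python
import re


def to_records(article_id, text):
    """Split text into one record per word, tagged with its sentence number."""
    records = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sentence_no, sentence in enumerate(sentences, start=1):
        words = re.findall(r"[A-Za-z']+", sentence)
        for position, word in enumerate(words, start=1):
            records.append({
                "article_id": article_id,
                "sentence": sentence_no,
                "position": position,
                "word": word.lower(),
            })
    return records


if __name__ == "__main__":
    for record in to_records(1, "We lost the account. Times get better!"):
        print(record)
```

With one row per word, cleaning steps such as removing stray tokens or normalizing spelling can be applied record by record before the text is scored.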

By using established data science and machine learning principles (such as one-hot encoding), we built a consistent, reproducible model that allowed the client to quickly score a large number of articles objectively.
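For readers unfamiliar with one-hot encoding, the sketch below shows the idea in plain Python: a categorical field such as the publisher becomes a set of 0/1 columns that a model can consume alongside numeric fields like the sentiment score. The category names and the helper functions here are hypothetical and are not the client's actual variables.

```python
# Hypothetical publisher categories, for illustration only.
PUBLISHERS = ["local_news", "national_news", "international"]


def one_hot(value, categories):
    """Return a list with 1 in the slot matching value and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


def build_features(publisher, sentiment_score):
    """Combine the one-hot publisher columns with a numeric sentiment score."""
    return one_hot(publisher, PUBLISHERS) + [sentiment_score]


if __name__ == "__main__":
    print(build_features("national_news", -4.5))  # [0, 1, 0, -4.5]
```

A value outside the known categories encodes as all zeros, which is one simple way to handle publishers the model has not seen before.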

Sentiment Packages

Simple sentiment analyses use a dictionary of words that have each been assigned a point value. Words like “excellent” and “magnificent” express a strong positive sentiment and are given relatively large, positive values. Words that are generally negative, like “cancerous” and “bilious,” will have correspondingly negative scores.
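A dictionary-based scorer of this kind can be sketched in a few lines of Python. The point values below are invented for illustration and do not come from any real lexicon; `lexicon_score` is likewise a hypothetical helper.

```python
import re

# Hypothetical point values, for illustration only. Real lexicons assign
# scores through human rating or other validated methods.
LEXICON = {
    "excellent": 3.0,
    "magnificent": 3.5,
    "cancerous": -3.0,
    "bilious": -2.0,
}


def lexicon_score(text, lexicon=LEXICON):
    """Sum the point values of every word in text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0.0) for word in words)


if __name__ == "__main__":
    print(lexicon_score("An excellent, magnificent result"))  # 6.5
    print(lexicon_score("A bilious review"))                  # -2.0
```

Because words missing from the dictionary contribute nothing, the quality of the result depends heavily on how well the dictionary covers the vocabulary of the text being scored.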

One of the most important aspects of performing a sound sentiment analysis is using the right dictionary. A dictionary that is built to analyze Tweets will be very different from one built to analyze medical texts.

A very popular and easy-to-use sentiment analysis package for R and Python is known as VADER. In its documentation, the author writes:

“VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.”

Because VADER’s dictionary is attuned to social media, it should perform well when analyzing Twitter data, but perhaps not as well when it is used to analyze news articles.
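For those who do want to call VADER from Python directly, a minimal sketch follows. It assumes the third-party `vaderSentiment` package is installed (`pip install vaderSentiment`). The 0.05 cutoff used to turn the compound score into a label is the commonly cited default from the VADER documentation; treat it as an assumption to tune for your own data. The helper names are hypothetical.

```python
def vader_scores(text):
    """Return VADER's polarity scores for text.

    Requires the third-party vaderSentiment package.
    """
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # polarity_scores returns a dict with "neg", "neu", "pos" and "compound".
    return analyzer.polarity_scores(text)


def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse label (assumed cutoff)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    scores = vader_scores("This is the worst!")
    print(scores, label_from_compound(scores["compound"]))
```

Note that VADER's compound score runs from -1 to +1, so it would need rescaling to match a different range such as the -10 to +10 scale mentioned earlier.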

At Tessellation, our data tool of choice is Alteryx. Leveraging Alteryx’s Python SDK, we teamed up to build a tool that allows users to perform sentiment analysis using the VADER algorithm from within Alteryx, with no Python knowledge necessary! You can download it here.

Final Words

The great thing about our methodology is how simple it is. With the latest drag-and-drop tools, no coding is required (of course, you can add custom code if you desire). This means that virtually any analyst can work on the model without fear of inadvertently altering code that they do not understand. Just as Prometheus gave fire to the humans in Greek mythology, our tools give access to advanced analytic processes formerly only available to researchers and academics.

Sentiment analysis is a tremendous tool, and virtually any company today will be able to find some use for this burgeoning field of analysis.

On December 16th, 2020, I will be speaking in more depth about the sentiment analysis model that I built.
