Part 50 Mining

Now that we have gone over how to wrangle and shape our data, it's time to talk about some useful quantitative measures of text that help us get a better understanding of what's going on within our text.

50.1 Sentiment Analysis

Feeling is a big part of communication. Some words have different emotions/feelings behind them. That may be an interest of ours when looking at text.

One simple approach is to have a list of words labeled with different emotions. We treat our text as a bag-of-words and match the words in our text to a dataset of words and their sentiments. This is a lexicon-based approach (similar to the lexicon-based approach we used for stop words).

There exist different lexicons with different ways of expressing sentiment. We are going to look at some lexicons found in the textdata package that work nicely with tidytext.

textdata lexicons:

  • Bing: Expresses words in a binary positive or negative fashion

    • Positive or negative
  • AFINN: Expresses words with a value (-5 to 5) according to the word's valence

    • -5 to 5
  • NRC: Expresses words with the following sentiments:

    • Negative
    • Positive
    • Anger
    • Anticipation
    • Disgust
    • Fear
    • Joy
    • Sadness
    • Surprise
    • Trust

50.1.1 Looking at our Lexicons

Using the tidytext::get_sentiments() function we could take a peek at the words within each lexicon. However, we'll stick to manually reading them in as .csv files to practice our skills.
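For reference, the same lexicons can be pulled directly with get_sentiments(); the AFINN and NRC lexicons are downloaded through the textdata package the first time you request them. A quick sketch of that alternative route:

library(tidytext)

get_sentiments("bing")  # positive / negative
get_sentiments("afinn") # values from -5 to 5
get_sentiments("nrc")   # multiple sentiments per word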

50.1.1.1 Bing - Positive vs. Negative

bing <- read.csv(here::here("data","bing_dictionary.csv"))
bing |> 
  dplyr::count(sentiment)

Here we can see how many words are positive and negative within our lexicon.

50.1.1.2 AFINN - Words on a -5 to 5 scale

afinn <- read.csv(here::here("data","afinn_dictionary.csv"))
afinn |>
  mutate(value = as_factor(value)) |> # make our value a factor
  ggplot() +
  aes(x = value) +
  geom_bar() # geom_bar() counts each value for us

From the above plot we can get a picture of the count of each value. For example, we see that we have a lot of words with a -2 rating.

If we want specific counts, we can count() the values ourselves:

afinn |> 
  mutate(value = as_factor(value)) |> 
  count(value)

50.1.1.3 NRC - Anger, Anticipation, …

nrc <- read.csv(here::here("data","nrc_dictionary.csv"))

This lexicon can include the same word multiple times, once for each sentiment attached to it.

For example, the word “abandon” has “fear”, “negative”, and “sadness” attached to it.
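We can confirm this by filter()ing the lexicon for that word (a quick sketch):

nrc |> 
  dplyr::filter(word == "abandon") # one row per sentiment attached to "abandon"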

Let’s get a feel (ba-dum-tsh) for our lexicon:

We can see how many unique words we have by count()ing them:

nrc |> 
  count(word) |> # automatically collapse the same word
  nrow()

Here we can see how many words are attached to each sentiment.

nrc |> 
  count(sentiment) |> 
  ggplot() +
  aes(x = sentiment, y = n) +
  geom_col()

50.1.2 Using inner_join() with our Bing lexicon

Let’s revisit our unnest_df from earlier:

unnest_df |> head()

Using inner_join() we can match words from our lexicon to our data.

Note: Remember that the column containing the words needs to have the same name in both your data and your lexicon.

unnest_df |> 
  inner_join(bing, by = "word")

From the above output we can see that line 2 had 2 positive words, while line 3 had a mix of 1 positive word and 1 negative word.
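If we wanted a per-line summary instead of one row per matched word, one option is to count() the sentiments and spread them into columns. A sketch (the net_sentiment column name here is just something made up for illustration):

unnest_df |> 
  inner_join(bing, by = "word") |> 
  count(line, sentiment) |> 
  tidyr::pivot_wider(names_from = sentiment,
                     values_from = n,
                     values_fill = 0) |> 
  mutate(net_sentiment = positive - negative) # positive minus negative words per line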

50.1.3 Important things with sentiment analysis

Now there are a couple of things to consider when deciding which lexicon to use, beyond just the depth/type of sentiment we want to look at.

Here are some questions to keep in mind:

  • What was the lexicon developed for?

    • Reviews?

    • Tweets?

  • How was the lexicon developed?

    • Through researchers?

    • Crowd sourced?

  • How old is the lexicon?

    • Words change meaning over time

    • Has the culture around how the words are used changed?

Another thing to consider is that lexicon-based sentiment analysis is a bag-of-words type of analysis: context isn't considered when looking at individual words. If you're interested in techniques that keep some context, look into n-grams.
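As a teaser, tidytext can tokenize directly into n-grams. A minimal sketch, assuming a data frame text_df with a text column (hypothetical here):

text_df |> 
  tidytext::unnest_tokens(bigram, text,
                          token = "ngrams", n = 2) # two-word sequences keep a bit of context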

50.2 Metrics of Text

Let’s dig into some quantitative measures of text data. Two main metrics we will consider in this chapter are:

  • TF-IDF

  • Weighted log-odds ratio.

50.2.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. Its main goal is to measure how “important” a word is to a document in our corpus (Silge & Robinson, 2017). TF-IDF is useful when we want to understand the terms in the corpus as a whole.

Let’s break down TF-IDF into its two parts:

  • Term frequency: How many times does a term/word occur in a document?

  • Inverse document frequency: A weight based on how many documents in the collection contain the word; the more documents a term shows up in, the smaller its IDF.

\[ idf(term) = \ln\left(\frac{n_{documents}}{n_{documents\ containing\ term}}\right) \]

This is based on the idea that the more documents a term shows up in, the less important that term is.
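To make the formula concrete, here is a by-hand sketch on a tiny made-up count table (hypothetical documents and counts); this is the arithmetic that bind_tf_idf() will automate for us below:

library(dplyr)

toy_counts <- tibble::tribble(
  ~document, ~word, ~n,
  "doc1",    "the", 10,
  "doc1",    "cat",  2,
  "doc2",    "the",  8,
  "doc2",    "sat",  1
)

n_documents <- n_distinct(toy_counts$document)

toy_counts |> 
  group_by(word) |> 
  mutate(idf = log(n_documents / n_distinct(document))) |> # ln(n docs / n docs containing term)
  group_by(document) |> 
  mutate(tf = n / sum(n), # term frequency within each document
         tf_idf = tf * idf) |> 
  ungroup()

Notice that “the” appears in both documents, so its idf (and therefore its tf_idf) is 0.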

50.2.1.1 Zipf’s Law

TF-IDF is inspired by Zipf’s law, named after the linguist George Zipf.

Zipf’s law states the following:

the frequency that a word appears is inversely proportional to its rank.

This makes sense if we think about how language works. We would expect words that don’t show up often to carry more weight and be more important than words that show up all the time.
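We can sanity-check this on our own corpus by ranking words by how often they appear; a quick sketch using our unnest_df from earlier:

unnest_df |> 
  count(word, sort = TRUE) |>    # frequency of each word
  mutate(rank = row_number()) |> # rank 1 = most frequent word
  ggplot() +
  aes(x = rank, y = n) +
  geom_line() +
  scale_x_log10() + # Zipf's law shows up as a roughly straight line on log-log axes
  scale_y_log10()

(With a corpus as small as ours the pattern will be rough, but it becomes striking with larger collections of text.)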

50.2.1.2 Example

Let’s calculate the TF-IDF with bind_tf_idf()

We need our data set to have one row per document-term pair:

tidy_df <-
  unnest_df |> 
  count(line, word, sort = TRUE)

tidy_df |> head()

Now we can bind_tf_idf()

tidy_tf_idf_df  <- 
  tidy_df |> 
  bind_tf_idf(word, line, n)

tidy_tf_idf_df

We can see the top TF-IDF words are: “words”, “feelings”, “express”, “excitement”, and “anxious” while our lowest are “learn” and “day.”

50.2.1.3 Visualize TF-IDF

tidy_tf_idf_df |> 
  ggplot() +
  aes(x = fct_reorder(word, tf_idf), # order `word` by `tf_idf`
      y = tf_idf,
      fill = as_factor(line)) + # color :-)
  geom_col() +
  coord_flip() + 
  facet_wrap(~ line, scales = "free_y") + 
  labs(x = "",
       fill = "Line #")

Above we get a nice graph of our words and their TF-IDF.

50.2.2 Weighted Log-odds

Sometimes we are interested in comparing words between different groups within a corpus. Here are some examples:

  • Corpus of scientific articles from different fields (e.g., economics, medicine, technology):

    • What words are you most likely to see given the field?
  • Corpus of two Twitter users:

    • What words are you more likely to see from each user?
  • Corpus of a collection of products and their respective reviews:

    • What words are you more likely to see from each product/rating?
  • Corpus of baby names from different time periods (1960s, 1970s, …):

    • What names were most prevalent in each period compared to the others?

To do this we can use the weighted log-odds ratio. Let’s break down what “weighted log-odds ratio” means.

50.2.2.1 Odds Ratio

Odds are, simply, the chance of something happening divided by the chance of that thing not happening. (An odds ratio then compares the odds between two groups, say, between two lines of text.)

\[ Odds = \frac{x\ happening}{x\ not\ happening} \]

This is different from probability: probability can be thought of as the chance of x happening out of all possible outcomes.

50.2.2.2 Log-odds ratio

The log-odds ratio is just the odds ratio with a log transformation (natural log).

The reason is that the log gives us some nice behavior:

  • Symmetry around zero

  • Transforms our range from \([0, \infty)\) for odds to \((-\infty, \infty)\) for log odds

\[ log(Odds\ Ratio) = log(\frac{x\ happening}{x\ not\ happening}) \]
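A quick numeric sketch (with a made-up probability) shows the symmetry:

p <- 0.75          # say x happens with probability 0.75
p / (1 - p)        # odds of x happening: 3
log(p / (1 - p))   # ~ 1.10
log((1 - p) / p)   # ~ -1.10, the opposite event mirrors around zero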

50.2.2.3 Weighted log-odds ratio

The bind_log_odds() function fits a posterior log odds ratio assuming a multinomial model with a Dirichlet prior. (See ?bind_log_odds for more info.)

To look into the actual method behind the function, see Monroe, Colaresi & Quinn (2007).

50.2.3 Example

Using bind_log_odds() we can specify different arguments to tweak our output.

library(tidylo)

# from earlier:
# tidy_df <-
#   unnest_df |> 
#   count(line, word, sort = TRUE)


log_odds_df <- 
  tidy_df |> 
  bind_log_odds(line, word, n,
                uninformative = TRUE, # TRUE = use an uninformative (flat) Dirichlet prior
                unweighted = TRUE) # TRUE = also add an unweighted log odds column
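Before plotting, we can peek at the strongest words per line with slice_max(); a quick sketch:

log_odds_df |> 
  group_by(line) |> 
  slice_max(log_odds_weighted, n = 3) |> # top 3 weighted log odds words per line
  ungroup()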

50.2.3.1 Visual

We can now visualize the weighted log odds for each line:

log_odds_df |> 
  ggplot() +
  aes(x = fct_reorder(word, log_odds_weighted), y = log_odds_weighted, fill = as_factor(line)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ line, scales = "free_y") +
  labs(x = "",
       fill = "Line #")