Part 51 Case Study 2: Reviews on a Product

Below we are going to look at Amazon reviews of scented candles.

Let’s read in our dataset.

library(tidyverse)
library(tidytext)

scented_reviews <-
  readxl::read_excel(here::here("data", "Scented_all.xlsx")) |>
  mutate(review_id = row_number()) |>
  janitor::clean_names() |> # A function that changes the column names to snake_case/cleans them up a little
  mutate(rating = as_factor(rating)) 

51.0.1 Clean Text

The first step is to unnest_tokens() our reviews.

unnest_scent <-
  scented_reviews |>
  unnest_tokens("word", "review",
    token = "words"
  )
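
Each review is now one token (word) per row, with the other columns carried along. Here is a quick peek at the result (a minimal check using the columns created above):

unnest_scent |>
  select(review_id, rating, word) |>
  head()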

51.0.2 Remove Stop-words

We then remove our normal stop words.

unnest_scent <-
  unnest_scent |>
  anti_join(stop_words, by = "word")

51.0.3 Removing Custom Stop-words

Usually there are words that we know are more “noise” than “signal”. To determine whether a word is “noise”, we can use a few different strategies.

One strategy is to look through all of the unique words and skim for ones not already covered by the stop-words lexicon.

unnest_scent |>
  distinct(word) |>
  head()
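
Since head() only shows the first handful of words, another option is to skim a random sample of the vocabulary (a small sketch; the seed is arbitrary):

set.seed(1234) # arbitrary seed so the sample is reproducible
unnest_scent |>
  distinct(word) |>
  slice_sample(n = 20)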

Another strategy is to look at the most frequent words and decide whether each one is useful or not.

unnest_scent |>
  count(word) |>
  arrange(desc(n)) |>
  head()

Now, once we decide which words to remove, we can put them together in a tribble().

custom_stopwords <-
  tribble(
    ~word,       ~lexicon,
    "candle",    "custom",
    "candles",   "custom",
    "smell",     "custom",
    "smells",    "custom",
    "scent",     "custom",
    "fragrance", "custom",
    "yankee",    "custom",
    "love",      "custom", # This word shows up so much it acts like a stop word
    "sooo",      "custom", # This is an example of using text from the internet
    "soooo",     "custom",
    "doesn",     "custom", # Likely fragments of contractions ("doesn't", "don't") left over after tokenization
    "don",       "custom"
  )

We use the same method to remove our custom_stopwords as we do with regular ol’ stop_words.

unnest_scent <-
  unnest_scent |>
  anti_join(custom_stopwords, by = "word")
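
An equivalent approach is to bind the custom list onto stop_words first and remove everything with a single anti_join(). A sketch (this would give the same result as the two separate anti_join() calls above):

all_stopwords <- bind_rows(stop_words, custom_stopwords)

# unnest_scent <- anti_join(unnest_scent, all_stopwords, by = "word")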

51.0.3.1 Remove Numbers/Non-alphabet characters

Depending on the type of data you’re working with, there may be more than just stop words to remove. Since we’re working with data from an online source, there are some odd non-ASCII characters as well as digits scattered through the reviews. Using the stringr package, we can clean these up.

unnest_scent <-
  unnest_scent |>
  mutate(word = str_replace(word, pattern = "[^\x20-\x7E](.*)", replacement = "")) |> # Trim tokens from the first non-ASCII character onward
  filter(
    !str_detect(word, "[0-9]+"), # Remove words that contain digits
    word != ""
  )
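
To see what these rules actually do, here is a small illustration on a handful of made-up tokens (hypothetical examples, not taken from the data):

example_words <- c("great", "café", "100", "wax3", "")
str_replace(example_words, "[^\x20-\x7E](.*)", "") # "café" becomes "caf"; ASCII-only tokens are untouched
str_detect(example_words, "[0-9]+") # TRUE for "100" and "wax3", so they get filtered out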

51.0.3.2 Removing words that show up fewer than 10 times

A common way to reduce the amount of noise within a corpus is to focus on words that show up at least n times; here we keep only words that appear 10 or more times.

unnest_scent <-
  unnest_scent |>
  add_count(word, name = "n_total") |>
  filter(n_total >= 10)
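
A quick sanity check on how much this trims the vocabulary:

unnest_scent |>
  distinct(word) |>
  nrow() # number of unique words remaining after all of the cleaning steps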

51.0.4 Word Frequency
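
With the text cleaned up, let’s look at which words appear most often overall. One way to visualize this, using the n_total column we just added:

unnest_scent |>
  distinct(word, n_total) |>
  slice_max(n_total, n = 15) |>
  ggplot() +
  aes(x = fct_reorder(word, n_total), y = n_total) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Most Frequent Words Across All Reviews",
    x = NULL,
    y = "Number of Occurrences"
  )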

51.0.5 Sentiment analysis

Let’s get a feel for our data (someone stop me, I’m on a roll).

We are going to work with the BING and NRC lexicons.

51.0.5.1 BING

Just a reminder: BING codes each word as either positive or negative.

bing <- read.csv(here::here("data","bing_dictionary.csv"))
bing_scent <-
  unnest_scent |>
  inner_join(bing, by = "word")
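
Note that inner_join() keeps only the words that appear in the lexicon, so it is worth checking how many of our tokens actually matched (a rough coverage check):

nrow(bing_scent) / nrow(unnest_scent) # proportion of tokens that matched the BING lexicon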

51.0.5.1.1 Overall

bing_scent |>
  count(sentiment) |>
  ggplot() +
  aes(
    x = sentiment, y = n,
    fill = sentiment
  ) +
  geom_col() +
  labs(
    title = "Overall BING Sentiment of Customer Reviews",
    y = "Number of Words"
  )

According to the graph above, overall we have a lot more positive words than negative ones within our reviews.

51.0.5.1.2 By Rating

Now it might be useful to know how the number of positive and negative words changes within each rating.

bing_scent |>
  group_by(rating) |>
  count(sentiment) |>
  ggplot() +
  aes(x = sentiment, y = n, fill = sentiment) +
  geom_col() +
  facet_wrap(~rating, scales = "free") +
  labs(
    title = "BING Sentiment of Customer Reviews by Rating",
    y = "Number of Words"
  )

Intuitively, we expect an inverse relationship between the rating and the number of negative words. To put it another way, we expect people who rate the candle lower to express more negative sentiment.
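
One way to check this hunch numerically is to compute the share of negative words within each rating. A quick sketch:

bing_scent |>
  count(rating, sentiment) |>
  group_by(rating) |>
  mutate(prop = n / sum(n)) |>
  filter(sentiment == "negative")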

51.0.5.2 NRC

Reminder: NRC codes words with different emotions.

nrc <- read.csv(here::here("data","nrc_dictionary.csv"))
nrc_scent <-
  unnest_scent |>
  inner_join(nrc, by = "word")

51.0.5.2.1 Overall

Let’s take a peek at the overall amount of each sentiment expressed within our reviews.

nrc_scent |>
  count(sentiment) |>
  ggplot() +
  aes(x = fct_reorder(sentiment, n), y = n, fill = sentiment) +
  geom_col() +
  labs(
    title = "Overall NRC Sentiment of Customer Reviews",
    y = "Number of Words",
    x = "Sentiment"
  )

Similar to our BING analysis, we have a higher use of positive-type words than negative.

51.0.5.2.2 By Rating

Again, it’s useful to view this within the context of the review’s rating.

nrc_scent |>
  group_by(rating) |>
  count(sentiment) |>
  ggplot() +
  aes(x = fct_reorder(sentiment, n), y = n, fill = sentiment) +
  geom_col() +
  facet_wrap(~rating, scales = "free") +
  labs(
    title = "NRC Sentiment of Customer Reviews by Rating",
    y = "Number of Words",
    x = "Sentiment"
  ) +
  coord_flip()

Directing our attention to ratings 1 and 5, we see the same relationship as in the BING analysis.

51.0.6 Weighted log-odds

Let’s look at what words are used between ratings using the weighted log-odds.

First we use the helpful bind_log_odds() function from the tidylo package.

library(tidylo)

scent_log_odds <-
  unnest_scent |>
  group_by(rating) |>
  add_count(word, name = "n_rating") |> # count each word within the ratings
  distinct(word, .keep_all = TRUE) |> # keep one row per word within each rating
  bind_log_odds(rating, word, n_rating)
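
bind_log_odds() adds a log_odds_weighted column; the larger the value, the more specific that word is to its rating compared to the other ratings. A quick peek at the top of that ranking:

scent_log_odds |>
  ungroup() |>
  arrange(desc(log_odds_weighted)) |>
  select(rating, word, n_rating, log_odds_weighted) |>
  head()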

Now that we have our metric, let’s look at the important words for each rating. Let’s focus on the top 15 words per rating to reduce clutter.

scent_log_odds |>
  group_by(rating) |>
  top_n(15, log_odds_weighted) |>
  ggplot() +
  aes(x = fct_reorder(word, log_odds_weighted), y = log_odds_weighted, fill = factor(rating)) +
  geom_col() +
  facet_wrap(~rating, scales = "free_y") +
  coord_flip() +
  labs(
    x = "",
    y = "Weighted Log Odds",
    title = "Top 15 words with highest weighted log odds"
  ) +
  theme(legend.position = "none")

For our higher-rated reviews, we see words like “yummy”, “excellent”, and “affordable”. We also see “law”? Can you think of any reason why that word would appear?

For lower-rated reviews, we see “broken”, “unusable”, and “waste”.

An interesting way to think about “importance” is to compare how a word is used within a group against how it is used within the whole corpus. To do this, we plot how often the word occurs overall against its weighted log odds.

scent_log_odds |>
  top_n(10, n_rating) |>
  ggplot() +
  aes(x = n_total, y = log_odds_weighted, label = word, color = factor(rating)) +
  geom_point() +
  ggrepel::geom_text_repel(max.overlaps = 50) +
  labs(
    title = "How important is a word within a rating versus within the whole corpus?",
    subtitle = "Top 10 words per Customer Rating",
    x = "Count of occurrences within corpus",
    y = "Weighted Log odds",
    color = "Rating"
  ) +
  geom_hline(yintercept = 0, lty = 2, alpha = 0.2, size = 1.2)

Let’s digest what the above graph is saying.

Starting with the word “broken”, we see that within a rating of 1 it is extremely important. However, looking at the second point for “broken”, we see that it is much less important for rating 2, and similarly for rating 3.

We can also compare different words with similar importance. For example, the word “disappointed” within a rating of 2 has around the same importance as “wonderful” within a rating of 5.

Because there are 5 ratings on this graph, it can be easier to read if we just look at ratings of 1 and 5.

scent_log_odds |>
  filter(rating %in% c(1, 5)) |>
  top_n(20, n_rating) |>
  ggplot() +
  aes(x = n_total, y = log_odds_weighted, label = word, color = rating) +
  geom_point() +
  ggrepel::geom_text_repel(max.overlaps = 40) +
  labs(
    title = "How important is a word within a rating versus within the whole corpus?",
    subtitle = "Reviews with a Rating of 1 or 5; Top 20",
    x = "Count of occurrences within corpus",
    y = "Weighted Log Odds"
  ) +
  geom_hline(yintercept = 0, lty = 2, alpha = 0.2, size = 1.2)

Above we get a clearer picture of the importance of words between ratings of 1 and 5.

51.1 Contributions

Julia Silge deserves a huge amount of credit, not only for her great outline of text analysis within R but also for her inspiration.

Thank you to the following for your support and guidance: Alex Denison, Loni Hagen, Mary Falling.

51.2 References

Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data (First edition). O’Reilly.

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach (First edition). O’Reilly.