Part 51 Case Study 2: Product Reviews
Below we are going to look at Amazon reviews for scented candles.
Let’s read in our dataset.
library(tidyverse)
library(tidytext)

scented_reviews <-
  readxl::read_excel(here::here("data", "Scented_all.xlsx")) |>
  mutate(review_id = row_number()) |>
  janitor::clean_names() |> # A function that changes the column names to snake_case/cleans them up a little
  mutate(rating = as_factor(rating))
51.0.1 Clean Text
The first step is to unnest_tokens() our reviews.
unnest_scent <-
  scented_reviews |>
  unnest_tokens("word", "review",
    token = "words"
  )
51.0.2 Remove Stop-words
We then remove our normal stop words.
unnest_scent <-
  unnest_scent |>
  anti_join(stop_words, by = "word")
51.0.3 Removing Custom Stop-words
Usually there are words that we know are more “noise” than “signal”. To determine whether a word is “noise”, we can use a few different strategies.
One strategy is to skim through all of the unique words and find ones not already covered by the stop-word lexicon.
unnest_scent |>
  distinct(word) |>
  head()
Another strategy is to look at the most frequent words and determine if the word is useful or not.
unnest_scent |>
  count(word) |>
  arrange(desc(n)) |>
  head()
Once we decide which words to remove, we can collect them in a tribble().
custom_stopwords <-
  tribble(
    ~word, ~lexicon,
    "candle", "custom",
    "candles", "custom",
    "smell", "custom",
    "smells", "custom",
    "scent", "custom",
    "fragrance", "custom",
    "yankee", "custom",
    "love", "custom", # This word shows up so much it acts like a stop word
    "sooo", "custom", # This is an example of using text from the internet
    "soooo", "custom",
    "doesn", "custom",
    "don", "custom"
  )
We remove our custom_stopwords the same way we removed the regular ol’ stop_words.
unnest_scent <-
  unnest_scent |>
  anti_join(custom_stopwords, by = "word")
51.0.3.1 Remove Numbers/Non-alphabetic Characters
Depending on the type of data you’re working with, there may be more than just stop words you need to remove. Since we’re working with data from an online source, some odd characters and digits appear in the reviews. Using the stringr package, we can clean these up.
unnest_scent <-
  unnest_scent |>
  mutate(word = str_replace(word, pattern = "[^\x20-\x7E](.*)", replacement = "")) |> # Remove weird characters
  filter(
    !str_detect(word, "[0-9]+"), # Remove digits
    word != ""
  )
51.0.3.2 Removing Words That Show Up Fewer Than 10 Times
A common way to reduce the amount of noise within a corpus is to focus on words that show up at least n times.
unnest_scent <-
  unnest_scent |>
  add_count(word, name = "n_total") |>
  filter(n_total >= 10)
51.0.4 Word Frequency
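As a quick orientation, here is a minimal sketch that counts the cleaned tokens in unnest_scent and charts the most common ones:
unnest_scent |>
  count(word, sort = TRUE) |>
  slice_max(n, n = 15) |>
  ggplot() +
  aes(x = fct_reorder(word, n), y = n) +
  geom_col() +
  coord_flip() +
  labs(
    x = NULL,
    y = "Number of Occurrences",
    title = "Most Frequent Words Across All Reviews"
  )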
51.0.5 Sentiment Analysis
Let’s get a feel for our data (someone stop me, I’m on a roll).
We are going to work with the BING and NRC lexicons.
51.0.5.1 BING
Just a reminder, BING uses a positive/negative coding of each word.
bing <- read.csv(here::here("data", "bing_dictionary.csv"))
bing_scent <-
  unnest_scent |>
  inner_join(bing, by = "word")
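As a quick sanity check, here is a small sketch confirming that the BING lexicon uses only the two sentiment labels:
bing |>
  count(sentiment)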
51.0.5.1.1 Overall
bing_scent |>
  count(sentiment) |>
  ggplot() +
  aes(
    x = sentiment, y = n,
    fill = sentiment
  ) +
  geom_col() +
  labs(
    title = "Overall BING Sentiment of Customer Reviews",
    y = "Number of Words"
  )
According to the graph above, overall we have a lot more positive words than negative ones within our reviews.
51.0.5.1.2 By Rating
Now it might be useful to know how the number of positive and negative words changes within each rating.
bing_scent |>
  group_by(rating) |>
  count(sentiment) |>
  ggplot() +
  aes(x = sentiment, y = n, fill = sentiment) +
  geom_col() +
  facet_wrap(~rating, scales = "free") +
  labs(
    title = "BING Sentiment of Customer Reviews by Rating",
    y = "Number of Words"
  )
Intuitively, we expect an inverse relationship between the rating and the balance of positive and negative words. To put it another way, we expect people who rate the candle lower to use more negative words.
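One quick way to check that expectation is to compute the share of positive and negative words within each rating; a small sketch:
bing_scent |>
  count(rating, sentiment) |>
  group_by(rating) |>
  mutate(share = n / sum(n)) |> # proportion of each sentiment within the rating
  ungroup()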
51.0.5.2 NRC
Reminder: NRC codes words with different emotions.
nrc <- read.csv(here::here("data", "nrc_dictionary.csv"))
nrc_scent <-
  unnest_scent |>
  inner_join(nrc, by = "word")
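For a reminder of which emotion categories the NRC lexicon uses, a quick sketch is to count the lexicon’s sentiment labels:
nrc |>
  count(sentiment, sort = TRUE)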
51.0.5.2.1 Overall
Let’s take a peek at the overall amount of sentiment expressed within our reviews.
nrc_scent |>
  count(sentiment) |>
  ggplot() +
  aes(x = fct_reorder(sentiment, n), y = n, fill = sentiment) +
  geom_col() +
  labs(
    title = "Overall NRC Sentiment of Customer Reviews",
    y = "Number of Words",
    x = "Sentiment"
  )
Similar to our BING analysis, we see more positive-type words than negative ones.
51.0.5.2.2 By Rating
Again, it’s useful to look at this within the context of the rating each review gave.
nrc_scent |>
  group_by(rating) |>
  count(sentiment) |>
  ggplot() +
  aes(x = fct_reorder(sentiment, n), y = n, fill = sentiment) +
  geom_col() +
  facet_wrap(~rating, scales = "free") +
  labs(
    title = "NRC Sentiment of Customer Reviews by Rating",
    y = "Number of Words",
    x = "Sentiment"
  ) +
  coord_flip()
Directing our attention to ratings 1 and 5, we see the same relationship as in the BING analysis.
51.0.6 Weighted log-odds
Let’s look at which words distinguish the ratings using weighted log-odds.
First we use the helpful bind_log_odds() function from the tidylo package.
library(tidylo)

scent_log_odds <-
  unnest_scent |>
  group_by(rating) |>
  add_count(word, name = "n_rating") |> # count each word within the ratings
  distinct(word, .keep_all = TRUE) |> # we only need the word (`word`) and how many times it shows up (`n_rating`)
  bind_log_odds(rating, word, n_rating)
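To see what bind_log_odds() adds, here is a small sketch peeking at the rows with the highest weighted log-odds:
scent_log_odds |>
  ungroup() |>
  select(rating, word, n_rating, log_odds_weighted) |>
  arrange(desc(log_odds_weighted)) |>
  head()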
Now that we have our metric, let’s look at the important words for each rating. To reduce clutter, we’ll focus on the top 15 words per rating.
scent_log_odds |>
  group_by(rating) |>
  top_n(15, log_odds_weighted) |>
  ggplot() +
  aes(x = fct_reorder(word, log_odds_weighted), y = log_odds_weighted, fill = factor(rating)) +
  geom_col() +
  facet_wrap(~rating, scales = "free_y") +
  coord_flip() +
  labs(
    x = "",
    y = "Weighted Log Odds",
    title = "Top 15 words with highest weighted log odds"
  ) +
  theme(legend.position = "none")
For our higher-rated reviews, we see words like “yummy”, “excellent”, and “affordable”. But we also see “law”? Can you think of any reason why that word would show up?
For lower-rated reviews, we see “broken”, “unusable”, and “waste”.
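If a word such as “law” seems out of place, one way to investigate is to go back to the raw reviews that contain it; a quick sketch, assuming the review text lives in the review column used by unnest_tokens() above:
scented_reviews |>
  filter(str_detect(str_to_lower(review), "\\blaw\\b")) |> # word-boundary match for "law"
  select(rating, review) |>
  head()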
An interesting way to think about “importance” is to compare a word’s use within a group against the whole corpus. To do this, we plot how often the word occurs overall against its weighted log-odds.
scent_log_odds |>
  top_n(10, n_rating) |>
  ggplot() +
  aes(x = n_total, y = log_odds_weighted, label = word, color = factor(rating)) +
  geom_point() +
  ggrepel::geom_text_repel(max.overlaps = 50) +
  labs(
    title = "How important is a word within a rating versus within the whole corpus?",
    subtitle = "Top 10 words per Customer Rating",
    x = "Count of occurrences within corpus",
    y = "Weighted Log odds",
    color = "Rating"
  ) +
  geom_hline(yintercept = 0, lty = 2, alpha = 0.2, size = 1.2)
Let’s digest what the graph above is saying.
Starting with the word “broken”, we see that within a rating of 1 it is extremely important. However, looking at the second point for “broken”, it is much less important for a rating of 2, and similarly for a rating of 3.
We can also compare different words with similar importance. For example, the word “disappointed” within a rating of 2 has around the same importance as “wonderful” within a rating of 5.
Because there are 5 ratings on this graph, it is easier to focus on just ratings 1 and 5.
scent_log_odds |>
  filter(rating %in% c(1, 5)) |>
  top_n(20, n_rating) |>
  ggplot() +
  aes(x = n_total, y = log_odds_weighted, label = word, color = rating) +
  geom_point() +
  ggrepel::geom_text_repel(max.overlaps = 40) +
  labs(
    title = "How important is a word between a rating versus within the whole corpus?",
    subtitle = "Reviews with a Rating of 1 or 5; Top 20",
    x = "Count of occurrences within corpus",
    y = "Log odds"
  ) +
  geom_hline(yintercept = 0, lty = 2, alpha = 0.2, size = 1.2)
Above we get a clearer picture of the importance of words between ratings of 1 and 5.
51.1 Contributions
Julia Silge deserves a huge amount of credit, not only for a great outline of text analysis within R but also for her inspiration.
Thank you to the following for your support and guidance: Alex Denison, Loni Hagen, Mary Falling.
51.2 References
Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data (First edition). O’Reilly.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach (First edition). O’Reilly.