Text & Sentiment Analysis

Author

Danil Kimbaev

#####Setting Up#####
library(tidyverse)
library(tidytext)
library(topicmodels)
library(topicdoc)
library(wordcloud)
library(kableExtra)

hotel <- read_csv("hawaiian_hotel_reviews.csv")

1 Introduction

This document covers text & sentiment analysis of a Hawaiian hotel’s reviews, as well as topic modeling analysis of McDonald’s reviews from across the USA.

2 Hawaiian Hotel Reviews Text & Sentiment Analysis

2.1 Top 20 most common words

The top 20 words that appear in the hotel’s reviews are shown below. However, without context, these words carry little meaning.

data(stop_words)

hotel_counts <- hotel %>% 
  unnest_tokens(word, review, token = "words") %>% 
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  top_n(20)

ggplot(hotel_counts) +
  # fill is a constant colour, so it is set outside aes() rather than mapped
  geom_col(mapping = aes(x = n, y = reorder(word, n)), fill = "#0099F8") +
  labs(y = NULL) +
  theme_light()

2.2 Top 10 positive & negative sentiment words

Sentiment analysis makes the picture slightly clearer. The 10 most commonly occurring positive & negative words are shown in the tables below. However, these counts still lack context: a preceding word such as “not” can reverse a word’s sentiment, so no firm conclusion can be drawn from them alone.

sentiments <- get_sentiments("bing")

hotel_sentiments <- hotel %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  inner_join(sentiments)

table1 <- hotel_sentiments %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  top_n(10)

table2 <- hotel_sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE) %>%
  top_n(10)

knitr::kable(table1, 
             align = "lr",
             caption = "Top 10 Words Associated With Positivity",
             col.names = c("Word", "Count"),
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")
knitr::kable(table2, 
             align = "lr",
             caption = "Top 10 Words Associated With Negativity",
             col.names = c("Word", "Count"),
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")
Top 10 Words Associated With Positivity
Word Count
nice 7274
clean 3574
beautiful 3560
friendly 2753
free 2563
recommend 2355
loved 2052
amazing 1940
helpful 1898
enjoyed 1867
Top 10 Words Associated With Negativity
Word Count
expensive 2809
crowded 2450
bad 1147
complex 1011
pricey 835
noise 790
disappointed 769
hard 729
cheap 575
overpriced 572
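
The negation caveat above can be checked directly. The sketch below is a minimal illustration on four made-up reviews (toy_reviews and the negations list are assumptions, not part of the hotel data): it tokenizes the text into bigrams and keeps the sentiment words whose preceding word is a negation.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical reviews, used only to illustrate the technique
toy_reviews <- tibble(
  id = 1:4,
  review = c("The room was not clean",
             "Staff were not friendly",
             "Nice view and a clean room",
             "The food was not bad")
)

# Assumed list of negation words; extend as needed
negations <- c("not", "no", "never", "without")

negated_words <- toy_reviews %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% negations) %>%                       # keep "not <word>" pairs
  inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
  count(word1, word2, sentiment, sort = TRUE)

negated_words
```

Applied to the real data, toy_reviews would be replaced with hotel. A large count for a pair such as “not clean” would suggest that the single-word counts overstate positive sentiment.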

2.3 Sentiment change over time

The chart below shows how positive & negative sentiment changes over time, with reviews grouped into blocks of 150 consecutive ids. Positive sentiment words far outnumber negative ones throughout, however the positive count also falls by more than the negative count. This may suggest that customers are growing more dissatisfied with the hotel, although a change in raw counts alone cannot confirm it.

hotel_sentiments <- hotel %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  inner_join(sentiments)

hotel_sentiments <- mutate(hotel_sentiments, block = id%/%150)

hotel_blocks <- hotel_sentiments %>%
  group_by(block) %>%
  count(sentiment)

ggplot(hotel_blocks) +
  geom_col(mapping = aes(x = block, y = n), fill = "#0099F8") +
  facet_wrap(~ sentiment, nrow = 1) +
  ylab("# Sentiments") +
  xlab("") +
  theme_light()
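
The block comparison above uses raw counts, which also depend on how many sentiment words fall into each block. As a hedged alternative, the sketch below computes the share of each sentiment within a block. The data are made up (toy_sentiments is an assumption), and only the id %/% 150 grouping is taken from the code above.

```r
library(dplyr)

# Hypothetical sentiment-tagged words; `id` plays the role of the review id
toy_sentiments <- tibble(
  id = c(0, 10, 149, 150, 200, 299, 300, 310),
  sentiment = c("positive", "positive", "negative",
                "positive", "negative", "negative",
                "positive", "positive")
)

toy_shares <- toy_sentiments %>%
  mutate(block = id %/% 150) %>%    # ids 0-149 -> block 0, 150-299 -> block 1, ...
  count(block, sentiment) %>%
  group_by(block) %>%
  mutate(share = n / sum(n)) %>%    # share of each sentiment within its block
  ungroup()

toy_shares
```

Plotting share instead of n for the positive rows would show whether the balance of sentiment is shifting, independently of how many words each block contains.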

2.4 Most commonly occurring bigrams

Below is a table of the most commonly occurring bigrams within the reviews. Customers frequently mention landmarks such as “Rainbow Tower” or “Waikiki Beach”. The hotel’s amenities are also commonly mentioned, such as “Private Pool” or “Breakfast Buffet”.

hotel_bigrams <- hotel %>% 
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

table3 <- count(hotel_bigrams, bigram, sort = TRUE) %>%
  top_n(30)

knitr::kable(table3, 
             align = "lr",
             caption = "Top 30 Bigrams",
             col.names = c("Bigram", "Count"),
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")
Top 30 Bigrams
Bigram Count
rainbow tower 3567
hawaiian village 2909
hilton hawaiian 2821
ocean view 2332
diamond head 2180
waikiki beach 1710
tapa tower 1625
ali'i tower 1583
front desk 1328
resort fee 992
walking distance 973
friday night 934
abc store 913
ala moana 894
kalia tower 894
hilton honors 714
ocean front 648
head tower 581
highly recommend 580
abc stores 539
super pool 517
minute walk 485
alii tower 476
tropics bar 458
customer service 449
partial ocean 437
private pool 424
north shore 422
breakfast buffet 412
moana shopping 390

2.5 Most commonly occurring trigrams

Looking at the most commonly occurring trigrams reveals more information. Once again, the majority of trigrams mention landmarks such as “Diamond Head Tower” or “Hilton Hawaiian Village”. Combined with trigrams such as “Easy Walking Distance” or “10 Minute Walk”, this potentially indicates that the hotel is in a good location. However, the 3rd most common trigram, “Partial Ocean View”, is slightly worrying, as it potentially indicates a pain point for customers.

hotel_trigrams <- hotel %>% 
  unnest_tokens(trigram, review, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

table4 <- count(hotel_trigrams, trigram, sort = TRUE) %>%
  top_n(30)

knitr::kable(table4, 
             align = "lr",
             caption = "Top 30 Trigrams",
             col.names = c("Trigram", "Count"),
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")
Top 30 Trigrams
Trigram Count
hilton hawaiian village 2614
diamond head tower 575
partial ocean view 389
ala moana shopping 365
friday night fireworks 358
round table pizza 205
moana shopping centre 171
ala moana mall 147
front desk staff 144
10 minute walk 137
hawaiian village waikiki 136
15 minute walk 135
moana shopping center 126
wailana coffee house 121
day resort fee 115
village waikiki beach 109
waikiki beach resort 108
hawaiian village resort 95
rainbow tower ocean 90
2 double beds 89
king size bed 88
facing diamond head 85
tropics bar grill 80
hilton hawaiin village 79
diamond head view 75
daily resort fee 70
hawaii 5 0 70
30 resort fee 69
easy walking distance 69
ice cream shop 69

2.6 Summarising reviews containing various key terms

2.6.1 Lagoon reviews

Reviews containing the word “Lagoon” offered valuable insight into the hotel & resort. Customers reported that staff were very friendly and helpful, and that freebies & upgrades were common. Holidaymakers stay in one of several towers, which have beautiful views & each has its own super pool. The resort is huge & has everything a vacationer needs - they wouldn’t even have to leave if they didn’t want to. The hotel sits on the ocean-side, however the beach is overcrowded. There are lots of shops nearby, with the ABC store being the main choice for snacks and drinks.

lagoon_reviews <- filter(hotel, str_detect(review, regex("lagoon", ignore_case = TRUE)))

write_csv(lagoon_reviews, "lagoon_reviews.csv")

2.6.2 Rainbow tower reviews

Reviews containing the words “Rainbow Tower” were quite similar, but added new pieces of information. Again, the views from the towers were reported as beautiful, the staff as friendly & the rooms as good, though nothing out of the ordinary. There are also many restaurants in the surrounding area & the hotel/resort grounds are beautiful. However, reviewers reported that everything is expensive & that the hotel charges for every single thing.

rainbowtower_reviews <- filter(hotel, str_detect(review, regex("rainbow tower", ignore_case = TRUE)))

write_csv(rainbowtower_reviews, "rainbow_reviews.csv")

2.6.3 Ala Moana shopping reviews

Reviews containing the words “Ala Moana Shopping” uncovered even more information. They repeated what is reported in sections 2.6.1 & 2.6.2 regarding views, service and rooms, but also revealed that the hotel is close to a big shopping center called Ala Moana. The shopping center contains an expansive food court as well as many stores, including various designer stores. Reviewers mention that the street outside the hotel is very busy & that hotel food/drinks are a “complete rip off”.

alamoanashopping_reviews <- filter(hotel, str_detect(review, regex("ala moana shopping", ignore_case = TRUE)))

write_csv(alamoanashopping_reviews, "ala_moana_shopping.csv")

2.7 Word clouds

Below are word clouds showcasing the most common positive & negative terms. The positive word cloud showcases words which appear over 1000 times whilst the negative word cloud showcases words which appear over 500 times.

Positive

hotel_sentiments2 <- hotel %>% 
  unnest_tokens(word, review, token = "words") %>% 
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(word, sentiment, sort = TRUE)

hotel_pos_sentiments <- filter(hotel_sentiments2, sentiment == "positive")

wordcloud(hotel_pos_sentiments$word, 
          hotel_pos_sentiments$n, 
          min.freq = 1000, 
          colors = brewer.pal(8, "Dark2"))

Negative

hotel_neg_sentiments <- filter(hotel_sentiments2, sentiment == "negative")

wordcloud(hotel_neg_sentiments$word, 
          hotel_neg_sentiments$n, 
          min.freq = 500, 
          colors = brewer.pal(8, "Dark2"))

3 McDonald’s Reviews Topic Modeling Analysis

An LDA model with 14 topics was fitted to McDonald’s restaurant reviews using collapsed Gibbs sampling. A grid of bar charts showing the top 10 terms in each topic is seen below.

##a##
mcds <- read_csv("mcdonalds_reviews.csv")

data(stop_words)

mcds_count <- mcds %>% 
  unnest_tokens(word, review, token = "words") %>% 
  anti_join(stop_words) %>%
  count(id, word, sort = TRUE)
  
mcds_dtm <- cast_dtm(mcds_count, id, word, n)

##b##
mcds_lda <- LDA(mcds_dtm, method = "Gibbs", k = 14, control = list(seed = 1234))

mcds_lda_beta <- tidy(mcds_lda)

mcds_lda_top_terms <- mcds_lda_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

##c - This includes the Visual evaluation and Numeric Evaluation. Not labelled to make report look cleaner##
mcds_lda_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each LDA topic", x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 6, scales = "free")

3.1 Visual evaluation

Below is a summary of each topic.

Topic 1 - Appears to describe something to do with the drive through.

Topic 2 - Something related to the manager being called to change a meal. Does not seem coherent.

Topic 3 - This topic describes an experience with the staff. Does not seem coherent.

Topic 4 - Possibly describes the ice cream machine running 24 hours, night or day.

Topic 5 - Seems to be describing a drink order being wrong 2 times.

Topic 6 - Something related to homeless people around the restaurant. Potentially making eating unpleasant.

Topic 7 - This topic is negative. It is quite coherent and describes bad customer service, with rude employees and dirty premises.

Topic 8 - Describes the McDonald’s breakfast menu. Does not provide anything actionable.

Topic 9 - This topic entails long waiting times, queues and multiple people waiting.

Topic 10 - This topic does not seem coherent, and describes something to do with the location of the store.

Topic 11 - This topic seems positive. It appears to describe clean premises and friendly staff.

Topic 12 - Appears to be about the kids area within the restaurant, and possibly the tables being clean.

Topic 13 - This topic seems negative, mentioning various menu items being cold.

Topic 14 - This topic seems to be centered around fast food being quick and cheap, but of poor quality.

3.2 Numeric evaluation

A table with various diagnostic figures is featured below. Using the figures in this table, the quality of each topic can be assessed.

topic_quality <- topic_diagnostics(mcds_lda, mcds_dtm)

knitr::kable(topic_quality, 
             align = "lr",
             caption = "Topic Diagnostics",
             table.attr = 'data-quarto-disable-processing = "false"') %>%
  kable_styling(full_width = T) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")
Topic Diagnostics
topic_num topic_size mean_token_length dist_from_corpus tf_df_dist doc_prominence topic_coherence topic_exclusivity
1 570.5098 4.0 0.6040340 4.617374 11 -144.1279 9.956960
2 663.8371 5.2 0.6142948 2.460084 8 -170.1142 9.971655
3 654.1112 5.9 0.6027599 2.612595 5 -171.5748 9.906536
4 676.0556 4.2 0.6160793 2.406061 7 -150.8710 9.957347
5 591.0666 4.6 0.6057051 3.801706 6 -142.9228 9.908269
6 610.0399 6.0 0.6147538 5.246863 4 -168.4202 9.938810
7 558.3095 6.1 0.6182592 2.512411 1 -150.2017 9.989497
8 622.4032 5.1 0.6212878 4.021260 19 -144.5067 9.956234
9 548.5725 5.1 0.6182959 3.293309 9 -122.2372 9.984285
10 648.2789 5.0 0.6047939 2.824690 6 -187.9780 9.943322
11 630.1304 5.2 0.6158529 3.249813 2 -169.4295 9.945004
12 672.2095 4.7 0.6116022 2.242199 9 -169.6033 9.911122
13 583.0784 5.2 0.6197754 3.761984 22 -144.0123 9.953895
14 604.3973 4.6 0.6136522 4.301984 2 -163.0979 9.931259

Topic size - The top 3 topics by size are Topic 4, 12, and 2. These topics are generally positive or neutral which could be an indication that customers are generally satisfied with the restaurants. The 3 smallest topics are Topic 9, 7 and 1, which are all negative. This is good news for McDonald’s as it seems negative reviews are not as prevalent.

Mean token length - Topics 7, 6 and 3 have the largest average word lengths. Interestingly, the averages for topics 6 and 3 appear to be skewed by a couple of long words in each topic.

Topic coherence - According to the topic coherence score, the words in Topic 9, 5 and 13 make the most sense when put together. Topics 10, 3 and 2 have the lowest coherence scores and as pointed out in the visual analysis - they make the least sense.

Topic exclusivity - The topic exclusivity score reflects how exclusive each topic’s top terms are to that topic, which makes a topic more distinct and easier to interpret. Topics 7, 9, and 2 have the highest scores and, except for Topic 2, their meaning is easy to interpret from their content. Topics 3, 5, and 12 have the lowest exclusivity scores, with Topic 5’s low exclusivity contrasting with its high coherence score.

Highest quality topics - Judging from the diagnostics, the highest quality topics appear to be 9 and 7. This assessment is supported by the visual evaluation: when reading the words within these topics, it is easy to determine the sentiment, which is negative.

Lowest quality topics - The diagnostics indicate that the lowest quality topics are 3 and 10. Again, this assessment is supported by the visual evaluation - it is difficult to determine the meaning or sentiment behind each of these topics.
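
One way to cross-check the quality ranking above is to combine two of the diagnostics into a single ordering. The sketch below is an assumption rather than the method used in this report: it re-enters the topic_coherence and topic_exclusivity columns from the diagnostics table, ranks the topics on each (higher is better), and averages the two ranks. The mean_rank score is hypothetical and weights both diagnostics equally.

```r
library(dplyr)

# Values copied from the Topic Diagnostics table above
topic_scores <- tibble(
  topic_num = 1:14,
  topic_coherence = c(-144.1279, -170.1142, -171.5748, -150.8710, -142.9228,
                      -168.4202, -150.2017, -144.5067, -122.2372, -187.9780,
                      -169.4295, -169.6033, -144.0123, -163.0979),
  topic_exclusivity = c(9.956960, 9.971655, 9.906536, 9.957347, 9.908269,
                        9.938810, 9.989497, 9.956234, 9.984285, 9.943322,
                        9.945004, 9.911122, 9.953895, 9.931259)
)

ranked <- topic_scores %>%
  mutate(
    coherence_rank   = min_rank(desc(topic_coherence)),    # 1 = most coherent
    exclusivity_rank = min_rank(desc(topic_exclusivity)),  # 1 = most exclusive
    mean_rank        = (coherence_rank + exclusivity_rank) / 2
  ) %>%
  arrange(mean_rank)

ranked
```

With these values, topics 9 and 7 come out on top and topic 3 at the bottom, which is consistent with the assessment above. In the full pipeline, topic_scores could be replaced with topic_quality.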

3.3 Recommendations

The topic modeling analysis revealed insights which give McDonald’s an idea of where to focus its efforts across restaurants.

The majority of its efforts should be focused on staff training and on hiring quality staff. Strengthening the interview screening process or improving employee benefits could attract better quality staff. Some key issues which could be resolved with adequate training are listed below:

1. Unhelpful and/or rude staff

2. Incorrect orders

3. Cold food items being served

4. Poor queue management