Text & Sentiment Analysis

Author

Danil Kimbaev

#####Setting Up#####
library(tidyverse)
library(tidytext)
library(topicmodels)
library(topicdoc)
library(wordcloud)
library(kableExtra)

hotel <- read_csv("hawaiian_hotel_reviews.csv")

1 Introduction

This document covers text & sentiment analysis of a Hawaiian hotel’s reviews, as well as topic modelling analysis of McDonald’s reviews from across the USA.

2 Hawaiian Hotel Reviews Text & Sentiment Analysis

2.1 Top 20 most common words

The 20 most common words in the hotel’s reviews, after removing stop words, are shown below. Without context, however, these words carry little meaning.

data(stop_words)

hotel_counts <- hotel %>% 
  unnest_tokens(word, review, token = "words") %>% 
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  top_n(20)

ggplot(hotel_counts) +
  # fill is a fixed colour, so it is set outside aes() rather than mapped
  geom_col(mapping = aes(x = n, y = reorder(word, n)), fill = "#0099F8") +
  labs(y = NULL) +
  theme_light() +
  theme(legend.position = "none")

2.2 Top 10 positive & negative sentiment words

Sentiment analysis makes the picture slightly clearer. The 10 most frequent positive & negative words are shown in the tables below. Once again, however, single words lack context: a preceding word such as “not” can reverse the sentiment, so these counts should be read with caution.

sentiments <- get_sentiments("bing")

hotel_sentiments <- hotel %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  inner_join(sentiments)

table1 <- hotel_sentiments %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  top_n(10)

table2 <- hotel_sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE) %>%
  top_n(10)

knitr::kable(table1, 
             align = "lr",
             caption = "Top 10 Words Associated With Positivity",
             col.names = c("Word", "Count"),
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")
knitr::kable(table2, 
             align = "lr",
             caption = "Top 10 Words Associated With Negativity",
             col.names = c("Word", "Count"),
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")

Top 10 Words Associated With Positivity
Word	Count
nice	7274
clean	3574
beautiful	3560
friendly	2753
free	2563
recommend	2355
loved	2052
amazing	1940
helpful	1898
enjoyed	1867

Top 10 Words Associated With Negativity
Word	Count
expensive	2809
crowded	2450
bad	1147
complex	1011
pricey	835
noise	790
disappointed	769
hard	729
cheap	575
overpriced	572
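
The negation caveat noted in this section can be checked by looking at bigrams whose first word is a negator. The sketch below is illustrative only: `toy_reviews`, `toy_lexicon` and the list of negation words are made-up stand-ins, not part of the original analysis. In practice the `hotel` data and the bing lexicon would take their place.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy reviews standing in for the hotel data
toy_reviews <- tibble(
  id = 1:3,
  review = c("The room was not clean",
             "Staff were not friendly",
             "The pool was clean and nice")
)

# Hypothetical mini lexicon standing in for get_sentiments("bing")
toy_lexicon <- tibble(
  word = c("clean", "friendly", "nice"),
  sentiment = "positive"
)

# Assumed list of negation words; extend as needed
negators <- c("not", "no", "never", "without")

# Count sentiment words that directly follow a negator
negated <- toy_reviews %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% negators) %>%
  inner_join(toy_lexicon, by = c("word2" = "word")) %>%
  count(word1, word2, sentiment, sort = TRUE)

negated
```

Words that appear often in this table are candidates for being miscounted by the single-word sentiment totals.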

2.3 Sentiment change over time

The chart below shows how positive & negative sentiment changes across consecutive blocks of 150 reviews, grouped by review id. There is far more positive sentiment than negative throughout; however, positive sentiment also shows the larger decrease. If review ids follow chronological order, this suggests that customers are growing more dissatisfied with the hotel.

hotel_sentiments <- hotel %>% 
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  inner_join(sentiments)

hotel_sentiments <- mutate(hotel_sentiments, block = id%/%150)

hotel_blocks <- hotel_sentiments %>%
  group_by(block) %>%
  count(sentiment)

ggplot(hotel_blocks) +
  geom_col(mapping = aes(x = block, y = n), fill = "#0099F8") +
  facet_wrap(~ sentiment, nrow = 1) +
  xlab("") +
  ylab("# Sentiments") +
  theme_light()
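
Raw counts per block depend on how much text each block contains, so a falling positive count does not by itself show that the mood has changed. As a rough sketch, the positive share per block can be computed instead. The `toy_ids` and `toy_blocks` values below are made up for illustration and are not results from the hotel data.

```r
library(dplyr)
library(tidyr)

# Integer division assigns ids to blocks of 150: here 0, 0, 1, 1, 2
toy_ids <- c(1, 149, 150, 299, 300)
toy_ids %/% 150

# Hypothetical per-block sentiment counts, shaped like hotel_blocks
toy_blocks <- tibble(
  block = c(0, 0, 1, 1),
  sentiment = c("negative", "positive", "negative", "positive"),
  n = c(20, 80, 15, 45)
)

# One row per block, with the share of sentiment words that are positive
block_share <- toy_blocks %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(positive_share = positive / (positive + negative))

block_share
```

A declining `positive_share` would be stronger evidence of growing dissatisfaction than a declining positive count alone.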

2.4 Most commonly occurring bigrams

Below is a table featuring the most commonly occurring bigrams within the reviews. Customers frequently mention landmarks such as “Rainbow Tower” or “Waikiki Beach”. The hotel’s amenities, such as “Private Pool” or “Breakfast Buffet”, are also commonly mentioned.

hotel_bigrams <- hotel %>% 
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

table3 <- count(hotel_bigrams, bigram, sort = TRUE) %>%
  top_n(30)

knitr::kable(table3, 
             align = "lr",
             caption = "Top 30 Bigrams",
             col.names = c("Bigram", "Count"),
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")

Top 30 Bigrams
Bigram	Count
rainbow tower	3567
hawaiian village	2909
hilton hawaiian	2821
ocean view	2332
diamond head	2180
waikiki beach	1710
tapa tower	1625
ali'i tower	1583
front desk	1328
resort fee	992
walking distance	973
friday night	934
abc store	913
ala moana	894
kalia tower	894
hilton honors	714
ocean front	648
head tower	581
highly recommend	580
abc stores	539
super pool	517
minute walk	485
alii tower	476
tropics bar	458
customer service	449
partial ocean	437
private pool	424
north shore	422
breakfast buffet	412
moana shopping	390
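
The table lists some bigrams more than once under different spellings (“ali'i tower” and “alii tower”, “abc store” and “abc stores”). A minimal sketch of merging such variants is shown below. It uses the four counts from the table above, while the two normalisation rules are assumptions chosen for this example rather than a general solution.

```r
library(dplyr)
library(stringr)

# Counts copied from the bigram table above
variant_counts <- tibble(
  bigram = c("ali'i tower", "alii tower", "abc store", "abc stores"),
  n = c(1583, 476, 913, 539)
)

# Assumed normalisation rules: drop apostrophes, singularise "stores"
merged_counts <- variant_counts %>%
  mutate(
    bigram = str_remove_all(bigram, "'"),
    bigram = str_replace(bigram, "stores$", "store")
  ) %>%
  group_by(bigram) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))

merged_counts
```

Applying the same kind of clean-up to the review text before tokenising would keep variants from splitting a term's count across several rows.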

2.5 Most commonly occurring trigrams

Looking at the most commonly occurring trigrams reveals more information. Once again, the majority of trigrams mention landmarks such as “Diamond Head Tower” or “Hilton Hawaiian Village”. In combination with trigrams such as “Easy Walking Distance” or “10 Minute Walk”, this potentially indicates that the hotel is in a good location. However, the 3rd most common trigram, “Partial Ocean View”, is slightly worrying, as it potentially indicates a pain point for customers.

hotel_trigrams <- hotel %>% 
  unnest_tokens(trigram, review, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

table4 <- count(hotel_trigrams, trigram, sort = TRUE) %>%
  top_n(30)

knitr::kable(table4, 
             align = "lr",
             caption = "Top 30 Trigrams",
             col.names = c("Trigram", "Count"),
             table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")

Top 30 Trigrams
Trigram	Count
hilton hawaiian village	2614
diamond head tower	575
partial ocean view	389
ala moana shopping	365
friday night fireworks	358
round table pizza	205
moana shopping centre	171
ala moana mall	147
front desk staff	144
10 minute walk	137
hawaiian village waikiki	136
15 minute walk	135
moana shopping center	126
wailana coffee house	121
day resort fee	115
village waikiki beach	109
waikiki beach resort	108
hawaiian village resort	95
rainbow tower ocean	90
2 double beds	89
king size bed	88
facing diamond head	85
tropics bar grill	80
hilton hawaiin village	79
diamond head view	75
daily resort fee	70
hawaii 5 0	70
30 resort fee	69
easy walking distance	69
ice cream shop	69

2.6 Summarising reviews containing various key terms

2.6.1 Lagoon reviews

Reviews containing the word “Lagoon” offered valuable insight into the hotel & resort. Customers reported that staff were very friendly and helpful, and that freebies & upgrades were common. Holidaymakers stay in one of several towers, each with beautiful views & its own super pool. The resort is huge & has everything a vacationer needs - they would not even have to leave the resort if they did not want to. The hotel sits on the ocean-side, however the beach is overcrowded. There are lots of shops nearby, with the ABC store being the main choice for snacks and drinks.

lagoon_reviews <- filter(hotel, str_detect(review, regex("lagoon", ignore_case = TRUE)))

write_csv(lagoon_reviews, "lagoon_reviews.csv")

2.6.2 Rainbow tower reviews

Reviews containing the words “Rainbow Tower” were quite similar, but added new pieces of information. Again, reviewers described beautiful views from the towers, friendly staff & good rooms, although nothing out of the ordinary. There are also many restaurants in the surrounding area & the hotel/resort grounds are beautiful. However, reviewers reported that everything is expensive & that the hotel charges for every single thing.

rainbowtower_reviews <- filter(hotel, str_detect(review, regex("rainbow tower", ignore_case = TRUE)))

write_csv(rainbowtower_reviews, "rainbow_reviews.csv")

2.6.3 Ala Moana shopping reviews

Reviews containing the words “Ala Moana Shopping” uncovered even more information. These reviews repeat what is reported in 2.6.1 & 2.6.2 regarding views, service and rooms, but they also reveal that the hotel is close to a big shopping center called Ala Moana. The shopping center contains an expansive food court as well as many stores, including various designer stores. Reviewers mention that the street outside the hotel is very busy & that hotel food/drinks are a “complete rip off”.

alamoanashopping_reviews <- filter(hotel, str_detect(review, regex("ala moana shopping", ignore_case = TRUE)))

write_csv(alamoanashopping_reviews, "ala_moana_shopping.csv")
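
The three filters in this section follow the same pattern, so the number of reviews matching each key term can be tallied in one pass. The sketch below assumes a made-up `toy_hotel` table in place of the real `hotel` data; only the search terms come from the code above.

```r
library(dplyr)
library(stringr)

# Hypothetical toy reviews standing in for the hotel data
toy_hotel <- tibble(
  id = 1:4,
  review = c("Loved the Lagoon at sunset",
             "Rainbow Tower had great views",
             "Walked to Ala Moana Shopping Center",
             "The lagoon and the Rainbow Tower were busy")
)

key_terms <- c("lagoon", "rainbow tower", "ala moana shopping")

# Count the reviews that mention each term, ignoring case
term_counts <- tibble(
  term = key_terms,
  reviews = vapply(
    key_terms,
    function(term) {
      sum(str_detect(toy_hotel$review, regex(term, ignore_case = TRUE)))
    },
    integer(1),
    USE.NAMES = FALSE
  )
)

term_counts
```

Knowing how many reviews sit behind each summary helps judge how much weight the summary deserves.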

2.7 Word clouds

Below are word clouds showcasing the most common positive & negative terms. The positive word cloud showcases words which appear over 1000 times whilst the negative word cloud showcases words which appear over 500 times.

Positive

hotel_sentiments2 <- hotel %>% 
  unnest_tokens(word, review, token = "words") %>% 
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(word, sentiment, sort = TRUE)

hotel_pos_sentiments <- filter(hotel_sentiments2, sentiment == "positive")

wordcloud(hotel_pos_sentiments$word, 
          hotel_pos_sentiments$n, 
          min.freq = 1000, 
          colors = brewer.pal(8, "Dark2"))

Negative

hotel_neg_sentiments <- filter(hotel_sentiments2, sentiment == "negative")

wordcloud(hotel_neg_sentiments$word, 
          hotel_neg_sentiments$n, 
          min.freq = 500, 
          colors = brewer.pal(8, "Dark2"))

3 McDonald’s Reviews Topic Modeling Analysis

An LDA (Latent Dirichlet Allocation) model with 14 topics was fitted using collapsed Gibbs sampling to perform topic modeling analysis on McDonald’s restaurant reviews. A grid of bar charts showing the top 10 terms in each topic is seen below.

## (a) Load the reviews and build a document-term matrix ##
mcds <- read_csv("mcdonalds_reviews.csv")

data(stop_words)

mcds_count <- mcds %>% 
  unnest_tokens(word, review, token = "words") %>% 
  anti_join(stop_words) %>%
  count(id, word, sort = TRUE)
  
mcds_dtm <- cast_dtm(mcds_count, id, word, n)

## (b) Fit the LDA model and extract the top 10 terms per topic ##
mcds_lda <- LDA(mcds_dtm, method = "Gibbs", k = 14, control = list(seed = 1234))

mcds_lda_beta <- tidy(mcds_lda)

mcds_lda_top_terms <- mcds_lda_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

## (c) Plot the top 10 terms per topic; the visual and numeric evaluations follow in 3.1 and 3.2 ##
mcds_lda_top_terms %>%
  # reorder_within() orders terms separately inside each topic facet
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each LDA topic", x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 6, scales = "free")
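
The steps above (tokenise, count, cast to a document-term matrix, fit LDA) can be tried end to end on a few made-up reviews. The sketch below is illustrative only: `toy_reviews` is invented and `k = 2` is an arbitrary choice for such a tiny corpus. It also extracts the per-document topic proportions (gamma), which complement the per-topic word probabilities (beta) plotted above.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy reviews standing in for the McDonald's data
toy_reviews <- tibble(
  id = 1:4,
  review = c("cold fries and a cold burger",
             "the burger was cold again",
             "rude staff and a rude manager",
             "the manager and staff were rude")
)

# Tokenise, drop stop words, count words per review, cast to a DTM
toy_dtm <- toy_reviews %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word) %>%
  cast_dtm(id, word, n)

# Fit a 2-topic model with Gibbs sampling; the seed makes it repeatable
toy_lda <- LDA(toy_dtm, k = 2, method = "Gibbs",
               control = list(seed = 1234))

# One row per document-topic pair; gamma sums to 1 within a document
toy_gamma <- tidy(toy_lda, matrix = "gamma")

toy_gamma
```

On the real model, the same `matrix = "gamma"` call would show which reviews are dominated by which topic, which can help when reading example reviews for a topic.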

3.1 Visual evaluation

Below is a summary of each topic.

Topic 1 - Appears to describe something to do with the drive through.

Topic 2 - Something related to the manager being called to change a meal. Does not seem coherent.

Topic 3 - This topic describes an experience with the staff. Does not seem coherent.

Topic 4 - Possibly describes the ice cream machine running 24 hours, night or day.

Topic 5 - Seems to be describing a drink order being wrong 2 times.

Topic 6 - Something related to homeless people around the restaurant. Potentially making eating unpleasant.

Topic 7 - This topic is negative. It is quite coherent and describes bad customer service, with rude employees and dirty premises.

Topic 8 - Describes the McDonald’s breakfast menu. Does not provide anything actionable.

Topic 9 - This topic entails long waiting times, queues and multiple people waiting.

Topic 10 - This topic does not seem coherent, and describes something to do with the location of the store.

Topic 11 - This topic seems positive. It appears to describe clean premises and friendly staff.

Topic 12 - Appears to be about the kids area within the restaurant, and possibly the tables being clean.

Topic 13 - This topic seems negative, mentioning various menu items being cold.

Topic 14 - This topic seems to be centered around fast food being quick and cheap but of poor quality.

3.2 Numeric evaluation

A table with various diagnostic figures is featured below. Using the figures in this table, the quality of each topic can be assessed.

topic_quality <- topic_diagnostics(mcds_lda, mcds_dtm)

knitr::kable(topic_quality, 
             align = "lr",
             caption = "Topic Diagnostics",
             table.attr = 'data-quarto-disable-processing = "false"') %>%
  kable_styling(full_width = T) %>%
    row_spec(0, bold = TRUE, color = "white", background = "black") %>%
    column_spec(1, bold = TRUE, color = "#0099F8")

Topic Diagnostics
topic_num	topic_size	mean_token_length	dist_from_corpus	tf_df_dist	doc_prominence	topic_coherence	topic_exclusivity
1	570.5098	4.0	0.6040340	4.617374	11	-144.1279	9.956960
2	663.8371	5.2	0.6142948	2.460084	8	-170.1142	9.971655
3	654.1112	5.9	0.6027599	2.612595	5	-171.5748	9.906536
4	676.0556	4.2	0.6160793	2.406061	7	-150.8710	9.957347
5	591.0666	4.6	0.6057051	3.801706	6	-142.9228	9.908269
6	610.0399	6.0	0.6147538	5.246863	4	-168.4202	9.938810
7	558.3095	6.1	0.6182592	2.512411	1	-150.2017	9.989497
8	622.4032	5.1	0.6212878	4.021260	19	-144.5067	9.956234
9	548.5725	5.1	0.6182959	3.293309	9	-122.2372	9.984285
10	648.2789	5.0	0.6047939	2.824690	6	-187.9780	9.943322
11	630.1304	5.2	0.6158529	3.249813	2	-169.4295	9.945004
12	672.2095	4.7	0.6116022	2.242199	9	-169.6033	9.911122
13	583.0784	5.2	0.6197754	3.761984	22	-144.0123	9.953895
14	604.3973	4.6	0.6136522	4.301984	2	-163.0979	9.931259

Topic size - The top 3 topics by size are Topic 4, 12, and 2. These topics are generally positive or neutral which could be an indication that customers are generally satisfied with the restaurants. The 3 smallest topics are Topic 9, 7 and 1, which are all negative. This is good news for McDonald’s as it seems negative reviews are not as prevalent.

Mean token length - Topics 7, 6 and 3 have the largest average word lengths. Interestingly, the averages for Topics 6 and 3 appear to be skewed by a few long words within each topic.

Topic coherence - According to the topic coherence score, the words in Topics 9, 5 and 13 make the most sense when put together. Topics 10, 3 and 2 have the lowest coherence scores and, as pointed out in the visual evaluation, they make the least sense.

Topic exclusivity - The topic exclusivity score measures how specific a topic’s top terms are to that topic rather than shared with other topics, so higher scores indicate more distinct topics. Topics 7, 9, and 2 have the highest scores, and the content of Topics 7 and 9 is indeed easy to interpret, although Topic 2 is not. Topics 3, 5, and 12 have the lowest exclusivity scores, with Topic 5’s low exclusivity contrasting with its high coherence score.

Highest quality topics - Judging from the diagnostics, it appears that the highest quality topics are 9 and 7. This assumption is supported by the visual evaluation: when reading the words within these topics, it is also easy to determine the sentiment, which is negative.

Lowest quality topics - The diagnostics indicate that the lowest quality topics are 3 and 10. Again, this assumption is supported by the visual evaluation, as it is difficult to determine the meaning or sentiment behind each of these topics.
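
One way to make the quality judgement above reproducible is to rank topics on coherence and exclusivity and average the two ranks. This is only a sketch: averaging ranks is an assumption made here, not the method used in the report. The figures are copied from the Topic Diagnostics table for five of the topics discussed.

```r
library(dplyr)

# Excerpt of the Topic Diagnostics table (topics 2, 3, 7, 9 and 10)
quality_excerpt <- tibble(
  topic_num = c(2, 3, 7, 9, 10),
  topic_coherence = c(-170.1142, -171.5748, -150.2017, -122.2372, -187.9780),
  topic_exclusivity = c(9.971655, 9.906536, 9.989497, 9.984285, 9.943322)
)

# Higher (less negative) coherence and higher exclusivity rank first
ranked_topics <- quality_excerpt %>%
  mutate(
    coherence_rank = min_rank(desc(topic_coherence)),
    exclusivity_rank = min_rank(desc(topic_exclusivity)),
    mean_rank = (coherence_rank + exclusivity_rank) / 2
  ) %>%
  arrange(mean_rank)

ranked_topics
```

Within this excerpt, Topics 7 and 9 share the best mean rank and Topics 3 and 10 share the worst, which agrees with the conclusions drawn above.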

3.3 Recommendations

The topic modeling analysis revealed interesting insights which give McDonald’s an idea of where to focus its efforts across restaurants.

The majority of its efforts should be focused on staff training and on hiring quality staff. Strengthening the interview screening process or improving employee benefits could perhaps attract better quality staff. Some key issues which could be resolved with adequate training are listed below:

1. Unhelpful and/or rude staff

2. Incorrect orders

3. Cold food items being served

4. Poor queue management