#####Setting Up#####
library(tidyverse)
library(tidytext)
library(topicmodels)
library(topicdoc)
library(wordcloud)
library(kableExtra)
hotel <- read_csv("hawaiian_hotel_reviews.csv")
Text & Sentiment Analysis
1 Introduction
This document covers text & sentiment analysis of a Hawaiian hotel’s reviews, as well as topic modelling analysis of McDonald’s reviews from across the USA.
2 Hawaiian Hotel Reviews Text & Sentiment Analysis
2.1 Top 20 most common words
The top 20 words that appear in the hotel’s reviews are shown below. However, without context, these words carry little meaning.
data(stop_words)
# Tokenise reviews into words, drop stop words and keep the 20 most frequent
hotel_counts <- hotel %>%
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  top_n(20)
# fill is set outside aes() so the bars are actually drawn in #0099F8
ggplot(hotel_counts) +
  geom_col(mapping = aes(x = n, y = reorder(word, n)), fill = "#0099F8") +
  labs(y = NULL) +
  theme_light() +
  theme(legend.position = "none")
2.2 Top 10 positive & negative sentiment words
Performing sentiment analysis makes the picture slightly clearer. The 10 most commonly occurring positive & negative words are shown in the tables below. However, these counts still lack context: a simple negation such as “not” in front of a word can reverse its sentiment, so they should not be read as an exact measure of opinion.
sentiments <- get_sentiments("bing")
# Keep only the words that appear in the Bing sentiment lexicon
hotel_sentiments <- hotel %>%
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  inner_join(sentiments)
table1 <- hotel_sentiments %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  top_n(10)
table2 <- hotel_sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE) %>%
  top_n(10)
knitr::kable(table1,
  align = "lr",
  caption = "Top 10 Words Associated With Positivity",
  col.names = c("Word", "Count"),
  table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
  row_spec(0, bold = TRUE, color = "white", background = "black") %>%
  column_spec(1, bold = TRUE, color = "#0099F8")
knitr::kable(table2,
  align = "lr",
  caption = "Top 10 Words Associated With Negativity",
  col.names = c("Word", "Count"),
  table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
  row_spec(0, bold = TRUE, color = "white", background = "black") %>%
  column_spec(1, bold = TRUE, color = "#0099F8")
Top 10 Words Associated With Positivity
Word | Count |
---|---|
nice | 7274 |
clean | 3574 |
beautiful | 3560 |
friendly | 2753 |
free | 2563 |
recommend | 2355 |
loved | 2052 |
amazing | 1940 |
helpful | 1898 |
enjoyed | 1867 |
Top 10 Words Associated With Negativity
Word | Count |
---|---|
expensive | 2809 |
crowded | 2450 |
bad | 1147 |
complex | 1011 |
pricey | 835 |
noise | 790 |
disappointed | 769 |
hard | 729 |
cheap | 575 |
overpriced | 572 |
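The negation caveat in this section can be checked directly. The sketch below is a minimal base R illustration, not part of the original analysis: the three reviews, the short word lists standing in for the Bing lexicon, and the `flag_negated` helper are all hypothetical. It flags every sentiment word whose immediately preceding word is a negator.

```r
# Hypothetical toy reviews; the real data lives in hotel$review.
reviews <- c("The room was not clean and the staff were not friendly",
             "Clean room, friendly staff, nice pool",
             "Not bad at all, the view was beautiful")

# Small hand-picked word lists standing in for the Bing lexicon (assumption).
positive <- c("clean", "friendly", "nice", "beautiful")
negative <- c("bad")
negators <- c("not", "no", "never")

# Lower-case each review and split it into word tokens.
tokens <- strsplit(tolower(reviews), "[^a-z']+")

# Hypothetical helper: return the sentiment words of one review and
# whether each is immediately preceded by a negator.
flag_negated <- function(words) {
  words <- words[words != ""]
  previous <- c("", head(words, -1))        # word just before each token
  hit <- words %in% c(positive, negative)   # sentiment-bearing tokens
  data.frame(word = words[hit],
             negated = previous[hit] %in% negators,
             stringsAsFactors = FALSE)
}

negation_check <- do.call(rbind, lapply(tokens, flag_negated))

# Cross-tabulate each sentiment word against whether it was negated.
negated_counts <- table(negation_check$word, negation_check$negated)
print(negated_counts)
```

On the real data, a high share of negated occurrences for a word such as “clean” would suggest that its count in the positive table overstates positive sentiment.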
2.3 Sentiment change over time
The chart below shows how positive & negative sentiment has changed over time, with reviews grouped into blocks of 150 consecutive ids. There is far more positive than negative sentiment; however, positive sentiment also shows the larger decrease across blocks. This suggests that customers may be growing more dissatisfied with the hotel.
hotel_sentiments <- hotel %>%
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  inner_join(sentiments)
# Group reviews into blocks of 150 consecutive ids
hotel_sentiments <- mutate(hotel_sentiments, block = id %/% 150)
hotel_blocks <- hotel_sentiments %>%
  group_by(block) %>%
  count(sentiment)
# A single y label is kept, and theme_light() is now joined to the plot with +
ggplot(hotel_blocks) +
  geom_col(mapping = aes(x = block, y = n), fill = "#0099F8") +
  facet_wrap(~ sentiment, nrow = 1) +
  xlab("") +
  ylab("# Sentiments") +
  theme_light()
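The blocking step relies on integer division, and raw counts per block can fall simply because a block holds fewer or shorter reviews. The sketch below is a minimal base R illustration on hypothetical toy data, not the hotel data. It shows how `%/%` assigns ids to blocks and how a share of positive words per block could be computed as a complement to the raw counts. If ids start at 1, block 0 covers only ids 1 to 149.

```r
# Hypothetical toy data: one row per sentiment-bearing word,
# in the same shape as hotel_sentiments.
toy <- data.frame(
  id = c(1, 75, 149, 150, 200, 299, 300, 310, 449),
  sentiment = c("positive", "positive", "negative",
                "positive", "negative", "negative",
                "positive", "positive", "positive")
)

# Integer division: ids below 150 fall in block 0, 150-299 in block 1, and so on.
toy$block <- toy$id %/% 150

# Counts per block and sentiment (the quantity plotted in the chart).
counts <- table(toy$block, toy$sentiment)

# Share of positive words per block, which does not depend on block size.
positive_share <- counts[, "positive"] / rowSums(counts)
print(round(positive_share, 2))
```

If the share of positive words stays flat while both raw counts fall, the decline in the chart would reflect less review text per block, not growing dissatisfaction.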
2.4 Most commonly occurring bigrams
Below is a table of the most commonly occurring bigrams within the reviews. Customers frequently mention landmarks such as “Rainbow Tower” or “Waikiki Beach”. The hotel’s amenities, such as “Private Pool” or “Breakfast Buffet”, are also commonly mentioned.
# Split reviews into bigrams and drop any bigram containing a stop word
hotel_bigrams <- hotel %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")
table3 <- count(hotel_bigrams, bigram, sort = TRUE) %>%
  top_n(30)
knitr::kable(table3,
  align = "lr",
  caption = "Top 30 Bigrams",
  col.names = c("Bigram", "Count"),
  table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
  row_spec(0, bold = TRUE, color = "white", background = "black") %>%
  column_spec(1, bold = TRUE, color = "#0099F8")
Bigram | Count |
---|---|
rainbow tower | 3567 |
hawaiian village | 2909 |
hilton hawaiian | 2821 |
ocean view | 2332 |
diamond head | 2180 |
waikiki beach | 1710 |
tapa tower | 1625 |
ali'i tower | 1583 |
front desk | 1328 |
resort fee | 992 |
walking distance | 973 |
friday night | 934 |
abc store | 913 |
ala moana | 894 |
kalia tower | 894 |
hilton honors | 714 |
ocean front | 648 |
head tower | 581 |
highly recommend | 580 |
abc stores | 539 |
super pool | 517 |
minute walk | 485 |
alii tower | 476 |
tropics bar | 458 |
customer service | 449 |
partial ocean | 437 |
private pool | 424 |
north shore | 422 |
breakfast buffet | 412 |
moana shopping | 390 |
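The table above lists some bigrams under more than one spelling (“ali'i tower” and “alii tower”, “abc store” and “abc stores”), which splits their counts. The sketch below is a minimal base R illustration of merging such variants. The counts are copied from the table, while the `canonical` mapping is a hypothetical choice made for this example. In the actual pipeline the same normalisation would more naturally be applied to the review text before tokenising.

```r
# Counts copied from the bigram table above.
bigram_counts <- c("ali'i tower" = 1583, "alii tower" = 476,
                   "abc store" = 913, "abc stores" = 539,
                   "rainbow tower" = 3567)

# Hypothetical mapping from variant spellings to one canonical form.
canonical <- c("alii tower" = "ali'i tower", "abc stores" = "abc store")

# Replace each variant label with its canonical form.
labels <- names(bigram_counts)
to_fix <- labels %in% names(canonical)
labels[to_fix] <- canonical[labels[to_fix]]

# Sum the counts that now share a label and sort in descending order.
merged <- sort(tapply(bigram_counts, labels, sum), decreasing = TRUE)
print(merged)
```

With the variants merged, “ali'i tower” rises to 2059 mentions and “abc store” to 1452, so both would sit higher in the ranking than their individual rows suggest.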
2.5 Most commonly occurring trigrams
Looking at the most commonly occurring trigrams reveals more information. Once again, the majority of trigrams mention landmarks such as “Diamond Head Tower” or “Hilton Hawaiian Village”. Combined with trigrams such as “Easy Walking Distance” or “10 Minute Walk”, this potentially indicates that the hotel is in a good location. However, the third most common trigram, “Partial Ocean View”, is slightly worrying, as it potentially indicates a pain point for customers.
# Split reviews into trigrams and drop any trigram containing a stop word
hotel_trigrams <- hotel %>%
  unnest_tokens(trigram, review, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")
table4 <- count(hotel_trigrams, trigram, sort = TRUE) %>%
  top_n(30)
knitr::kable(table4,
  align = "lr",
  caption = "Top 30 Trigrams",
  col.names = c("Trigram", "Count"),
  table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = F) %>%
  row_spec(0, bold = TRUE, color = "white", background = "black") %>%
  column_spec(1, bold = TRUE, color = "#0099F8")
Trigram | Count |
---|---|
hilton hawaiian village | 2614 |
diamond head tower | 575 |
partial ocean view | 389 |
ala moana shopping | 365 |
friday night fireworks | 358 |
round table pizza | 205 |
moana shopping centre | 171 |
ala moana mall | 147 |
front desk staff | 144 |
10 minute walk | 137 |
hawaiian village waikiki | 136 |
15 minute walk | 135 |
moana shopping center | 126 |
wailana coffee house | 121 |
day resort fee | 115 |
village waikiki beach | 109 |
waikiki beach resort | 108 |
hawaiian village resort | 95 |
rainbow tower ocean | 90 |
2 double beds | 89 |
king size bed | 88 |
facing diamond head | 85 |
tropics bar grill | 80 |
hilton hawaiin village | 79 |
diamond head view | 75 |
daily resort fee | 70 |
hawaii 5 0 | 70 |
30 resort fee | 69 |
easy walking distance | 69 |
ice cream shop | 69 |
2.6 Summarising reviews containing various key terms
2.6.1 Lagoon reviews
Reviews containing the word “Lagoon” gave valuable insight into the hotel & resort. Customers reported that staff were very friendly and helpful, and that freebies & upgrades were common. Holidaymakers stay in one of several towers, which have beautiful views & each has its own super pool. The resort is huge & has everything a vacationer needs - they wouldn’t even have to leave the resort if they didn’t want to. The hotel sits on the ocean-side, however the beach is overcrowded. There are lots of shops nearby, with the ABC store being the main choice for snacks and drinks.
lagoon_reviews <- filter(hotel, str_detect(review, regex("lagoon", ignore_case = TRUE)))
write_csv(lagoon_reviews, "lagoon_reviews.csv")
2.6.2 Rainbow tower reviews
Reviews containing the words “Rainbow Tower” were quite similar, but added new pieces of information. Again, the views from the towers were reported as beautiful and the staff as friendly; the rooms were good, but nothing out of the ordinary. There are also many restaurants in the surrounding area & the hotel/resort grounds are beautiful. However, it was reported that everything is expensive & the hotel charges for every single thing.
rainbowtower_reviews <- filter(hotel, str_detect(review, regex("rainbow tower", ignore_case = TRUE)))
write_csv(rainbowtower_reviews, "rainbow_reviews.csv")
2.6.3 Ala Moana shopping reviews
Reviews containing the words “Ala Moana Shopping” uncovered even more information. They repeat what is reported in 2.6.1 & 2.6.2 regarding views, service and rooms; however, they also reveal that the hotel is close to a big shopping center called Ala Moana. The shopping center contains an expansive food court as well as many stores, including various designer stores. Reviewers mention that the street outside the hotel is very busy & that hotel food/drinks are a “complete rip off”.
alamoanashopping_reviews <- filter(hotel, str_detect(review, regex("ala moana shopping", ignore_case = TRUE)))
write_csv(alamoanashopping_reviews, "ala_moana_shopping.csv")
2.7 Word clouds
Below are word clouds showcasing the most common positive & negative terms. The positive word cloud showcases words which appear over 1000 times whilst the negative word cloud showcases words which appear over 500 times.
Positive
# Count each sentiment word once across all reviews
hotel_sentiments2 <- hotel %>%
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(word, sentiment, sort = TRUE)
hotel_pos_sentiments <- filter(hotel_sentiments2, sentiment == "positive")
wordcloud(hotel_pos_sentiments$word,
  hotel_pos_sentiments$n,
  min.freq = 1000,
  colors = brewer.pal(8, "Dark2"))
Negative
hotel_neg_sentiments <- filter(hotel_sentiments2, sentiment == "negative")
wordcloud(hotel_neg_sentiments$word,
  hotel_neg_sentiments$n,
  min.freq = 500,
  colors = brewer.pal(8, "Dark2"))
3 McDonald’s Reviews Topic Modeling Analysis
An LDA model with 14 topics was fitted to McDonald’s restaurant reviews using the collapsed Gibbs sampling method. A grid of bar charts showing the top 10 terms in each topic is seen below.
##a - Build the document-term matrix##
mcds <- read_csv("mcdonalds_reviews.csv")
data(stop_words)
mcds_count <- mcds %>%
  unnest_tokens(word, review, token = "words") %>%
  anti_join(stop_words) %>%
  count(id, word, sort = TRUE)
mcds_dtm <- cast_dtm(mcds_count, id, word, n)
##b - Fit the LDA model and extract the top 10 terms per topic##
mcds_lda <- LDA(mcds_dtm, method = "Gibbs", k = 14, control = list(seed = 1234))
mcds_lda_beta <- tidy(mcds_lda)
mcds_lda_top_terms <- mcds_lda_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)
##c - This includes the Visual evaluation and Numeric Evaluation. Not labelled to make report look cleaner##
mcds_lda_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each LDA topic", x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 6, scales = "free")
3.1 Visual evaluation
Below is a summary of each topic.
Topic 1 - Appears to describe something to do with the drive-through.
Topic 2 - Something related to the manager being called to change a meal. Does not seem coherent.
Topic 3 - Describes an experience with the staff. Does not seem coherent.
Topic 4 - Possibly describes the ice cream machine running 24 hours, night or day.
Topic 5 - Seems to describe a drink order being wrong twice.
Topic 6 - Something related to homeless people around the restaurant, potentially making eating unpleasant.
Topic 7 - This topic is negative. It is quite coherent and describes bad customer service, with rude employees and dirty premises.
Topic 8 - Describes the McDonald’s breakfast menu. Does not provide anything actionable.
Topic 9 - This topic covers long waiting times, queues and multiple people waiting.
Topic 10 - This topic does not seem coherent, and describes something to do with the location of the store.
Topic 11 - This topic seems positive. It appears to describe clean premises and friendly staff.
Topic 12 - Appears to be about the kids’ area within the restaurant, and possibly the tables being clean.
Topic 13 - This topic seems negative, mentioning various menu items being cold.
Topic 14 - This topic seems to be centered around fast food being quick and cheap, but of poor quality.
3.2 Numeric evaluation
A table with various diagnostic figures is featured below. Using the figures in this table, the quality of each topic can be assessed.
topic_quality <- topic_diagnostics(mcds_lda, mcds_dtm)
knitr::kable(topic_quality,
  align = "lr",
  caption = "Topic Diagnostics",
  table.attr = 'data-quarto-disable-processing = "false"') %>%
  kable_styling(full_width = T) %>%
  row_spec(0, bold = TRUE, color = "white", background = "black") %>%
  column_spec(1, bold = TRUE, color = "#0099F8")
topic_num | topic_size | mean_token_length | dist_from_corpus | tf_df_dist | doc_prominence | topic_coherence | topic_exclusivity |
---|---|---|---|---|---|---|---|
1 | 570.5098 | 4.0 | 0.6040340 | 4.617374 | 11 | -144.1279 | 9.956960 |
2 | 663.8371 | 5.2 | 0.6142948 | 2.460084 | 8 | -170.1142 | 9.971655 |
3 | 654.1112 | 5.9 | 0.6027599 | 2.612595 | 5 | -171.5748 | 9.906536 |
4 | 676.0556 | 4.2 | 0.6160793 | 2.406061 | 7 | -150.8710 | 9.957347 |
5 | 591.0666 | 4.6 | 0.6057051 | 3.801706 | 6 | -142.9228 | 9.908269 |
6 | 610.0399 | 6.0 | 0.6147538 | 5.246863 | 4 | -168.4202 | 9.938810 |
7 | 558.3095 | 6.1 | 0.6182592 | 2.512411 | 1 | -150.2017 | 9.989497 |
8 | 622.4032 | 5.1 | 0.6212878 | 4.021260 | 19 | -144.5067 | 9.956234 |
9 | 548.5725 | 5.1 | 0.6182959 | 3.293309 | 9 | -122.2372 | 9.984285 |
10 | 648.2789 | 5.0 | 0.6047939 | 2.824690 | 6 | -187.9780 | 9.943322 |
11 | 630.1304 | 5.2 | 0.6158529 | 3.249813 | 2 | -169.4295 | 9.945004 |
12 | 672.2095 | 4.7 | 0.6116022 | 2.242199 | 9 | -169.6033 | 9.911122 |
13 | 583.0784 | 5.2 | 0.6197754 | 3.761984 | 22 | -144.0123 | 9.953895 |
14 | 604.3973 | 4.6 | 0.6136522 | 4.301984 | 2 | -163.0979 | 9.931259 |
Topic size - The top 3 topics by size are Topics 4, 12 and 2. These topics are generally positive or neutral, which could be an indication that customers are generally satisfied with the restaurants. The 3 smallest topics are Topics 9, 7 and 1, which are all negative. This is good news for McDonald’s, as it suggests negative reviews are not as prevalent.
Mean token length - Topics 7, 6 and 3 have the largest average word lengths. Interestingly, it appears that Topics 6 and 3 have their averages skewed by a couple of long words appearing in the topic.
Topic coherence - According to the topic coherence score, the words in Topics 9, 5 and 13 make the most sense when put together. Topics 10, 3 and 2 have the lowest coherence scores and, as pointed out in the visual evaluation, they make the least sense.
Topic exclusivity - The topic exclusivity score reflects how specific a topic’s top terms are to that topic, which tends to make the topic more distinct and easier to interpret. Topics 7, 9 and 2 have the highest scores, and it is evident from the content of these topics that the meaning is easy to interpret, except for Topic 2. Topics 3, 5 and 12 have the lowest exclusivity scores, with Topic 5 contradicting its high coherence score.
Highest quality topics - Judging from the diagnostics, the highest quality topics appear to be 9 and 7. This is supported by the visual evaluation: when reading the words within these topics, it is easy to determine the sentiment, which is negative.
Lowest quality topics - The diagnostics indicate that the lowest quality topics are 3 and 10. Again, this is supported by the visual evaluation: it is difficult to determine the meaning or sentiment behind either of these topics.
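The quality judgement above combines coherence and exclusivity informally. The sketch below is one possible way to make that combination explicit in base R, not the method used in the report. The values are copied from the diagnostics table, and the equally weighted mean of the two ranks is a hypothetical scoring rule chosen for illustration.

```r
# Coherence and exclusivity values copied from the diagnostics table above.
diagnostics <- data.frame(
  topic_num = 1:14,
  topic_coherence = c(-144.1279, -170.1142, -171.5748, -150.8710, -142.9228,
                      -168.4202, -150.2017, -144.5067, -122.2372, -187.9780,
                      -169.4295, -169.6033, -144.0123, -163.0979),
  topic_exclusivity = c(9.956960, 9.971655, 9.906536, 9.957347, 9.908269,
                        9.938810, 9.989497, 9.956234, 9.984285, 9.943322,
                        9.945004, 9.911122, 9.953895, 9.931259)
)

# Rank 1 = best on each metric (higher coherence and exclusivity are better).
diagnostics$coherence_rank <- rank(-diagnostics$topic_coherence)
diagnostics$exclusivity_rank <- rank(-diagnostics$topic_exclusivity)

# Hypothetical combined score: the mean of the two ranks, equally weighted.
diagnostics$mean_rank <-
  (diagnostics$coherence_rank + diagnostics$exclusivity_rank) / 2

# Order topics from best to worst combined rank.
ranked <- diagnostics[order(diagnostics$mean_rank), c("topic_num", "mean_rank")]
print(ranked)
```

Under this particular weighting, Topics 9 and 7 come out on top and Topic 3 at the bottom, in line with the assessment above, while Topics 10 and 12 tie for second-worst. A different weighting could reorder the middle of the table.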
3.3 Recommendations
The topic modeling analysis revealed interesting insights which give McDonald’s an idea of where to focus its efforts across restaurants.
The majority of its efforts should be focused on staff training and on hiring quality staff. Perhaps strengthening the interview screening process or improving employee benefits could attract better quality staff. Some key issues which could be resolved with adequate training are listed below:
1. Unhelpful and/or rude staff
2. Incorrect orders
3. Cold food items being served
4. Poor queue management