Assignment
Airbnb is an online vacation rental marketplace serving a community of hosts and travellers. The diagram below shows how Airbnb grew from two individuals who could not pay their rent in 2007 into a company that reached a US$10 billion valuation by 2014. In 2020, Airbnb went public at a valuation of up to US$47 billion.
According to Airbnb, it has millions of listings in over 220 countries and regions across 100,000 cities. The data generated provides rich information, including structured data (e.g. price and location) and unstructured data (e.g. reviews and listing descriptions). While statistical and analytic tools are available to derive insights from these data, they are often subscription-based and require technical knowledge, which puts them out of reach for many users. Hence, this project aims to develop a concise, interactive, and user-friendly interface using R Shiny, so that data-based decisions can be made from an interactive GUI. The R Shiny app will cover exploratory data analysis, confirmatory data analysis, text mining, and predictive analysis.
This assignment is a sub-module of our final Shiny-based Visual Analytics Application (Shiny-VAA). In particular, it focuses on text mining using various R packages. The process is shown below:
Our application can be used from both the perspective of hosts and guests.
Hosts: In 2014, Airbnb launched the Superhost programme to reward hosts with outstanding hospitality. Compared with regular hosts, a Superhost gains more visibility, higher earnings, and exclusive rewards. To become a Superhost, a host must meet these criteria:
- 4.8 or higher overall rating based on reviews
- Completed at least 10 stays in the past year, or 100 nights over at least 3 completed stays
- Less than 1% cancellation rate, not including extenuating circumstances
- Responds to 90% of new messages within 24 hours
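The four criteria above can be expressed as a single logical check. The sketch below is illustrative only: the function name is_superhost_eligible() and the input columns (rating on a 5-point scale, stays, nights, cancel_rate, response_rate) are hypothetical and do not correspond to columns in the InsideAirbnb files, and the three hosts are made-up examples.

```r
# Hypothetical helper: TRUE when all four Superhost criteria listed above are met.
# Rates are assumed to be proportions (0.01 = 1%), rating is on a 5-point scale.
is_superhost_eligible <- function(rating, stays, nights, cancel_rate, response_rate) {
  # Either 10+ stays, or 100+ nights spread over at least 3 completed stays
  enough_activity <- stays >= 10 | (nights >= 100 & stays >= 3)
  rating >= 4.8 & enough_activity & cancel_rate < 0.01 & response_rate >= 0.90
}

# Made-up hosts for illustration
hosts <- data.frame(
  host          = c("A", "B", "C"),
  rating        = c(4.9, 4.7, 4.8),
  stays         = c(12, 30, 4),
  nights        = c(40, 90, 120),
  cancel_rate   = c(0, 0, 0.005),
  response_rate = c(0.95, 1, 0.90)
)

hosts$eligible <- with(
  hosts,
  is_superhost_eligible(rating, stays, nights, cancel_rate, response_rate)
)
print(hosts)
```

Host B fails on rating alone, while host C qualifies through the "100 nights over at least 3 stays" route rather than the 10-stay route.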
Guests: With over 60,000 members and 6,000 properties listed on the Airbnb website, choosing the right space can be a dilemma for guests. The modules in our dashboard allow both types of users to analyse Airbnb data according to their needs.
InsideAirbnb provides tools and data for users to explore Airbnb. We will be using the following files:
- listings.csv.gz: This dataset consists of 74 variables and 4256 data points.
- reviews.csv.gz: This dataset provides 6 variables and 52368 data points.
While the team has decided to use the latest set of data compiled on 27 January 2021, this report uses data compiled on 29 December 2020 for completeness.
Conducting a literature review on how the analyses were performed before. The focus should be on identifying gaps where an interactive web approach and visual analytics techniques can be used to enhance the user experience of these analysis techniques.
Airbnb data has been widely used for text mining with tools such as Python and R. In Python, the [Natural Language Toolkit (NLTK)](https://www.nltk.org/) has easy-to-use interfaces to over 50 corpora and lexical resources, as well as a wide range of text processing libraries for tokenisation, stemming, classification, etc. Similarly, R has extensive packages such as tidyverse and Shiny, which allow for text mining and the building of interactive dashboards.
Zhang (2019) used text mining approaches, including content analysis and topic modelling with the Latent Dirichlet Allocation (LDA) method, to examine over 1 million Airbnb reviews across 50,933 listings in the United States of America (USA). Kiatkawsin, Sutherland & Kim (2020) also used the LDA method to compare reviews between Hong Kong and Singapore. However, these articles do not provide visualisations of the methods used and are not interactive.
Kim’s Shiny Airbnb App provided an interactive dashboard for Exploratory Data Analysis (EDA), but left out reviews. [Ankit Pandey](https://github.com/ankit2web/Twitter-Sentiment-Analysis-using-R-Shiny-WebApp) provided a more comprehensive text analytics dashboard using word clouds and sentiment polarity, but it offers little interactivity.
To address the gaps above, the next section outlines the steps:
Extracting, wrangling and preparing the input data required to perform the analysis. The focus should be on exploring appropriate tidyverse methods.
runtime: shiny was added to the YAML header to allow dynamic (interactive) documents. The {r} header of each code chunk can carry chunk options that control how the chunk is rendered into different output formats. echo=TRUE is set so that the code itself is printed when the document is rendered. More details can be found in the R Markdown documentation.
To install any missing packages and load the libraries, run the following code chunk:
# Packages used in this report (duplicate entries removed)
packages <- c("tidyverse", "sf", "tmap", "crosstalk", "leaflet", "RColorBrewer", "ggplot2", "rgdal", "rgeos", "raster", "maptools", "tmaptools", "shiny", "tidytext", "wordcloud", "wordcloud2", "tm", "ggthemes", "igraph", "ggmap", "DT", "reshape2", "ggraph", "topicmodels", "quanteda", "DataExplorer")
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The read_csv() function takes the path to the file to read and prints a column specification giving the name and type of each column. As there are unnecessary columns, the select() function is used to retain only the columns needed in subsequent analysis:
- The reviews file contains 52367 observations with 6 variables; 2 columns (listing_id and comments) are retained.
- The listings file contains 4255 observations with 74 variables; 33 columns are retained.
reviews <- read_csv("C:/Users/joeyc/blog/_posts/2021-03-29-assignment/reviews.csv") %>%
  dplyr::select(listing_id, comments)
listings <- read_csv("C:/Users/joeyc/blog/_posts/2021-03-29-assignment/listings.csv") %>%
  rename(listing_id = id) %>%
  dplyr::select(-c(listing_url, scrape_id, last_scraped, name, picture_url, host_url, host_about, host_thumbnail_url, host_picture_url,
    host_listings_count, host_verifications, calendar_updated, first_review, last_review, license, neighborhood_overview,
    description, host_total_listings_count, host_has_profile_pic, availability_30, availability_60, availability_90, availability_365,
    calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms,
    reviews_per_month, minimum_nights, maximum_nights, minimum_minimum_nights, maximum_minimum_nights, minimum_maximum_nights, maximum_maximum_nights,
    number_of_reviews_ltm, number_of_reviews_l30d, minimum_nights_avg_ntm, maximum_nights_avg_ntm, calendar_last_scraped, has_availability, instant_bookable))
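The two selection styles used above (keeping named columns versus dropping them with a negated selection) can be tried on a small stand-in file. This is a minimal sketch: the temporary CSV, its rows, and its column names are made up for illustration, and show_col_types = FALSE assumes readr 2.0 or later.

```r
library(readr)
library(dplyr)

# Write a tiny stand-in for reviews.csv to a temporary file (made-up rows)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "listing_id,id,date,reviewer_id,reviewer_name,comments",
  "49091,1,2020-01-05,11,Ann,Great host",
  "50646,2,2020-02-10,12,Ben,Clean room"
), tmp)

# Style 1: keep only the named columns
# (show_col_types = FALSE silences the column specification message)
keep_demo <- read_csv(tmp, show_col_types = FALSE) %>%
  dplyr::select(listing_id, comments)

# Style 2: drop the unwanted columns with a negated selection
drop_demo <- read_csv(tmp, show_col_types = FALSE) %>%
  dplyr::select(-c(id, date, reviewer_id, reviewer_name))

print(names(keep_demo))
print(names(drop_demo))
```

Both styles give the same two columns here. Naming the columns to keep is shorter when few are retained, while the negated form is shorter when few are dropped.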
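right_join() is used to merge the reviews and listings files so that all rows from listings are returned. Listings without any review are kept as a single row with NA in comments, while a listing with several reviews appears once per review.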
data <- right_join(reviews,listings,by="listing_id")
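The behaviour of right_join() is easier to see on a toy example. The sketch below uses two made-up tibbles (reviews_demo and listings_demo are hypothetical names, not the real files) to show which rows survive the join.

```r
library(dplyr)

# Made-up reviews: listing 1 has two reviews, listing 9 is not in the listings table
reviews_demo <- tibble(
  listing_id = c(1, 1, 2, 9),
  comments   = c("Great stay", "Very clean", "Good location", "Orphan review")
)

# Made-up listings: listing 3 has no review
listings_demo <- tibble(
  listing_id = c(1, 2, 3),
  room_type  = c("Private room", "Entire home/apt", "Shared room")
)

# Keep every row of listings_demo; unmatched reviews are dropped
joined <- right_join(reviews_demo, listings_demo, by = "listing_id")
print(joined)
```

Listing 3 is retained with NA in comments, and the review for listing 9 is discarded because it has no matching listing.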
To write the merged data to CSV for future use, uncomment the write.csv() line below by removing the leading hashtag (#).
#write.csv(data,"data.csv")
glimpse(data)
Rows: 54,074
Columns: 34
$ listing_id <dbl> 49091, 50646, 50646, 50646, 506~
$ comments <chr> "Fran was absolutely gracious a~
$ host_id <dbl> 266763, 227796, 227796, 227796,~
$ host_name <chr> "Francesca", "Sujatha", "Sujath~
$ host_since <date> 2010-10-20, 2010-09-08, 2010-0~
$ host_location <chr> "Singapore", "Singapore, Singap~
$ host_response_time <chr> "within a few hours", "a few da~
$ host_response_rate <chr> "100%", "0%", "0%", "0%", "0%",~
$ host_acceptance_rate <chr> "N/A", "N/A", "N/A", "N/A", "N/~
$ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FAL~
$ host_neighbourhood <chr> "Woodlands", "Bukit Timah", "Bu~
$ host_identity_verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, T~
$ neighbourhood <chr> NA, "Singapore, Singapore", "Si~
$ neighbourhood_cleansed <chr> "Woodlands", "Bukit Timah", "Bu~
$ neighbourhood_group_cleansed <chr> "North Region", "Central Region~
$ latitude <dbl> 1.44255, 1.33235, 1.33235, 1.33~
$ longitude <dbl> 103.7958, 103.7852, 103.7852, 1~
$ property_type <chr> "Private room in apartment", "P~
$ room_type <chr> "Private room", "Private room",~
$ accommodates <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, NA, NA,~
$ bathrooms_text <chr> "1 bath", "1 bath", "1 bath", "~
$ bedrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ beds <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ amenities <chr> "[\"Washer\", \"Elevator\", \"L~
$ price <chr> "$80.00", "$80.00", "$80.00", "~
$ number_of_reviews <dbl> 1, 18, 18, 18, 18, 18, 18, 18, ~
$ review_scores_rating <dbl> 94, 91, 91, 91, 91, 91, 91, 91,~
$ review_scores_accuracy <dbl> 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, ~
$ review_scores_cleanliness <dbl> 10, 10, 10, 10, 10, 10, 10, 10,~
$ review_scores_checkin <dbl> 10, 10, 10, 10, 10, 10, 10, 10,~
$ review_scores_communication <dbl> 10, 10, 10, 10, 10, 10, 10, 10,~
$ review_scores_location <dbl> 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
$ review_scores_value <dbl> 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
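The output above shows that price and host_response_rate are stored as character because of the "$" and "%" symbols. One possible way to convert them is readr's parse_number(). This is a sketch on made-up values, not a step taken in this report; the value "$1,250.00" is hypothetical and is included only to show how a thousands separator is handled.

```r
library(dplyr)
library(readr)

# Made-up values mimicking the formats seen in the glimpse() output
demo <- tibble(
  price              = c("$80.00", "$1,250.00"),
  host_response_rate = c("100%", "N/A")
)

demo_clean <- demo %>%
  mutate(
    # Strip the currency symbol and grouping mark, keep the number
    price = parse_number(price),
    # Treat "N/A" as missing, then convert percent to a proportion
    host_response_rate = parse_number(host_response_rate, na = c("", "NA", "N/A")) / 100
  )

print(demo_clean)
```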
glimpse() does not present the data in a tabular format, hence datatable() from the DT package and kable() from the knitr package were considered. However:
- datatable() does not work well with the FixedColumns, FixedHeader and Scroller extensions when coupled with Shiny. Hence, these specific functionalities are excluded.
- kable() is not up to date with the current version of R and was not used.
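A basic interactive table without any extensions might look like the sketch below. The data frame is a made-up three-column stand-in for the merged data, and the chosen options (pageLength, scrollX, column filters) are illustrative assumptions rather than the settings used in the final application.

```r
library(DT)

# Made-up rows standing in for the merged data
demo <- data.frame(
  listing_id = c(49091, 50646),
  room_type  = c("Private room", "Private room"),
  price      = c("$80.00", "$80.00")
)

# Plain datatable: no FixedColumns, FixedHeader or Scroller extensions
tbl <- datatable(
  demo,
  rownames = FALSE,
  filter   = "top",                               # per-column filter boxes
  options  = list(pageLength = 5, scrollX = TRUE) # horizontal scroll for wide tables
)
tbl
```

Inside a Shiny app, the same call would typically be wrapped in DT::renderDT() on the server side and displayed with DT::DTOutput() in the UI.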