Thinkful

Using a dataset from Kaggle, I cleaned and processed hotel reviews, identified the most common words in them, and built models to classify each review as positive or negative.  I trained LogisticRegressionCV, Random Forest Classifier, and Gradient Boosting models; the best one classified reviews with 88% accuracy.

Edit for more information:
My goal for this project was to build something useful for a business and to try working with text rather than only numeric data.  First, I explored the data and created several visualizations to see what it looked like.  The dataset covers more than 1,300 hotels in over 800 cities in the United States, which seemed a broad enough mix to give a representative look at reviews.

Next, I needed to decide how to categorize the reviews.  They were rated on a 1-5 scale (1 being a negative experience and 5 a very positive one), so I put the 4s and 5s in the "Positive" category and the 1s, 2s, and 3s in the "Negative" category.  With more data points (the dataset had 10,000 rows) I would have liked to classify the reviews by their individual rating, but given the data I had, I made it a binary classification.
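The labeling rule above can be sketched in a few lines.  This is a minimal illustration, not the project's actual code: the sample reviews and the name label_review are made up for the example.

```python
# Hypothetical sample of (rating, review text) pairs; not taken from the real dataset.
reviews = [
    (5, "Wonderful stay, would come back"),
    (4, "Nice room and friendly staff"),
    (3, "It was okay"),
    (2, "Bathroom was dirty"),
    (1, "Terrible service"),
]

def label_review(rating):
    """Map a 1-5 rating to a binary label: 4s and 5s are Positive, 1s-3s are Negative."""
    return "Positive" if rating >= 4 else "Negative"

# Pair each review text with its binary label.
labeled = [(text, label_review(rating)) for rating, text in reviews]

for text, label in labeled:
    print(f"{label:8s} {text}")
```

Putting the 3s in the "Negative" group is a judgment call; a different cutoff would change the class balance.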

After that, I needed to get the data ready for analysis.  I preprocessed the text by removing punctuation, tokenizing, removing stopwords, and reducing the words to their lemmas.  For visualization, I split the reviews into "Positive" and "Negative" dataframes and used word clouds (the larger a word, the more often it appears) to show the common language in each.  I also printed the number of times each word appears in the text.
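As a rough sketch of those preprocessing steps, the example below uses only the Python standard library.  The stopword set and lemma table are tiny hand-made stand-ins (assumptions for illustration); a real run would more likely rely on an NLP library's stopword list and lemmatizer.  Lowercasing is also an assumption that the description above does not mention.

```python
import string
from collections import Counter

# Toy stand-ins (assumptions): real projects would use a full stopword list and a lemmatizer.
STOPWORDS = {"the", "was", "and", "were", "a", "very", "in", "is"}
LEMMAS = {"rooms": "room", "beds": "bed", "stayed": "stay"}

def preprocess(text):
    # 1. Lowercase and strip punctuation.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenize on whitespace.
    tokens = cleaned.split()
    # 3. Drop stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Replace each token with its lemma when the toy table knows one.
    return [LEMMAS.get(t, t) for t in tokens]

tokens = preprocess("The rooms were clean, and the beds were comfortable!")

# Count how often each word appears, as in the word-frequency printout.
word_counts = Counter(tokens)
print(tokens)
print(word_counts.most_common())
```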

Finally, I tried both Bag of Words and TF-IDF to turn the words into features for machine learning.  I ran them through Logistic Regression, LogisticRegressionCV, Random Forest Classifier, and Gradient Boosting Classifier.  After hyperparameter tuning, my top model was Bag of Words with LogisticRegressionCV at 88% accuracy!
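To show what the two feature representations compute, here is a small standard-library sketch.  The three token lists are invented, and the TF-IDF formula is the plain textbook variant (term frequency times log(N / document frequency)); libraries such as scikit-learn use a smoothed variant by default, so their numbers would differ.

```python
import math
from collections import Counter

# Hypothetical preprocessed reviews (already tokenized and lemmatized).
docs = [
    ["room", "clean", "staff", "friendly"],
    ["room", "dirty", "staff", "rude"],
    ["great", "location", "clean", "room"],
]

# Fixed, sorted vocabulary so every vector has the same column order.
vocab = sorted({tok for doc in docs for tok in doc})

def bag_of_words(doc):
    # Raw count of each vocabulary term in the document.
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def tfidf(doc):
    # Textbook TF-IDF: (count / doc length) * log(N / number of docs containing the term).
    n_docs = len(docs)
    counts = Counter(doc)
    vector = []
    for term in vocab:
        tf = counts[term] / len(doc)
        df = sum(1 for d in docs if term in d)
        vector.append(tf * math.log(n_docs / df))
    return vector

bow_matrix = [bag_of_words(d) for d in docs]
tfidf_matrix = [tfidf(d) for d in docs]

print(vocab)
print(bow_matrix[0])
print([round(x, 3) for x in tfidf_matrix[0]])
```

Bag of Words keeps raw counts, while TF-IDF down-weights words that appear in every review: "room" occurs in all three toy documents, so its TF-IDF weight is zero here.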

It was very interesting to work with text and NLP, and I would love to continue working on projects like this!

Analysis of hotel reviews with NLP to classify with ML