Predicting Yelp Reviews using BERT

Katie Gu
Published in Chatbots Life · May 14, 2020


Problem Statement and Background

The problem we are trying to solve in our project is: given a set of Yelp reviews, classify each review into one of five star categories based on its text. The data consists of text reviews of different businesses from Yelp. The reviews are not cleaned; they may contain capital letters, punctuation, and new lines. Some reviews also contain non-ASCII characters, such as non-Latin or accented characters, and we noticed misspelled words and non-English reviews as well. For our success measures, we decided to use the mean absolute error and accuracy on a validation set. This project may be useful to businesses that rely heavily on Yelp to attract customers, such as restaurants: they can use our results to predict the ratings of reviews and keep only the reviews with good ratings. It may also be useful to customers who use Yelp to decide which businesses to frequent, since the predicted ratings can help them decide which businesses are worth visiting.
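As a concrete illustration of these success measures, the snippet below computes the mean absolute (star) error and exact-match accuracy with NumPy; the rating arrays are made up for illustration, not our data.

```python
import numpy as np

# Hypothetical true and predicted star ratings (1-5); illustrative only.
y_true = np.array([5, 3, 1, 4, 2])
y_pred = np.array([4, 3, 2, 4, 5])

mae = np.mean(np.abs(y_pred - y_true))   # mean absolute (star) error
accuracy = np.mean(y_pred == y_true)     # exact-match accuracy

print(f"Average star error: {mae:.2f}, accuracy: {accuracy:.2f}")
```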


One of the resources we consulted to help us design our model architecture was the article “BERT Explained: State of the Art Language Model for NLP” by Rani Horev, published in Towards Data Science. This article helped us decide to use BERT by explaining its advantages, and it gave some useful training tips, such as using more training data. Another resource we consulted was “BERT, RoBERTa, DistilBERT, XLNet-Which One to Use?” by Suleiman Khan, also published in Towards Data Science. This article listed the pros and cons of each pretrained model architecture with regard to training time and performance and compared their test set results, which helped us find the right balance between the two factors. To familiarize ourselves with the deep learning development process, we read “A Hands-On Guide To Text Classification With Transformer Models (XLNet, BERT, XLM, RoBERTa)” by Thilina Rajapakse, published in Towards Data Science. This article gave us an overview of the different steps: data cleaning and preprocessing, model architecture and loss function design, training and validation, and testing.

Approach

To preprocess the data, we removed accents, special characters, punctuation, and new lines, and converted the review text to lowercase. Initially we also tried lemmatizing the data to link words with similar meanings, but this took far too long and did not significantly affect model accuracy, so we skipped it for the final model. We then corrected misspellings in uncommon words, i.e., words that appear fewer than two times in the corpus. We tokenized each review and added a CLS token before every review, to satisfy the input format of the BERT model. We also created masks and segments in order to feed the data into BERT. Finally, we selected an equal number of 1-star, 2-star, 3-star, 4-star, and 5-star reviews for our training data, to lower the bias caused by the high frequency of a single star type.
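A rough sketch of these cleaning and encoding steps is shown below. The Hugging Face BertTokenizer and the 300-token cap are used here for illustration; any BERT tokenizer that produces input ids, attention masks, and segment ids fits the same pattern.

```python
import re
import unicodedata
from transformers import BertTokenizer  # illustrative choice of tokenizer

def clean_review(text):
    # Strip accents/non-ASCII characters, remove punctuation and new lines, lowercase.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_review(text, max_len=300):
    # encode_plus prepends [CLS], appends [SEP], pads/truncates, and
    # returns the attention mask and segment (token type) ids BERT expects.
    enc = tokenizer.encode_plus(
        clean_review(text),
        max_length=max_len,
        truncation=True,
        padding="max_length",
    )
    return enc["input_ids"], enc["attention_mask"], enc["token_type_ids"]
```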

Furthermore, we wanted more training data, so we web-scraped about 12,175 additional Yelp reviews and ratings and added them to our training set. We also shuffled the reviews before training to randomize the data. Finally, we split the training data into chunks and iterated over them, which sped up training and avoided memory issues on the virtual machine.
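A simple NumPy sketch of this shuffle-and-chunk step (the chunk count and array names are illustrative):

```python
import numpy as np

def make_chunks(features, labels, n_chunks=5, seed=0):
    """Shuffle the training arrays and split them into equally sized chunks
    so each chunk fits comfortably into memory and can be trained on in turn."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    features, labels = features[idx], labels[idx]
    return list(zip(np.array_split(features, n_chunks),
                    np.array_split(labels, n_chunks)))

# Usage (illustrative): train on one chunk at a time.
# for chunk_x, chunk_y in make_chunks(X_train, y_train):
#     model.fit(chunk_x, chunk_y, epochs=10, batch_size=32)
```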

Description of Baseline Model:

The baseline architecture of our model is largely based on a BERT transformer model. We chose a BERT transformer over RNN or LSTM models because of its use of self-attention: BERT uses information from neighboring words to determine the encoding of the current word, which is useful because the sentiment of a word largely depends on its context. The transformer is also significantly more efficient than RNN or LSTM models; whereas an RNN needs O(N) sequential steps to encode a sentence, a transformer processes all tokens in parallel, in O(1) sequential steps. Since our task is classification, we chose the BERT encoder as opposed to a generative model.

Specifically, our baseline architecture consists of the BERT transformer encoder, a dropout layer with dropout probability 0.5, a linear layer, and a softmax layer to output probabilities. Since this is a multiclass classification problem, we used sparse categorical cross-entropy as our loss with an Adam optimizer. Our learning rate was 6 × 10⁻⁵.
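A minimal Keras sketch of this baseline is below. The TFBertModel from Hugging Face transformers stands in for whichever BERT implementation is used, and the 300-token input length is illustrative.

```python
import tensorflow as tf
from transformers import TFBertModel  # illustrative choice of BERT implementation

def build_baseline(max_len=300, n_classes=5):
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

    bert = TFBertModel.from_pretrained("bert-base-uncased")
    # Pooled [CLS] representation of each review.
    pooled = bert(input_ids, attention_mask=attention_mask).pooler_output

    x = tf.keras.layers.Dropout(0.5)(pooled)                              # dropout p = 0.5
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)   # linear + softmax

    model = tf.keras.Model([input_ids, attention_mask], outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=6e-5),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```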

Graphical Representation of Baseline:

Description of Final Model and Evolutionary Process

After testing our baseline model, we noticed that our average star error was not very good. We then tried a custom loss that weighted average star error and accuracy equally: the sparse categorical cross-entropy plus the absolute difference between our predicted and actual ratings. However, this loss did not improve our average star error by much, so we went back to plain sparse categorical cross-entropy.
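A TensorFlow sketch of a combined loss along these lines is shown below. To keep the star-error term differentiable, this version uses the expected star under the softmax rather than the arg-max prediction; that choice is an assumption made for illustration.

```python
import tensorflow as tf

def star_weighted_loss(y_true, y_pred):
    """Sparse categorical cross-entropy plus an absolute star-error term.
    y_true holds class indices 0-4, y_pred holds softmax probabilities.
    The expected star under the softmax keeps the error term differentiable;
    this is an illustrative choice, not necessarily the exact formulation we tried."""
    y_true = tf.reshape(tf.cast(y_true, tf.int32), [-1])
    ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
    stars = tf.range(1, 6, dtype=y_pred.dtype)               # star values 1..5
    expected_star = tf.reduce_sum(y_pred * stars, axis=-1)   # soft predicted star
    true_star = tf.cast(y_true, y_pred.dtype) + 1.0
    return ce + tf.abs(expected_star - true_star)
```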

We also experimented with learning rates and found that 5 × 10⁻⁵ worked best for us. What made the most dramatic improvement was adding two more linear layers after the first. We decided to do this because a single linear layer was not deep enough to generate accurate probabilities, as our text was very varied and complex. We tried different activation functions between the linear layers and found that ReLU worked best. Our final architecture was the BERT encoder, a dropout layer with dropout probability 0.5, two linear-ReLU layers, and a final linear layer followed by a softmax.
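A Keras sketch of this final architecture follows; the hidden layer size of 256 is illustrative, since the exact layer sizes were tuned separately.

```python
import tensorflow as tf
from transformers import TFBertModel  # same illustrative choice as in the baseline sketch

def build_final_model(max_len=300, hidden=256, n_classes=5):
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

    pooled = TFBertModel.from_pretrained("bert-base-uncased")(
        input_ids, attention_mask=attention_mask
    ).pooler_output

    x = tf.keras.layers.Dropout(0.5)(pooled)
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)   # first linear-ReLU layer
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)   # second linear-ReLU layer
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)  # final linear + softmax

    model = tf.keras.Model([input_ids, attention_mask], outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```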

Graphical Representation of Final Model:

Results

For our models, we used 10,000 reviews, with 80% of the data for the training set and 20% for the validation set. The data contained the given Yelp reviews plus the new Yelp reviews we web-scraped. As mentioned before, we randomly selected an equal number of reviews of each star type for our training data, to lower the bias caused by a single, very frequent star type.
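A pandas sketch of this balanced sampling and 80/20 split; the `stars` column name and the per-star count are illustrative.

```python
import pandas as pd

def balance_and_split(df, per_star=2000, seed=0):
    """Sample the same number of reviews for each star rating (1-5),
    shuffle, then split 80/20 into training and validation sets.
    per_star=2000 gives 5 * 2000 = 10,000 reviews in total."""
    balanced = (
        df.groupby("stars", group_keys=False)
          .apply(lambda g: g.sample(n=per_star, random_state=seed))
          .sample(frac=1.0, random_state=seed)   # shuffle the combined sample
          .reset_index(drop=True)
    )
    split = int(0.8 * len(balanced))
    return balanced.iloc[:split], balanced.iloc[split:]
```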

Baseline Model:

This plot demonstrates the training loss of 5 different data chunks over 10 epochs. We can see that as the model trains on more chunks, the training loss decays roughly like a negative exponential. It was surprising to us how quickly the baseline overfit to the training data.

This plot to the left demonstrates the training accuracy of 5 different data chunks over 10 epochs. As the model trains on more chunks, the training accuracy grows roughly like a logarithmic curve. We can see that the baseline model also overfits the training data after about 5 epochs.

The plot to the left demonstrates the validation loss of 5 different data chunks over 10 epochs. The loss seems to increase slightly and then decrease as the model sees more chunks and runs for more epochs.

This plot demonstrates the validation accuracy of 5 different data chunks over 10 epochs. The accuracy seems to decrease as the model sees more chunks and runs for more epochs.

This plot visualizes the Average Star Error and the Exact Match metrics on the Challenge datasets 3, 5, 6, and 8. The model seems to perform the best on dataset 8, since it has the lowest average star error and highest exact match value. The model doesn’t perform well on dataset 6, since it has a high average star error and low exact match value.

Final Model

This plot demonstrates the training loss of 5 different data chunks over 6 epochs. The loss decreases as the model runs for more epochs and the model trains on more chunks. The final model doesn’t overfit to the training data as the baseline model does.

The plot below demonstrates the training accuracy of 5 different data chunks over 6 epochs. The accuracy increases as the model runs for more epochs and sees more chunks. We see that the final model doesn’t overfit to the training data as the baseline model does, since the training accuracies are better constrained.

This plot demonstrates the average star error of 5 different training data chunks over 6 epochs. We see the average star error decrease significantly over the epochs and the chunks.

The plot below demonstrates the validation loss of 5 different data chunks over 6 epochs. Surprisingly, the loss seems to increase as the model runs for more epochs. This could be due to the model seeing new data.

This plot below demonstrates the validation accuracy of 5 different data chunks over 6 epochs. The accuracy increases as the model evaluates on more data chunks and increases only slightly with more epochs.

This plot below demonstrates the average star error of 5 different validation data chunks over 6 epochs. We see the average star error decrease over the epochs. The average star error for the final model is lower than that of the baseline model.

This table compares the metrics between the baseline and final models on the validation dataset.

In conclusion, we see that the final model performs better on the validation set, since it has a higher accuracy, lower loss, and lower average star error. The final model also performed better overall on the released challenge datasets than the baseline model did.

Tools

To clean the data, we used Pandas and NumPy. For tokenization and NLP parsing, we used NLTK and spaCy. We used the pyspellchecker library to help fix misspellings in the reviews. We used Octoparse for web-scraping Yelp, and we wrote the code for our BERT model using TensorFlow and the Keras API.
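As an example of the spell-correction step, here is a minimal pyspellchecker sketch that only corrects rare words; the corpus-wide counting and threshold mirror the approach described above, but the details are illustrative.

```python
from collections import Counter
from spellchecker import SpellChecker  # pyspellchecker

spell = SpellChecker()

def correct_rare_words(tokenized_reviews, min_count=2):
    """Correct the spelling of words that appear fewer than `min_count`
    times across the whole corpus, leaving common words untouched."""
    counts = Counter(w for review in tokenized_reviews for w in review)
    corrected = []
    for review in tokenized_reviews:
        corrected.append([
            (spell.correction(w) or w) if counts[w] < min_count else w
            for w in review
        ])
    return corrected
```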

Lessons Learned

This project mainly taught us how to approach a problem in a research-oriented manner while dealing with real-world constraints. On the infrastructure side, we initially used Google Colab and its free GPUs, but we soon had to move to Google Cloud Platform due to GPU overuse. Even on Google Colab, we were only granted access to 1 GPU, so we had to improvise and use only the first 300 words of each review (as opposed to our original plan of 512, the BERT limit). It was equally important to set the training-set size and batch size so as not to use too much memory while still getting good results (by training on multiple smaller training sets).
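For reference, truncating to the first 300 words is as simple as the snippet below, applied before tokenization; the word-level split is an illustrative simplification, since BERT's 512 limit is counted in subword tokens.

```python
def truncate_review(text, max_words=300):
    # Keep only the first 300 whitespace-separated words of each review
    # so the encoded sequence fits on a single GPU.
    return " ".join(text.split()[:max_words])
```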

We also learned that the way you handle your data can improve the model a lot. The Yelp data was skewed toward 1s and 5s, causing the model to predict mostly those values. We had good training and validation accuracy, but running on the test challenges showed us that our model wasn't guessing 2, 3, and 4 nearly enough. We therefore split up the data so that each review type made up 20% of the training set (i.e., equally distributed), and while this lowered our training accuracy a bit, it increased our test performance a lot (for example, from 0% on the challenge 5 dataset to around 40–50%).

When designing the model, we had many ideas that we wanted to test out. As we were working, we realized the best way to approach this was to build a bare-bones, working model that we could all fork to test architecture and parameter changes. We divided up the ideas among ourselves and figured out which architecture decisions didn't really matter (batch size, initialization type, choice of activation function), and we found values of learning rate and layer size that seemed to do well. We learned to be flexible when our innovations didn't work out and not to spend too much time on an architecture if it didn't show big improvements after about 10 epochs.

If we had more time, we would try to find a smarter way to split up the reviews (perhaps splitting each review into 300-word chunks, classifying each chunk separately, and averaging the predictions), find a way to get more GPUs to increase the word limit on our data when feeding it into the model, and try out some more architectures.

Team Contributions:

Katie: Cleaned and processed data, designed model architecture, wrote training/validation/testing code, wrote background and approach sections of the write-up. Percent Contribution: 35%

Jason: Brainstormed model ideas, helped code/train the final model, tested the modified loss function, wrote the lessons learned section. Percent Contribution: 15%

Krutika: Web-scraped new Yelp reviews, created all the visualizations of the baseline and final model results, worked on the approach and results sections of the write-up. Percent Contribution: 35%

Vera: Helped clean and analyze data, trained with random layer sizes and learning rates, wrote the tools section. Percent Contribution: 15%

Works Cited:

Horev, Rani. “BERT Explained: State of the Art Language Model for NLP.” Medium, Towards Data Science, 17 Nov. 2018, towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270.

Khan, Suleiman. “BERT, RoBERTa, DistilBERT, XLNet — Which One to Use?” Medium, Towards Data Science, 17 Oct. 2019, towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8.

Rajapakse, Thilina. “A Hands-On Guide To Text Classification With Transformer Models (XLNet, BERT, XLM, RoBERTa).” Medium, Towards Data Science, 17 Apr. 2020, towardsdatascience.com/https-medium-com-chaturangarajapakshe-text-classification-with-transformer-models-d370944b50ca.

Team Number: 6969420

Team Name: YeNLP
