In this homework you will be using the code you wrote in Lab 6 and the NLPScholar toolkit to train and evaluate different Bayesian classifiers on the IMDB sentiment analysis dataset from Stanford. This assignment is due Monday, Oct. 28.
HW2.py
As a first step, you will need to format the data from this dataset in a way that is usable with your classification code. Download the zip folder from the link above, look through the data, and get a sense of the folder's structure. Then create train, validation, and test sets in the correct format. Note: this is intentionally left a bit vague. Handling data preprocessing like this is an important skill in NLP and will help you as you work on your final project. Take this as an opportunity to productively struggle through it.
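If you are stuck on where to begin, here is one possible preprocessing sketch, not the required solution. It assumes the standard aclImdb layout (train/pos, train/neg, test/pos, test/neg, one review per .txt file) and writes two-column TSV files; the exact format your Lab 6 code expects may differ, and the 90/10 validation split is just one reasonable choice.

from pathlib import Path
import csv
import random

def load_split(root, split):
    # Collect (text, label) pairs from aclImdb/<split>/{pos,neg}.
    examples = []
    for label in ("pos", "neg"):
        for path in (Path(root) / split / label).glob("*.txt"):
            # Strip tabs/newlines so each review fits on one TSV row.
            text = path.read_text(encoding="utf-8").replace("\t", " ").replace("\n", " ")
            examples.append((text, label))
    random.shuffle(examples)
    return examples

def write_tsv(examples, out_path):
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["text", "label"])  # header; adjust to what your code expects
        writer.writerows(examples)

train = load_split("aclImdb", "train")
test = load_split("aclImdb", "test")
cut = int(0.9 * len(train))  # hold out 10% of train as validation
write_tsv(train[:cut], "train.tsv")
write_tsv(train[cut:], "valid.tsv")
write_tsv(test, "test.tsv")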
Use the code you wrote in Lab 6 to build a Bayesian classifier that uses a bigram language model with add-0.01 smoothing. Report the accuracy of your model on the test split of your data in your Google Doc.
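As a reminder of the core computation, here is a minimal sketch of the add-0.01-smoothed bigram naive Bayes decision rule. This is not your Lab 6 code; all names are illustrative, and it assumes you have already built per-class bigram and unigram count tables from the training reviews.

import math

K = 0.01  # add-k smoothing constant

def bigram_logprob(tokens, bigrams, unigrams, vocab_size):
    # bigrams and unigrams are count tables (e.g., collections.Counter)
    # built from the training reviews of a single class.
    total = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, curr)] + K) /
                          (unigrams[prev] + K * vocab_size))
    return total

def classify(tokens, models, priors):
    # models[c] = (bigrams_c, unigrams_c, vocab_size); choose the class
    # whose LM score plus log prior is highest (naive Bayes decision).
    return max(models,
               key=lambda c: math.log(priors[c]) + bigram_logprob(tokens, *models[c]))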
Use NLPScholar and the code you wrote in Lab 6 to build a Bayesian classifier that uses the distilbert/distilbert-base-cased language model (not the text classification model). Report the accuracy of your model on the test split of your data. You should finetune the model for one epoch on the relevant data. Note: while training is relatively quick, evaluating a masked language model on data can take time. Make use of Turing jobs, start small, and leave yourself time.
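Once you have per-token probabilities from each class's finetuned model, the Bayesian decision is the same as in the bigram case: sum the log probabilities for each review under each class's model and compare. A rough sketch follows; the column names sentid and prob and the file names are placeholders, so check the actual header of the TSV that NLPScholar produces. Equal class priors are assumed, which is reasonable if your splits are balanced.

import math
import csv
from collections import defaultdict

def doc_logprobs(tsv_path):
    # Sum per-token log probabilities into one score per review.
    scores = defaultdict(float)
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            scores[row["sentid"]] += math.log(float(row["prob"]))  # placeholder columns
    return scores

pos = doc_logprobs("eval_pos_model.tsv")  # reviews scored by the pos-finetuned LM
neg = doc_logprobs("eval_neg_model.tsv")  # same reviews scored by the neg-finetuned LM
preds = {doc: ("pos" if pos[doc] > neg[doc] else "neg") for doc in pos}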
In your Google Doc, also describe how building a Bayesian classifier using distilbert/distilbert-base-cased is different from using this model to run a TextClassification experiment in NLPScholar.
Answer the following questions in the Google Doc template:
The result of evaluate for both the bigram and neural LM was a TSV file with one row per token, along with probability estimates associated with each word. Conceptually, what do the probabilities represent for each of the models?
Pick a few reviews and describe the differences you observe between the two models' token-by-token probabilities.
Which model was ultimately better at sentiment classification? Why does this make sense?
The bigram-LM-based Bayesian classifier you trained was only one of many types of models you could have trained. There are at least three properties (or hyperparameters) of your bigram model that you can modify. List what these are. Which of these properties do you think is most likely to have an impact on classification accuracy? Why?
For the hyperparameter that you think is most likely to have an impact on classification accuracy, concretely explore how different settings of this hyperparameter influence classification accuracy (see the sketch after these questions for one way to set up such a sweep).
Which of the hyperparameter settings that you tried resulted in the best classification accuracy? Why do you think this was the case?
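For the exploration question above, a sweep can be as simple as looping over candidate values. The snippet below varies the smoothing constant purely as an example (you may well pick a different property), and train_classifier and accuracy are hypothetical stand-ins for your own Lab 6 functions.

# Illustrative only: train_classifier and accuracy stand in for your Lab 6 code.
for k in (0.001, 0.01, 0.1, 1.0):
    model = train_classifier(train_data, smoothing=k)
    print(k, accuracy(model, valid_data))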