The purpose of this lab is to introduce you to the NLPScholar toolkit we will be using in this class and to serve as a Python refresher. By completing this lab, you will demonstrate that you can use the toolkit and work comfortably in Python.
This lab assumes that you have already cloned the NLPScholar repository and have installed the `nlp` environment by following the instructions in `Install.md`.
This lab has three parts:
Before starting each lab, get the latest version of the NLPScholar repository by first navigating to the repository folder in your terminal and then executing
git pull
Read through the README of the toolkit. Use the Google Doc template to answer the following questions:
Which experiment and mode would you use if you want to:
Train a model to classify whether a given sentence is talking about the election.
Identify the sentiment of each of the words in a dataset of movie reviews given a model that is already trained on this task.
Find the word-by-word probability of an interesting sentence you found on the internet.
Find the average accuracy of an existing part-of-speech tagger.
Write a config file that can train a `roberta-large` model on the `wikitext-103-v1` dataset, which is a configuration within the larger `Salesforce/wikitext` dataset (see the dataset-loading sketch after these questions for how the two names relate). Set the `modelfpath` to `wiki_model`.
Where will the model that you trained in step 2 be saved?
You’ve trained a model to detect sarcasm and called it `sarcasm_model`. Evaluate this model on a new set of sentences called `test.tsv`, which is stored in the folder `data/sarcasm/`. Set the `predfpath` to `test_results.tsv`.
Where will the predictions you generated be saved?
Write a config file that will use the pretrained `huggingartists/taylor-swift` causal language model and give you word-by-word predictability estimates for any sentence that you enter.
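For reference, the two dataset names in the training question fit together as a dataset repository and one of its configurations on the Hugging Face Hub. Your NLPScholar config file takes care of loading the data for you; the standalone sketch below (which uses the Hugging Face `datasets` library directly and is not part of the lab deliverable) just shows how the two names relate:

```python
from datasets import load_dataset

# "Salesforce/wikitext" is the dataset repository on the Hugging Face Hub;
# "wikitext-103-v1" is one configuration (subset) inside that repository.
wiki = load_dataset("Salesforce/wikitext", "wikitext-103-v1")

print(wiki)  # prints the available splits (train/validation/test) and their sizes
```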
Complete the two functions in Lab1.py. Make sure to read the function headers and docstrings carefully.
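One of the Part 3 questions asks about word frequencies. The actual function names, signatures, and tokenization requirements are the ones given in `Lab1.py`; purely as a Python refresher, here is a minimal sketch (with a hypothetical name and a simple notion of "word") of how `collections.Counter` can be used for this kind of counting:

```python
from collections import Counter
import re

def word_frequencies(fpath: str) -> Counter:
    """Count how often each word appears in a plain-text file.

    Hypothetical helper for illustration only; follow the headers
    and docstrings in Lab1.py for the actual requirements.
    """
    with open(fpath, encoding="utf-8") as f:
        text = f.read().lower()
    words = re.findall(r"[a-z']+", text)  # one simple word definition; Lab1.py may specify another
    return Counter(words)

# Example usage (ties among rare words make the "10th least frequent" somewhat arbitrary):
# freqs = word_frequencies("through-the-looking-glass.txt")
# print(freqs.most_common(1))       # most frequent word and its count
# print(freqs.most_common()[-10])   # 10th least frequent word and its count
```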
Use the code you wrote in Part 2 and the Google Doc template to answer the following questions:
What is the most frequent word in the `through-the-looking-glass.txt` file? What is the 10th least frequent word?
For the least frequent word, think of a sentence containing that word. What is the probability of that word in that sentence according to the masked language model `distilbert-base-uncased`? Does that probability accord with the word's frequency in our corpus? If not, what might cause this difference? (A sketch after these questions shows one way to query the model for such a probability directly.)
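The toolkit is the intended way to get this probability, but as an illustration of what such a query computes, here is a sketch that asks `distilbert-base-uncased` directly through the Hugging Face `transformers` library. The example sentence and target word are placeholders (swap in your own), and the approach only works cleanly when the target word is a single token in the model's vocabulary:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# Placeholder sentence and target word; substitute your own.
sentence = "Alice climbed through the looking glass into the garden."
target = "glass"

# Mask the target word and run the model on the masked sentence.
masked = sentence.replace(target, tokenizer.mask_token, 1)
inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the vocabulary at the masked position.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = torch.softmax(logits[0, mask_index], dim=-1)

# Only meaningful if the target is a single token in the model's vocabulary;
# unknown or multi-token words need extra handling.
target_id = tokenizer.convert_tokens_to_ids(target)
print(f"P({target!r} | context) = {probs[0, target_id].item():.6g}")
```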