COSC 426 F24 Lab 8
COSC 426 F24 Lab 8
Introduction
In this lab you will build and evaluate a model for token classification. By completing this lab, you will demonstrate that you are able to train, evaluate, and use a token classification model for a real world application.
Provided files
Lab8.py
- A google doc template
What to submit
Lab8.py
- A pdf of your google doc
Part 0: Setting Up
Pull the most recent version of NLPScholar
using git pull
.
Part 1: Our Goal (across the lab and the homework)
Our broad goal across the homework and the lab is to build a prototype news search engine. As a motivating example, consider the following text:
while eating an apple, the founder of apple thought of a logo in san francisco
We’d like to be able to label this text with labels, like that the second apple and not the first apple is an organization and that san francisco is a location. That way, people can search for apple and return only those articles about the corporation and not the food, for example.
In the lab, we will train a model, a named-entity recognition model, that labels text with their tags. We can use these tags to extract phrases which we call entities. In the homework, we will use our trained model to build a small search engine over news data.
For training our model, we need data labeled with entity tags. Data is given in
ner_news_data
. For using our model for search, we need news articles. That
data is provided in news_data
. These folders include information about the
data. Please review this.
Lab.py
includes functions with docstrings that will help scaffold your
approach. Please review these and follow up with your instructor prior to
continuing on.
Question 1: Sketch out the task of the lab. What will you do? What data will you use? What will your functions do?
Question 2: Sketch out the task of the hw. What will you do? What data will you use? What will your functions do?
Part 2: Training an NER Model
In this part, you will train the NER model you will use in the homework. Data is
provided in the ner_news_data
folder. You should formulate a config file that
trains bert-base-cased
for one epoch on token classification. Please review
the following folder on NLPScholar for an example of training an token
classification model on part-of-speech tagging:
link. You will submit this config file.
Part 3: Extracting Entities
In this part, you will use your trained NER model to extract entities from some
news data. For that, we need to accomplish two things, (i) formatting our test data
in the format needed for evaluate
mode with TokenClassification
and (ii)
extracting three types of entities from the text.
For (ii), we are focusing on three tags, LOC
for location, ORG
for
organization, and PER
for person. Notice that tags come with an initial bit of
information, B
, I
, L
, U
. These mean beginning, inside, last, or unit,
respectively. Consider the following illustrative example of predicted tags to
help understand our goal (the tags aren’t necessarily correct, just trying to
give a motivating example):
Word | Tag |
---|---|
Bill | U-Per |
Peter | U-Per |
and | O |
Melinda | B-Per |
Gates | L-Per |
went | O |
to | O |
visit | O |
Orlando | B-LOC |
Disney | B-ORG |
World | I-ORG |
Saratoga | B-LOC |
Springs | L-ORG |
We would extract entities PER: [‘Bill’, ‘Pete’, ‘Melinda Gates’], LOC: [‘Orlando’, ‘Saratoga’], and ORG: [‘Disny World’, ‘Springs’]. Note, how U yields an entity immediately, and how B/I/L are or are not referenced/used.
You will submit your evaluation config file and Lab8.py
which contains the
functions you need.