COSC 426 F24 Lab 8

COSC 426 F24 Lab 8

Introduction

In this homework you will build on Lab 8 to build a small news search engine. In this lab you will build and evaluate a model for token classification. By completing this homework, you will demonstrate that you are able to use a token classification model for a real world application.

Provided files

What to submit

Part 0: Setting Up

Pull the most recent version of NLPScholar using git pull.

Part 1: Our Goal (across the lab and the homework)

Our broad goal across the homework and the lab is to build a prototype news search engine. As a motivating example, consider the following text:

while eating an apple, the founder of apple thought of a logo in san francisco

We’d like to be able to label this text with labels, like that the second apple and not the first apple is an organization and that san francisco is a location. That way, people can search for apple and return only those articles about the corporation and not the food, for example.

In the lab, you built a model which tags data with labels. In the homework, you will use this model on news data (contained in news_data) to build a news search engine.

HW3.py includes functions with docstrings that will help scaffold your approach. Please review these and follow up with your instructor prior to continuing on if you don’t understand what is provided.

Part 2: Building a search engine

By the end of the homework, you should have a sample interface like:

Entity tag (PER|LOC|ORG|STOP): DATE
Tag not in database
Entity tag (PER|LOC|ORG|STOP): LOC
Search phrase: Syracuse
         Four years ago, a jury convicted Robert Neulander, a prominent
         doctor in central New York, of killing his wife in their
         suburban home, seemingly ending a complicated case that had
         riveted the Syracuse area. It was actually about to get more
         complicated. On the day of the verdict, an alternate juror
         contacted a lawyer for Mr. Neulander.

         While this system didn't break any snow records in Syracuse, it brought
         plenty of cold. Record low maximum for Nov. 12 was shattered by five
         degrees, the new record of 27 was set on Tuesday and a new record low
         was also set in Syracuse for this date as the mercury dipped to the
         lower teens.

         Four years ago, a jury convicted Robert Neulander, a prominent
         doctor in central New York, of killing his wife in their
         suburban home, seemingly ending a complicated case that had
         riveted the Syracuse area. It was actually about to get more
         complicated. On the day of the verdict, an alternate juror
         contacted a lawyer for Mr. Neulander.

         Spencer Martin re-assigned to Syracuse and two forwards sent to
         Orlando.

Entity tag (PER|LOC|ORG|STOP): ORG
Search phrase: Colgate University
Phrase not found

Think through the steps of this. You will need to label the news data for tags. You need to extract entities. And then, you need to build a database to allow for searching. Use your Lab8.py. In case your model doesn’t work well, I’ve provided a trained ner model following the Lab 8 instructions. You should be able to access it on turning at the following location (you should be able to reference it in your config files):

/datalake/fdavis/COSC426/ner_model