COSC 426 F24 HW 1

In Lab 3 we worked with the Standard American English dialect. Some of the sentences that were ungrammatical in this dialect can be grammatical in other dialects. For example, the following sentences are considered grammatical by (at least some) speakers of Indian English in specific situations.

The following sentences, however, are considered ungrammatical.

Note that these sentences are not a comprehensive set of grammatical and ungrammatical sentences; they are just illustrative examples.

Your goal in this homework is to evaluate whether the pretrained LM distilbert/distilgpt2 treats sentences from Indian English as grammatical.

Provided files

To submit

Make sure that your grammar from Lab 3 works as intended before trying to modify it.

Part 1: Adapting grammar to Indian English

Based on the examples above, what is the difference between Standard American English and Indian English dialects? Modify your grammar so it can accept sentences from Indian English. Write test cases to test your grammar.
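
One way to organize the test cases is a small harness that checks every sentence against an acceptance function. The sketch below is only illustrative: `accepts` stands in for whatever parser call your Lab 3 grammar exposes, `toy_accepts` is a hypothetical placeholder, and the toy sentences are not the Indian English examples from this assignment.

```python
from typing import Callable, Iterable, List, Tuple


def run_grammar_tests(
    accepts: Callable[[str], bool],
    should_accept: Iterable[str],
    should_reject: Iterable[str],
) -> List[Tuple[str, bool, bool]]:
    """Return (sentence, expected, actual) for every test case that fails."""
    cases = [(s, True) for s in should_accept] + [(s, False) for s in should_reject]
    failures = []
    for sentence, expected in cases:
        actual = accepts(sentence)
        if actual != expected:
            failures.append((sentence, expected, actual))
    return failures


def toy_accepts(sentence: str) -> bool:
    """Hypothetical stand-in for a real parser: accepts 'the NOUN VERB' only."""
    nouns = {"panda", "dog"}
    verbs = {"slept", "laughed"}
    words = sentence.split()
    return (
        len(words) == 3
        and words[0] == "the"
        and words[1] in nouns
        and words[2] in verbs
    )


failures = run_grammar_tests(
    toy_accepts,
    should_accept=["the panda slept", "the dog laughed"],
    should_reject=["panda the slept", "the panda"],
)
print(failures)  # an empty list means every test case behaved as expected
```

To use this with your own grammar, replace `toy_accepts` with a function that returns True when your parser finds at least one parse, and include both sentences that should be accepted and sentences that should still be rejected after your modification.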

Part 2: Setting up the appropriate minimal pair contrast

In this part your goal is to evaluate whether the pretrained LM distilbert/distilgpt2 treats sentences from Indian English as grammatical. One way to do this is to embed the sentence in a fronted sentential complement.

For example, if you wanted to test whether a sentence like "the panda gave/sent/lent the sandwich" is treated as grammatical, you could compare the following minimal pairs.

You could also swap out "annoyed" for verbs like "perplexed" and "surprised".

In the Google Doc template, answer the following questions:

  1. Which word(s) in the minimal pair would you look at and why? (i.e., what is the ROI)
  2. If distilbert/distilgpt2 considered sentences from Indian English to be grammatical, what patterns would you expect to see in the microdiff column or the accuracy column of your results file? Why?
  3. Test your intuitions using the interact mode for one minimal pair. Include screenshots. What do you observe and what do you think it tells you?
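
The comparison behind questions 1 and 2 can be sketched in a few lines. The sketch assumes that the model assigns each token a surprisal (negative log probability), that the ROI is given as a list of token positions, and that a pair counts as correct when the ROI surprisal is lower in the grammatical sentence. The function names and surprisal values are made up for illustration, and the exact definitions of the microdiff and accuracy columns in NLP Scholar may differ, so check its documentation.

```python
from typing import List, Sequence


def roi_surprisal(surprisals: Sequence[float], roi: Sequence[int]) -> float:
    """Sum the per-token surprisals at the ROI positions (0-indexed)."""
    return sum(surprisals[i] for i in roi)


def pair_difference(
    grammatical: Sequence[float],
    ungrammatical: Sequence[float],
    roi_grammatical: Sequence[int],
    roi_ungrammatical: Sequence[int],
) -> float:
    """ROI surprisal of the ungrammatical sentence minus the grammatical one.

    A positive value means the model found the grammatical ROI less surprising.
    """
    return roi_surprisal(ungrammatical, roi_ungrammatical) - roi_surprisal(
        grammatical, roi_grammatical
    )


def accuracy(differences: List[float]) -> float:
    """Fraction of minimal pairs with a positive difference."""
    if not differences:
        raise ValueError("need at least one minimal pair")
    return sum(d > 0 for d in differences) / len(differences)


# Made-up per-token surprisals for two minimal pairs; the ROI is token 2.
diff_a = pair_difference([2.0, 5.0, 3.5, 4.0], [2.0, 5.0, 7.5, 4.0], [2], [2])
diff_b = pair_difference([1.0, 6.0, 2.0], [1.0, 6.0, 1.5], [2], [2])
print(diff_a, diff_b)              # 4.0 -0.5
print(accuracy([diff_a, diff_b]))  # 0.5
```

Restricting the comparison to the ROI matters because the two sentences in a pair can differ in length or in words outside the critical region, which would otherwise contaminate a whole-sentence comparison.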

Part 3: Does distilgpt2 treat sentences from Indian English as being grammatical?

In this part you should use the NLP Scholar pipeline to more systematically evaluate whether distilbert/distilgpt2 treats Indian English sentences as being grammatical.

Here are some things you should figure out before you run the pipeline:

In the Google Doc template, answer the following questions:

  1. What was your prediction? Why?
  2. What do you observe? Were any of the results surprising? Why or why not?

Part 4: Finetuning distilgpt2 on sentences from Indian English

Generate 10000 sentences from your grammar with a maximum depth of 6. Finetune the distilbert/distilgpt2 model on this data. Use 90% of the sentences for training and 10% for validation.
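
The 90/10 split can be done with a seeded shuffle. This is a minimal sketch: `split_train_valid` is a hypothetical helper, the placeholder sentences stand in for the output of your grammar, and the file names and format that the finetuning pipeline expects are not shown here.

```python
import random
from typing import List, Sequence, Tuple


def split_train_valid(
    sentences: Sequence[str],
    valid_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the sentences and split it into (train, valid)."""
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be between 0 and 1")
    shuffled = list(sentences)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Placeholder data standing in for the 10000 generated sentences.
sentences = [f"placeholder sentence {i}" for i in range(10000)]
train, valid = split_train_valid(sentences)
print(len(train), len(valid))  # 9000 1000
```

Shuffling before splitting avoids a validation set made up only of the last sentences generated, and fixing the seed lets you reproduce the same split if you need to rerun finetuning.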

Part 5: Does finetuning change how distilgpt2 treats sentences from Indian English?

Evaluate your finetuned model on the same sentences from Indian English.

In the Google Doc template, answer the following questions:

  1. What was your prediction? Why?
  2. What do you observe? Were any of the results surprising? Why or why not?

Part 6: Discussion/Reflection

What are the limitations of the experiment you ran? What are some changes you would make to the experimental setup if you wanted to more robustly study the following questions:

  1. Does distilgpt2 treat sentences from different dialects as being equally grammatical?
  2. Does finetuning distilgpt2 on specific dialects result in the model treating sentences from the dialects as being more grammatical?