Ling 472/CSE 472: Introduction to Computational Linguistics
Spring 2017
Final project information
Project specifications
- All projects must be evaluated in terms of precision and recall.
- The project must include a two (or more) stage experiment in which
the second stage tries to improve on the first by changing some aspect
of the technique, an evaluation in terms of precision and recall (for
each stage), and a comparison to some baseline. A minimal sketch of
this kind of evaluation follows this list.
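For concreteness, here is a minimal sketch (in Python) of the kind of
staged evaluation described above; the toy labels and the
always-guess-the-target baseline are hypothetical placeholders, not a
required design.

    # Minimal precision/recall sketch over parallel lists of gold and
    # predicted labels; one class ("positive") is treated as the target.

    def precision_recall(gold, predicted, target="positive"):
        tp = sum(1 for g, p in zip(gold, predicted) if p == target and g == target)
        fp = sum(1 for g, p in zip(gold, predicted) if p == target and g != target)
        fn = sum(1 for g, p in zip(gold, predicted) if p != target and g == target)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return precision, recall, f1

    gold     = ["positive", "negative", "positive", "positive", "negative"]
    baseline = ["positive"] * len(gold)   # toy baseline: always guess the target
    stage1   = ["positive", "negative", "negative", "positive", "negative"]
    for name, preds in [("baseline", baseline), ("stage 1", stage1)]:
        p, r, f = precision_recall(gold, preds)
        print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")

A stage 2 run would be reported the same way, alongside the baseline
and stage 1 numbers.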
Sample project ideas
- Create a morphological analyzer for a morphologically complex
language. Evaluate on short running text (held out for evaluation purposes).
Stage one assumes known roots. Stage two provides morphological
"guesser" facilities. Measure precision, recall, and ambiguity.
- Write a program that classifies email texts as indicating that
a file should be attached or not. Stage one uses hand-built linguistic
cues. Stage two extracts n-grams correlated with attachments
automatically (see the sketch following this list).
- Write a program to transliterate some non-ASCII text to an
ASCII-based writing system. Stage one uses only completely regular
rules (see the rule-based sketch following this list). Stage two allows
for exceptions. Evaluate precision and recall in both directions.
- Write a program that uses the data in PanLex
to detect words which have been borrowed from one language into another,
and determine which language is the source language. The phonotactic
similarity of the word to words in each language can be a clue as to
which language is the source. Evaluate precision by taking a sample
of results and verifying in dictionaries whether there has been borrowing
and, if so, what the direction was. Evaluate recall by taking a sample
of known borrowings from the dictionary which are represented in the
Translation Graph and seeing which are found. Hard!
- Using PanLex, measure to what extent translational equivalents
share (coarse-grained) part of speech across languages. Phase one: Gather gold
standard POS dictionaries for a sample of languages in the Translation
Graph. Develop a mapping between the POS tags in these resources, perhaps
generalizing away from the most fine-grained distinctions. Project POS
tags from one language to another (or to others) and then measure the
precision and recall of [pos, lemma] pairs against the gold standard.
Phase two:
use two languages as input languages, or look into using language-internal
morphological cues as an additional data source.
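As a rough illustration of the second stage of the email attachment
idea above, here is a sketch (in Python) that ranks n-grams by how
strongly they are associated with attachment-bearing messages; the toy
messages, the unigram/bigram cutoff, and the add-one-smoothed log-odds
score are all assumptions made for the example.

    # Sketch: find n-grams correlated with attachment emails (toy data).
    import math
    from collections import Counter

    def ngrams(tokens, n_max=2):
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n])

    def ranked_cues(labeled_emails, n_max=2):
        attach, no_attach = Counter(), Counter()
        for text, has_attachment in labeled_emails:
            grams = set(ngrams(text.lower().split(), n_max))
            (attach if has_attachment else no_attach).update(grams)
        vocab = set(attach) | set(no_attach)
        total_a = sum(attach.values()) + len(vocab)
        total_n = sum(no_attach.values()) + len(vocab)
        # Add-one-smoothed log-odds of each n-gram appearing in the
        # attachment class vs. the no-attachment class.
        scores = {g: math.log((attach[g] + 1) / total_a)
                     - math.log((no_attach[g] + 1) / total_n)
                  for g in vocab}
        return sorted(scores, key=scores.get, reverse=True)

    emails = [
        ("please see the attached file", True),
        ("i have attached the report", True),
        ("see you at the meeting tomorrow", False),
        ("thanks for the update", False),
    ]
    print(ranked_cues(emails)[:5])   # highest-scoring cues, e.g. "attached"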
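Similarly, as a (very much simplified) illustration of stage one of the
transliteration idea, here is a rule-based sketch (in Python); the toy
Cyrillic-to-Latin rule table and the longest-match-first strategy are
assumptions for the example, not a prescribed design.

    # Toy rule-based transliteration (stage one: completely regular rules).
    # The rule table covers only a few Cyrillic letters and is purely
    # illustrative; a real system needs a full table for the chosen language.

    RULES = {
        "щ": "shch", "ш": "sh", "ч": "ch", "ж": "zh",
        "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
        "к": "k", "о": "o", "т": "t",
    }

    def transliterate(text, rules=RULES):
        # Apply the longest matching rule at each position; characters
        # with no rule pass through unchanged.
        keys = sorted(rules, key=len, reverse=True)
        out, i = [], 0
        while i < len(text):
            for k in keys:
                if text.startswith(k, i):
                    out.append(rules[k])
                    i += len(k)
                    break
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    print(transliterate("кошка"))   # -> "koshka"

Stage two could then layer an exception lexicon on top of these rules.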
Project presentation and write-up
- Initial project plan due 4/21, specifying task to be attempted,
data to be used, baseline, and means of measuring precision, recall
and any additional metrics.
- Revised project plan due 5/5, responding to feedback from instructors.
- Phase 1 results, together with an outline of the project write-up,
due 5/29.
- Presentation/demonstration of your working system in class 6/1 or
6/2.
- Final project in executable state (i.e., so we can run it on patas
and see the results) + write-up due 11:59pm on 6/7.
- Write-up requirements (8 pages, double-spaced):
- Background: What is the problem, how are you approaching it,
and what are your two stages?
- Data: What data are you using, where did you get it, what is
your gold standard, how is your data divided between development and test?
- Methodology, stage 1: How did your Stage 1 system work?
- Results: What is your baseline? Precision and Recall for the baseline
and the stage 1 system, comparison to the baseline, and f-measure or
additional measures if applicable. Your presentation of your results
should include a table showing P and R for the baseline, stage 1, and
stage 2. NB: Even if your system returns a label for every input, you
can still calculate precision and recall separately by considering one
class (in a binary system) as the target. (For N-way classifications,
you can calculate P and R for each label by looking at that label
vs. everything else; a minimal one-vs-rest sketch appears after this
list.)
- Methodology, stage 2: What did you change?
- Results: Precision and Recall for stage 2, comparison to
stage 1 and baseline.
- Detailed instructions explaining how we can run your code on patas
to verify that it runs. Note that we only need to run the testing (or
decoding) step, not any training steps. If your system runs on condor,
include a condor submit file. If your system will take more than a few
minutes to run, warn us of that in the write-up.
- Discussion: What are the implications of this project for
broader inquiry in computational linguistics?
- Bibliography: For any resources you are using (corpora, toolkits,
etc.), you should include a proper citation, both in the text (as
(Author, year)) and in the bibliography. Likewise for any works cited.
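To make the per-label calculation mentioned in the Results requirement
above concrete, here is a minimal one-vs-rest sketch (in Python); the
part-of-speech-style label names and the toy counts are made up for
illustration.

    # One-vs-rest precision and recall per label for an N-way classifier,
    # computed over parallel lists of gold and predicted labels.

    def per_label_pr(gold, predicted):
        results = {}
        for label in sorted(set(gold) | set(predicted)):
            tp = sum(1 for g, p in zip(gold, predicted) if p == label and g == label)
            fp = sum(1 for g, p in zip(gold, predicted) if p == label and g != label)
            fn = sum(1 for g, p in zip(gold, predicted) if p != label and g == label)
            precision = tp / (tp + fp) if (tp + fp) else 0.0
            recall = tp / (tp + fn) if (tp + fn) else 0.0
            results[label] = (precision, recall)
        return results

    gold      = ["N", "V", "N", "ADJ", "V", "N"]
    predicted = ["N", "N", "N", "ADJ", "V", "ADJ"]
    for label, (p, r) in per_label_pr(gold, predicted).items():
        print(f"{label}: P={p:.2f} R={r:.2f}")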
Note: 50% of the project grade will be allocated to the write-up.
The other 50% will be for the project itself, but we can only understand
the project via the write-up. (We will not be digging through everyone's
source code.) In other words: the write-up is very important.
If you are submitting files as an archive to Canvas, please:
- Submit the write-up separately, as a .pdf.
- Use zip or gzip, not rar or other more exotic archiving systems.
Group work
- You are encouraged to work in pairs.
- For partner projects:
- the project plan must include a description of how the work will be allocated.
- the project write-up must include a clear description of who did what.