Ling 472/CSE 472: Introduction to Computational Linguistics
Spring 2012
Final project information
Project specifications
- All projects must be evaluated in terms of precision and recall.
- The project must include a two-stage (or more) experiment in which
the second stage tries to improve on the first by changing some aspect
of the technique, an evaluation in terms of precision and recall (for
each stage), and a comparison to some baseline. (A minimal sketch of
this experimental structure is given below.)
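For concreteness, here is one way this experimental structure might look in
code: a single precision/recall function applied to baseline, stage 1, and
stage 2 predictions over the same held-out items. This is only a minimal
Python sketch; the labels and toy predictions are invented for illustration.

    def precision_recall(gold, predicted, target):
        """Precision and recall for one target class, given parallel
        lists of gold and predicted labels over the same test items."""
        tp = sum(1 for g, p in zip(gold, predicted) if p == target and g == target)
        fp = sum(1 for g, p in zip(gold, predicted) if p == target and g != target)
        fn = sum(1 for g, p in zip(gold, predicted) if p != target and g == target)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Toy data: gold labels plus predictions from a majority-class baseline
    # and two hypothetical system stages (labels invented for illustration).
    gold     = ["attach", "attach", "no-attach", "no-attach", "attach"]
    baseline = ["attach"] * 5
    stage1   = ["attach", "no-attach", "no-attach", "attach", "attach"]
    stage2   = ["attach", "attach", "no-attach", "no-attach", "attach"]

    for name, preds in [("baseline", baseline), ("stage 1", stage1), ("stage 2", stage2)]:
        p, r = precision_recall(gold, preds, target="attach")
        print(f"{name}: P={p:.2f} R={r:.2f}")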
Sample project ideas
- Create a morphological analyzer for a morphologically complex
language. Evaluate on short running text (held out for evaluation purposes).
Stage one assumes known roots. Stage two provides morphological
"guesser" facilities. Measure precision, recall, and ambiguity.
- Write a program that takes the collected tweets
from the class and classifies them according to their subtypes
(based on the hashtags we will develop). Stage one uses unigrams as
classification features; stage two uses higher-order n-grams, the
output of syntactic parsing, or some additional resource.
Evaluate precision and recall.
- Write a program that takes running text from N languages (labelled
with language ID) as training data and produces a system able to classify
new text according to which language it represents. Stage one uses character
n-grams. Stage two incorporates frequency counts for language-specific
high-frequency words. Evaluate precision and recall. (A minimal sketch of
the character n-gram stage appears after this list.)
- Write a program that classifies email texts as indicating that
a file should be attached or not. Stage one uses hand-built linguistic
cues. Stage two extracts n-grams correlated with attachments automatically.
- Write a program to transliterate some non-ASCII text to an
ASCII-based writing system. Stage one uses only completely regular
rules. Stage two allows for exceptions. Evaluate precision and recall
in both directions. (A minimal sketch of the two stages also appears
after this list.)
- Write a program that uses the data in PanLex
to detect words which have been borrowed from one language into another,
and determine which language is the source language. The phonotactic
similarity of the word to words in each language can be a clue as to
which language is the source. Evaluate precision by taking a sample
of results and verifying in dictionaries whether there has been borrowing
and, if so, what the direction was. Evaluate recall by taking a sample
of known borrowings from the dictionary which are represented in the
Translation Graph and seeing which are found. Hard!
- Using PanLex, measure to what extent translational equivalents
share (coarse-grained) part of speech across languages. Phase one: Gather gold
standard POS dictionaries for a sample of languages in the Translation
Graph. Develop a mapping between the POS tags in these resources, perhaps
generalizing away from the most fine-grained distinctions. Project POS tags
from one language to another (or to others) and then measure the precision
and recall of [pos, lemma] pairs against the gold standard. Phase two:
use two languages as input languages, or look into using language-internal
morphological cues as an additional data source.
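To make a couple of the ideas above more concrete, here are two minimal
Python sketches. Neither is a required design, and all data in them (training
snippets, rule tables, exception entries) is invented for illustration only.

First, the character n-gram stage of the language identification idea: train
one smoothed character n-gram model per language, then label new text with
the language whose model scores it highest.

    import math
    from collections import Counter

    def char_ngrams(text, n=3):
        """Character n-grams of a string, with whitespace normalized."""
        text = " ".join(text.lower().split())
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def train(labeled_texts, n=3):
        """Return {language: Counter of character n-gram counts}."""
        models = {}
        for lang, text in labeled_texts:
            models.setdefault(lang, Counter()).update(char_ngrams(text, n))
        return models

    def classify(text, models, n=3):
        """Pick the language whose (add-one smoothed) n-gram model gives
        the highest log-probability to the input text."""
        best_lang, best_score = None, float("-inf")
        for lang, counts in models.items():
            total = sum(counts.values())
            vocab = len(counts) + 1
            score = sum(math.log((counts[g] + 1) / (total + vocab))
                        for g in char_ngrams(text, n))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

    # Toy training data; a real project would use substantial held-out text.
    training = [("en", "the cat sat on the mat and the dog barked"),
                ("es", "el gato se sentó en la alfombra y el perro ladró")]
    models = train(training)
    print(classify("the dog sat on the mat", models))   # expected: en

Second, the transliteration idea: stage one applies only completely regular
character rules, while stage two adds a word-level exception list for forms
the regular rules get wrong (here, the convention that word-initial Greek
"μπ" is rendered "b").

    # Stage 1: regular character-level rules (toy Greek-to-ASCII mapping,
    # unaccented input assumed; the rule set is illustrative, not complete).
    REGULAR_RULES = {"α": "a", "β": "v", "ε": "e", "κ": "k", "λ": "l",
                     "μ": "m", "ν": "n", "ο": "o", "π": "p", "ρ": "r",
                     "σ": "s", "ς": "s", "τ": "t", "θ": "th"}

    # Stage 2: word-level exceptions that override the regular rules.
    EXCEPTIONS = {"μπανανα": "banana"}

    def transliterate(word, use_exceptions=False):
        """Stage 1 uses only the regular rules; stage 2 checks exceptions first."""
        if use_exceptions and word in EXCEPTIONS:
            return EXCEPTIONS[word]
        return "".join(REGULAR_RULES.get(ch, ch) for ch in word)

    print(transliterate("μπανανα"))                       # stage 1: mpanana
    print(transliterate("μπανανα", use_exceptions=True))  # stage 2: banana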
Project presentation and write-up
- Initial project plan due 4/20, specifying the task to be attempted,
the data to be used, the baseline, and the means of measuring precision,
recall, and any additional metrics.
- Phase 1 results, together with an outline of the project write-up,
due 5/25.
- Presentation/demonstration of your working system in class 5/30 or
6/1.
- Final project in executable state (i.e., so we can run it on patas
and see the results) + write-up due 2:30pm on 6/4.
- Write-up requirements (8 pages, double-spaced):
- Background: What is the problem, how are you approaching it,
and what are your two stages?
- Data: What data are you using, where did you get it, what is
your gold standard, how is your data divided between development and test?
- Methodology, stage 1: How did your Stage 1 system work?
- Results: What is your baseline? Precision and Recall for baseline
and stage 1 system, comparison to baseline, and f-measure or
additional measure if applicable. Your presentation
of your results should include a table showing P and R for baseline,
stage 1, and stage 2. NB: Even
if your system returns a label for every input, you can still calculate
precision and recall separately by considering one class (in a binary
system) as the target. (For N-way classifications, you can calculate
P and R for each label by looking at that label vs. everything else;
see the sketch after this list.)
- Methodology, stage 2: What did you change?
- Results: Precision and Recall for stage 2, comparison to
stage 1 and baseline.
- Detailed instructions explaining how we can
run your code on patas to make sure it runs. Note that we only need
to run the testing (or decoding) step, not any training steps. If your
system runs on condor, include a condor submit file. If your system is
going to take more than a few minutes to run, warn us of that in the
write-up.
- Discussion: What are the implications of this project for
broader inquiry in computational linguistics?
- Bibliography: For any resources you are using
(corpora, toolkits, etc) you should include a proper citation, both
in the text (as (Author, year)) and in the bibliography. Likewise for
any works cited.
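As an illustration of the one-vs-rest calculation mentioned under Results,
here is a minimal Python sketch that computes P and R for every label of an
N-way classifier by treating each label in turn as the target class. The
three-way tweet subtypes and toy predictions are invented for illustration.

    def per_label_pr(gold, predicted):
        """Precision and recall for each label in an N-way classification,
        treating that label as the target and everything else as non-target."""
        results = {}
        for label in sorted(set(gold) | set(predicted)):
            tp = sum(1 for g, p in zip(gold, predicted) if p == label and g == label)
            fp = sum(1 for g, p in zip(gold, predicted) if p == label and g != label)
            fn = sum(1 for g, p in zip(gold, predicted) if p != label and g == label)
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            results[label] = (p, r)
        return results

    gold      = ["event", "opinion", "event", "question", "opinion"]
    predicted = ["event", "event",   "event", "question", "opinion"]
    for label, (p, r) in per_label_pr(gold, predicted).items():
        print(f"{label}: P={p:.2f} R={r:.2f}")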
Note: 50% of the project grade will be allocated to the write-up.
The other 50% will be for the project, but we can only understand
the project via the write-up. (We will not be digging through everyone's
source code.) In other words: the write-up is very important.
If you are submitting files as an archive to Canvas, please:
- Submit the write-up separately, as .pdf.
- Use zip or gzip, not rar or other more exotic archiving systems.
Group work
- You are encouraged to work in pairs.
- For partner projects:
- The project plan must include a description of how the work will be allocated.
- The project write-up must include a clear description of who did what.