Ling/CSE 472: Assignment 2: Morphology

Due April 14th, by 11:59pm

This assignment involves a little bit of coding with xfst. You'll need to turn in some code and results files, which should be submitted via Canvas.

Problem 1. Morphology and FSTs

(Adapted form of:) Problem 3.3 (p.81)

Using xfst, write a finite-state transducer that can generate and analyze a small set of verbs in all of their inflected forms. This FST will handle two spelling change rules: the rule that deletes a final e before -ing or -ed, and the rule that inserts a k when c appears between a vowel and -ing or -ed. The first rule is provided already. Your job is to write the second.

Note: xfst defines a language for regular expressions which makes it relatively easy to write morphophonological rewrite rules. For this problem, however, you must stay with the basic operators. No credit will be given for answers that use the xfst operator -> or its kin. On the other hand, if you get stuck, you might find it helpful to write the rule in that notation, and then examine the network that xfst produces.

To do this assignment, you'll need the following two files:

Copy them into your home directory on Patas: log in to Patas and run 'wget path_to_file' for each file. If you're using Windows to download them, make sure that it doesn't add any new file extensions.

verb_lexicon is the lexicon of verbs (in citation form) that we'll be working with.
k.xfst is the xfst script that does the work. It is the file you'll need to modify for this part of the assignment.

To start xfst, log onto Patas and type "xfst" (your $path variable should already be set appropriately). You'll get an xfst prompt.

To run the script, enter:

source k.xfst

After you've run the script, there should be an FST on the stack. To apply that FST, try:

apply up spruced

apply down picnic+ing

Observe that it doesn't yet have the right behavior in the second example.
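As a reference point for what "the right behavior" means, the two spelling rules described above can be sketched in plain Python. This is not xfst and not a solution to the assignment. It is only an oracle for checking expected surface forms. The function name `surface` is hypothetical, and the sketch assumes that `+` marks the morpheme boundary (as in the `apply down` example) and that the boundary is removed on the surface.

```python
VOWELS = "aeiou"

def surface(underlying: str) -> str:
    """Map an underlying form like 'picnic+ing' to its expected spelling.

    Both rules are tested against the underlying stem, so one rule
    cannot feed the other (e.g. 'spruce+ed' must not become 'sprucked').
    """
    stem, _, suffix = underlying.partition("+")
    if suffix in ("ing", "ed"):
        if stem.endswith("e"):
            # Rule 1: delete a final e before -ing or -ed.
            stem = stem[:-1]
        elif len(stem) >= 2 and stem[-1] == "c" and stem[-2] in VOWELS:
            # Rule 2: insert k when c sits between a vowel and -ing/-ed.
            stem += "k"
    # Assumed: the morpheme boundary does not appear on the surface.
    return stem + suffix

print(surface("spruce+ed"))   # spruced
print(surface("picnic+ing"))  # picnicking
```

Comparing these strings with the output of `apply down` is one quick way to see whether a modified k.xfst is behaving as intended.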

Modify k.xfst until it has the right behavior. The files produced by the script (underlying, onerule, tworules and threerules) should be helpful in testing it as you go. You can also use apply up and apply down to observe the behavior of the network. Here is a short summary of xfst syntax.

Hint: If you have trouble, try commenting out the first rule and then writing a very simple rule in its place, to make sure you can at least observe some change happening. Make sure you understand the function of the ':' and the concepts of upper and lower tape.

To examine a network, type:

print net

The network defined in k.xfst is too large to be usefully examined like this, but you might try some others:

read regex [a b c];

print net

read regex [a+ b c];

print net

read regex [e %+ -> 0 || _ [e|i] ];

print net

Turn in

k.xfst
This file, as modified by you.
threerules
This file is output when the script k.xfst is run.
xfst_writeup.pdf
Write up: In 3-4 paragraphs, in a pdf file, answer the following questions:
1. Would it be easier to write a complete morphophonological analyzer for English represented orthographically (i.e. in standard spelling, where the lower tape has standard orthographical forms) or phonetically (where the lower tape has pronunciations in IPA)?
2. For each case, what representation would you choose for the upper tape, and why?
3. What are some applications that would require each kind of morphophonological analyzer for English, and what would happen if you fed them the wrong type?

Problem 2. Stress Movement in Spanish

Text-to-speech systems rely on large pronunciation dictionaries in combination with rules for unknown words, or for large morphologies. Since English doesn't have a very complex morphology, we will use stress patterns in Spanish to demonstrate the need for pronunciation rules. Spanish has a fairly simple pattern of stress assignment. It is, roughly:

Morphemes with pre-assigned stress keep that stress. (see below)
Words that end in a vowel, /s/, or /n/ have stress on the penult.
Otherwise, stress falls on the ultima.

There are complications to this (e.g., some clitics are extra-metrical), but they are not relevant for this assignment. We will focus on stress in verbs. Because Spanish has many verb forms, it can be simpler to use these rules to determine which syllable of a verb is stressed than to encode stress separately for every possible verb form. For the first part of this assignment, we will focus on the present-tense indicative forms for one of the verb conjugation classes, plus the infinitive form. Your job is to write an FST that takes in morpheme-segmented verbs and outputs a verb form with stress assigned to the correct syllable. We will mark stress with a ' preceding the syllable's vowel(s). The table below gives the verb endings we are considering:

+o	1st.SG	+amos	1st.PL
+as	2nd.SG	+'ais	2nd.PL
+a	3rd.SG	+an	3rd.PL
+ar	INF

Note that the 2nd person plural morpheme +'ais comes with stress. This means that the ending "-ais" is always the stressed syllable in the word. Refer to the rules given above to work out the stress pattern for the other forms. For the root habl (speak), the two sides of the FST should look like:

habl+o	<-->	h'ablo
habl+'ais	<-->	habl'ais

That is, your FST needs to concatenate the roots given in the root lexicon (see below) with all possible endings, and then run the result through one or more stress-assignment rules to output a string with the stressed syllable marked with a '. Note that this means that (with the exception of the 2nd person plural) the stress mark appears on only one side of the transducer. Also make sure your FST takes each letter as a separate arc (that is, each arc should be one character, not "'ais" as a single transition!).
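The stress rules above can be sketched as a small Python function, which may be useful for predicting what the lower side of the FST should contain. This is only an illustration, not the xfst implementation the assignment asks for. The name `assign_stress` is hypothetical, and the sketch makes one simplifying assumption: every maximal run of vowels counts as a single syllable nucleus (so "ai" in "ais" is one syllable, and hiatus is not handled).

```python
import re

VOWELS = "aeiou"

def assign_stress(segmented: str) -> str:
    """Turn a morpheme-segmented form like 'habl+o' into "h'ablo"."""
    word = segmented.replace("+", "")
    # Rule 1: morphemes with pre-assigned stress keep that stress.
    if "'" in word:
        return word
    # Start index of each vowel run (assumed: one run = one syllable).
    nuclei = [m.start() for m in re.finditer(f"[{VOWELS}]+", word)]
    if not nuclei:
        return word
    if word[-1] in VOWELS + "sn" and len(nuclei) >= 2:
        # Rule 2: final vowel, s, or n puts stress on the penult.
        pos = nuclei[-2]
    else:
        # Rule 3: otherwise stress falls on the ultima.
        pos = nuclei[-1]
    # The mark goes immediately before the stressed syllable's vowel(s).
    return word[:pos] + "'" + word[pos:]

for form in ["habl+o", "habl+'ais", "habl+amos", "habl+ar"]:
    print(form, "<-->", assign_stress(form))
```

The first two lines printed match the `habl+o` and `habl+'ais` pairs shown above. The other forms follow from the penult and ultima rules.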

Your Task

Using what you learned in Problem 1, write an xfst script that handles the 7 morphemes given above (present tense indicative, plus infinitive) for -ar verbs in Spanish. The verbs you should encode are given in span_lexicon_ar and are:
- habl (speak)
- cant (sing)
- trabaj (work)
This script should output its network to a file by executing a command such as write text > arVerbs. Use k.xfst as a model for writing this file.
Name your xfst script span1.xfst.
Extend span1.xfst to include one other phenomenon of Spanish. Either
- Include the present indicative for 3 -er verbs (the conjugation can easily be found online). Use the -er verbs "com" (eat), "romp" (break), and "respond" (answer), loaded from span_lexicon_er. You will need to load the two lexicons separately in your xfst script, to keep the -ar and -er verbs apart.
- Add the 6 preterite indicative conjugations for the -ar verbs above (again, the endings can be found easily online!).
This script should output its network to file in the same way. Hint: if you code span1.xfst well, the changes should be minimal.
Name this script span2.xfst.

XFST Miscellany

You may find it useful to use the 'explode' operators { }. In xfst regular expressions, characters not separated by spaces are treated as multicharacter symbols. This is not what you want. Therefore, (part of) your regular expressions will either have to look like this:
```
[ c a t | d o g ]
```
Or, using the 'explode' operators, like this:
```
[ {cat} | {dog} ]
```
Recall that % is the escape character, and that most punctuation marks have a special meaning in xfst and therefore need to be escaped. A short summary of xfst syntax can be found here.
You are welcome to use the xfst rule notations for problem 2. In particular, note the following:
```
[ A -> B || C _ D ]
```
This is the xfst form for "A is rewritten as B when it occurs between C and D" (where C and D refer to the upper tape context). A, B, C and D are all regular expressions.
Rules that run in parallel are separated by ,, (two commas):
```
[ A -> B || C _ D ,, E -> F || G _ H ]
```
Rules of epenthesis (which insert something where nothing was before) are written like this:
```
[ [..] -> A || B _ C ]
```
This can be read as "nothing goes to A in the context B _ C". If you used 0 (epsilon) instead of [..] in this rule, it would try to insert an infinite number of As between B and C, because there are an infinite number of empty strings between B and C.
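For readers more familiar with general-purpose regular expressions, the two rule shapes above have a rough analogue in Python's `re` module. This is a sketch for intuition only: the helper names `rewrite` and `epenthesize` are hypothetical, the arguments are literal strings rather than full regular expressions, and xfst's handling of overlapping contexts and tape levels is not modeled.

```python
import re

def rewrite(a: str, b: str, left: str, right: str, s: str) -> str:
    """Rough analogue of [ a -> b || left _ right ] for literal strings."""
    pattern = f"(?<={re.escape(left)}){re.escape(a)}(?={re.escape(right)})"
    return re.sub(pattern, b, s)

def epenthesize(a: str, left: str, right: str, s: str) -> str:
    """Rough analogue of [ [..] -> a || left _ right ].

    The pattern matches the empty position between left and right
    exactly once per site, so a single copy of `a` is inserted there.
    """
    pattern = f"(?<={re.escape(left)})(?={re.escape(right)})"
    return re.sub(pattern, a, s)

print(rewrite("a", "b", "c", "d", "cad"))   # cbd
print(rewrite("a", "b", "c", "d", "xad"))   # xad (context not met)
print(epenthesize("a", "b", "c", "bcbc"))   # bacbac
```

The lookbehind and lookahead play the role of the left and right contexts: they constrain where the change happens without being consumed or rewritten themselves.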
You may find it convenient to define phonologically meaningful units, like consonants, vowels, etc.
Make sure to write a rule which removes the morpheme boundary, so that it does not appear in the surface forms.
Combine the above into an xfst script similar to the one we used for problem 1. The script should create one FST composed out of the underlying forms regular expression and the rules.
Test your FST by loading it and then calling "print lower". This will print the strings of the lower-tape language. (You can use "print upper" to see the upper tape language.) You may find it convenient to put this command at the end of your script.
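When checking the printed upper-tape language by eye, it can help to have the full list of expected underlying forms. The sketch below builds that list in Python as the cross product of the three -ar roots and the seven endings from the table. The names `ROOTS`, `ENDINGS`, and `underlying_forms` are hypothetical, and `+` is assumed to be the morpheme boundary symbol.

```python
from itertools import product

# Roots from span_lexicon_ar and endings from the table in Problem 2.
ROOTS = ["habl", "cant", "trabaj"]
ENDINGS = ["o", "as", "a", "amos", "'ais", "an", "ar"]

def underlying_forms(roots=ROOTS, endings=ENDINGS):
    """Return every root+ending combination, sorted for easy comparison."""
    return sorted(f"{root}+{ending}" for root, ending in product(roots, endings))

forms = underlying_forms()
print(len(forms))  # 21
for form in forms:
    print(form)
```

If the upper-tape listing from xfst has more or fewer than 21 strings, or contains strings not in this list, the lexicon or the endings expression is the first place to look.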

Turn in

span1.xfst
Your xfst script containing the FST which accepts/generates all 21 verb forms (present indicative + infinitive) for hablar, cantar, and trabajar.
span2.xfst
Your xfst script that extends span1.xfst as described above.
span_writeup.pdf
The write up, containing:
1. Which version of span2.xfst you implemented.
2. A description of the Spanish facts your span2.xfst handles.
3. A description of how you implemented stress movement.
4. Any benefits and shortcomings of using an FST to model stress.
5. Any issues you had doing the assignment.
