# Linguistics/CSE 472
# Autumn 2004
# Assignment 2

# Build a small FST for the verbal suffixes we're going to use,
# and store it in the variable suffixes.  Note that the suffixes
# all begin with a morpheme boundary symbol, and that one of them
# is the empty suffix: the boundary symbol followed by the empty string.

define suffixes [ %+ e d | %+ i n g | %+ s | %+[]];
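
# A quick sanity check (a sketch, not part of the original assignment):
# push a copy of suffixes onto the stack, list the strings it accepts,
# and pop it again so the stack is left as it was.  The list should
# contain +ed, +ing, +s and a bare +.

read regex suffixes;
print words
pop stack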

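# Build a small FST for a list of words which will be our
# roots, and store it in the variable verbs.  "read text" pushes
# the network built from the word list onto the stack, and "define"
# with no regular expression takes the top network off the stack.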

read text verb_lexicon
define verbs

# Build an FST for underlying forms, i.e., all concatenations of
# verbs and suffixes.

define underlying [verbs suffixes];

# Print out the underlying forms to a file.

read regex underlying;
write text > underlying
pop stack

# Define one rule which maps underlying "e+e" to "e" and
# underlying "e+i" to "i".  It also maps "e+" to just "e" when the
# "+" is followed by a single character other than e or i, or by
# nothing at all (the empty suffix).

# In Rule1, the first disjunct says (effectively) "remove e and + if
# the next thing is an e or an i."  The second disjunct says
# "remove a + if it's between an e and something that's not an e or an
# i".  The final two let everything else through unchanged, where
# everything else is strings where there's a + that is preceded by
# something other than e, or, for completeness, strings with no +.

# I've chosen to get rid of the + boundary symbols here because
# of a rule ordering problem: I want the k rule to be later, but
# this rule could produce c+[e|i] sequences if I didn't get rid
# of the + sign.

# Note that % is an escape character, so that %+ is the literal
# plus sign.  ? is a wildcard, matching any character.  0 is epsilon.
# \[e|i] means a single character which is neither e nor i. () indicate
# optionality. 

define Rule1 [ [ ?* e:0 %+:0 [e|i] ?*] | 
               [ ?* e %+:0 (\[e|i])] | 
               [ ?* \e %+ ?* ] | 
               [ \[%+]* ] ];
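
# To see the four disjuncts at work, Rule1 can be applied to a few
# hand-picked strings.  This is a sketch added for illustration; the
# test strings are made up and need not be in verb_lexicon, since
# Rule1 is applied here on its own.  Expected results: baked, baking,
# bakes, bake, and walk+ed (unchanged, because the + does not follow
# an e).

read regex Rule1;
apply down bake+ed
apply down bake+ing
apply down bake+s
apply down bake+
apply down walk+ed
pop stack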

# XFST provides additional regular expression syntax that allows
# us to define a very similar rule, shown commented out below.  It's
# not quite the same because the || operator matches the upper tape
# context only, whereas Rule1 above requires the relevant context on
# both upper and lower tapes.  The difference won't matter for our
# current purposes.

# define Rule1 [ e %+ -> 0 || _ [e|i] ,, %+ -> 0 || e _ \[e|i] ];

# This notation is intentionally very close to ordinary linguistic
# rewrite rules.  The point of this exercise is to understand how
# that can actually be handled in terms of regular relations by
# explicitly not using the easier notation.

# Compose the underlying forms network with the Rule1 network.

define onerule [ underlying .o. Rule1 ];

# Print the forms that are part of the lower language of onerule
# to a file.

read regex onerule.l;
write text > onerule
pop stack

# Define another rule which maps "_c+ed" and "_c+ing" in the
# underlying form to "_ck+ed" and "_ck+ing" just in case _ is
# a vowel.  Pass all other forms through unchanged.

# This is the rule you need to modify for the assignment.  In order
# for the script to start in a working state, I've put in a 
# dummy definition of the rule here: in this initial version, the
# rule is just the identity relation; all strings are mapped to themselves.

define Rule2 ?*;

# Compose the new rule with the network so far.

define tworules [ onerule .o. Rule2 ];

# Print the forms that are part of the lower language of tworules
# to a file.

read regex tworules.l;
write text > tworules
pop stack

# Remove any remaining morpheme boundaries.

define Rule3 [ [ ?* %+:0 ?* ] | [ \[%+]* ] ];
define threerules [ tworules .o. Rule3 ];
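
# Rule3 can be checked on its own in the same way (again a sketch
# added for illustration, with made-up test strings).  The first
# string should come out as walked; the second has no + and should
# pass through unchanged.

read regex Rule3;
apply down walk+ed
apply down baking
pop stack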

# Print the forms that are part of the "surface" language of
# the final composed transducer to a file.

read regex threerules.l;
write text > threerules
pop stack

# Leave the new transducer on the stack so it can
# be played with interactively.

read regex threerules;
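
# Example interactive session (a sketch; it assumes verb_lexicon
# contains the root "bake", which this script does not guarantee).
# With threerules on the stack, and Rule2 still the identity
# relation, typing
#
#   apply down bake+ing
#
# should give baking, and
#
#   apply up baked
#
# should give results that include bake+ed.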