For the majority of the world’s languages, linguistic resources (e.g., annotated corpora and parallel data) are very limited. Consequently, supervised methods and many unsupervised methods cannot be applied directly, leaving these languages largely untouched and unnoticed. Another crucial issue, which has received less attention from the natural language processing (NLP) community, is that to date very few studies have examined a large number of languages or incorporated cross-lingual information into NLP systems. As a result, languages are researched and processed in isolation rather than being viewed as members of larger language families.
This project has two intertwined goals. The first goal is to create a framework that allows the rapid development of resources for resource-poor languages (RPLs). We will accomplish this goal by bootstrapping NLP tools with initial seeds created by projecting syntactic information from resource-rich languages to RPLs. The second goal is to use the automatically created resources to conduct cross-lingual studies on a large number of languages in order to discover linguistic knowledge. This knowledge will not only deepen our understanding of languages, but also provide additional information that can be fed back into the bootstrapping module to produce better NLP tools.
Previous research on unsupervised learning often requires resources such as parallel data and large lexicons, which many RPLs lack. In this project, we require only two types of resources: monolingual data and interlinear glossed text (IGT). We explore two key ideas in this proposal. The first idea is to take advantage of resource-rich languages by using them to create seeds for bootstrapping NLP tools. The second idea is to identify the relations between languages and use that information to aid machine learning. Both ideas point in the same direction: languages are related to one another and should be treated as such.
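The seed-creation idea above can be illustrated with a small sketch of tag projection through IGT. An IGT instance typically has a language line, a gloss line aligned token-by-token with it, and a free translation; because the gloss is (near-)English, tags assigned to gloss tokens can be projected onto the RPL tokens. The example below is purely illustrative: the Welsh sentence, the `en_tags` dictionary (a stand-in for a supervised English tagger), and the `project_tags` helper are all hypothetical, not part of the proposed system.

```python
# Minimal sketch: project POS tags from IGT gloss tokens to the language line.
# All data and names here are illustrative assumptions; a real pipeline would
# tag the translation line with a trained tagger and align tokens statistically.

# One IGT instance: language line, gloss line, free translation.
igt = {
    "lang":  ["Rhoddodd", "yr", "athro", "lyfr", "i'r", "bachgen"],
    "gloss": ["gave-3sg", "the", "teacher", "book", "to-the", "boy"],
    "trans": "The teacher gave a book to the boy",
}

# Hypothetical English tag dictionary standing in for a supervised tagger.
en_tags = {"gave": "VERB", "the": "DET", "teacher": "NOUN",
           "book": "NOUN", "to": "ADP", "boy": "NOUN", "a": "DET"}

def project_tags(igt, en_tags):
    """Assign each language-line token the tag of its aligned gloss token.

    Gloss tokens are aligned 1:1 with the language line; a hyphenated gloss
    such as 'gave-3sg' or 'to-the' takes the tag of its first lexical part.
    Unknown glosses receive the placeholder tag 'X'.
    """
    tags = []
    for gloss in igt["gloss"]:
        head = gloss.split("-")[0].lower()
        tags.append(en_tags.get(head, "X"))
    return list(zip(igt["lang"], tags))

print(project_tags(igt, en_tags))
```

Projected pairs like these would serve as noisy seed annotations for bootstrapping a tagger on monolingual RPL data, rather than as gold-standard labels.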