Fei Xia - Software

Software

This page lists corpora and NLP systems that my colleagues and I have built.

The ODIN database: A collection of Interlinear Glossed Text (IGT) covering more than a thousand languages.
The HNZ corpus: An Archaic Chinese corpus consisting of all the articles in the book of Huainanzi, annotated with word segmentation and POS tags.
The Hindi/Urdu Treebank: A multi-layer, multi-representational treebank for Hindi/Urdu. The Treebank includes dependency structure, phrase structure, and PropBank annotation for 400K words of Hindi and 150K words of Urdu.
The Chinese Penn Treebank: One of the most commonly used treebanks for Chinese NLP. It currently contains manually annotated parse trees for 1.2 million words. The corpus is released through the Linguistic Data Consortium (LDC).

Last modified on 6/17/2014