This page includes a list of corpora and NLP systems that my colleagues and
I have built in the past.
- The ODIN database: A collection of Interlinear Glossed Text (IGT) covering more than a thousand languages.
- The HNZ corpus: An Archaic Chinese corpus consisting of all the articles in the book of Huainanzi with word segmentation and POS tagging annotation.
- The Hindi/Urdu Treebank: A multi-layer, multi-representational treebank for Hindi/Urdu. The treebank includes dependency structure, phrase structure, and PropBank annotation for 400K words of Hindi and 150K words of Urdu.
- The Chinese Penn Treebank: One of the most commonly used treebanks for Chinese NLP. It currently contains manually annotated parse trees for 1.2 million words. The corpus is released via
modified on 6/17/2014