Software
This page lists corpora and NLP systems that my colleagues and I have built.
Corpora
- The ODIN database: A collection of Interlinear Glossed Text (IGT) covering more than a thousand languages.
- The HNZ corpus: An Archaic Chinese corpus consisting of all the articles in the book Huainanzi, annotated with word segmentation and POS tags.
- The Hindi/Urdu Treebank: A multi-layer, multi-representational treebank for Hindi/Urdu. The Treebank includes dependency structure, phrase structure, and PropBank annotation for 400K words of Hindi and 150K words of Urdu.
- The Chinese Penn Treebank: One of the most commonly used treebanks for Chinese NLP. It currently contains manually annotated parse trees for 1.2 million words. The corpus is released via LDC.
NLP systems
Last modified on 6/17/2014