UW Computational Linguistics Laboratory

Corpora List and Access Policies

To ensure compliance with the licenses for the corpora we have installed, we have instituted the following policies.

  1. Compling laboratory members are granted access to corpora solely for coursework and research projects in the context of their affiliation with the UW.
  2. Corpora may not be copied from the servers, nor used in commercial applications.
  3. Many of the corpora have additional licensing conditions (see links in the list below). Before you access any particular corpus, you are responsible for reading and understanding the license, in addition to the general membership agreement.
  4. For some of the corpora (marked "Restricted access" in the list below), we must maintain a list of individuals granted access and/or have each user sign an individual license agreement. To access these corpora, contact the lab director to obtain read permissions on the relevant directories.
  5. Whenever you use a corpus for coursework or for a paper, cite the corpus among your references. The proper citation information should be available in the license or README file of the corpus.
  6. Failure to follow these policies could result in loss of access to the corpora, or to the lab/servers in general.
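
As an illustration of policy 4, the following is a minimal sketch of how you might check whether your account already has read permissions on a corpus directory before contacting the lab director. The function name `can_read_corpus` and the directory path are hypothetical; this page does not state where the corpora are installed, so substitute the actual location.

```python
import os


def can_read_corpus(corpus_dir: str) -> bool:
    """Return True if the current user can enter and list `corpus_dir`."""
    # A directory needs read permission (to list entries) and
    # execute permission (to enter it and open files inside).
    return os.path.isdir(corpus_dir) and os.access(corpus_dir, os.R_OK | os.X_OK)


if __name__ == "__main__":
    # Hypothetical location: replace with the real path of the corpus you need.
    path = "/path/to/corpora/LDC95T21"
    if can_read_corpus(path):
        print(f"Read access confirmed: {path}")
    else:
        print(f"No read access to {path}; contact the lab director.")
```

A negative result only means that the directory is missing or unreadable for your account; it does not replace reading the license or signing any required agreement.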

Corpora we are installing

Title | LDC Catalogue number | Restricted access | Language(s) | License link
Callhome Egyptian Arabic Transcripts Supplement | LDC2002T38 | | Arabic | general
Arabic Gigaword | LDC2003T12 | | Arabic | general
Arabic Treebank: Part 2 v 2.0 | LDC2004T02 | | Arabic | general
Arabic Treebank: Part 3 v 1.0 | LDC2004T11 | | Arabic | general
Arabic News Translation Text Part 1 | LDC2004T17 | | Arabic | general
Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) | LDC2005T02 | | Arabic | general
CALLHOME Egyptian Arabic Transcripts | LDC97T19 | | Arabic | general
TIDES Extraction (ACE) 2003 Multilingual Training Data | LDC2004T09 | | Arabic, Chinese, English | general
ACE 2004 Multilingual Training Corpus | LDC2005T09 | | Arabic, Chinese, English | general
Arabic Treebank Part 1 - 10K-word English translation | LDC2003T07 | | Arabic, English | general
Multiple-Translation Arabic (MTA) Part 1 | LDC2003T18 | | Arabic, English | general
Arabic English Parallel News Part 1 | LDC2004T18 | | Arabic, English | general
Multiple-Translation Arabic (MTA) Part 2 | LDC2005T05 | | Arabic, English | general
TREC Mandarin | LDC2000T52 | | Chinese | specific
Chinese Gigaword | LDC2003T09 | | Chinese | general
Chinese Treebank 5.0 | LDC2005T01 | | Chinese | general
Mandarin Chinese News Text | LDC95T13 | | Chinese | specific
CALLHOME Mandarin Chinese Transcripts | LDC96T16 | | Chinese | general
TDT2 Multilanguage Text Version 4.0 | LDC2001T57 | | Chinese, English | general
TDT3 Multilanguage Text Version 2.0 | LDC2001T58 | | Chinese, English | general
Chinese-English Translation Lexicon (v3.0) | LDC2002L27 | | Chinese, English | general
Multiple-Translation Chinese Corpus | LDC2002T01 | | Chinese, English | general
SummBank 1.0 | LDC2003T16 | | Chinese, English | general
Multiple-Translation Chinese (MTC) Part 2 | LDC2003T17 | | Chinese, English | general
Multiple-Translation Chinese (MTC) Part 3 | LDC2004T07 | | Chinese, English | general
Hong Kong Parallel Text | LDC2004T08 | | Chinese, English | specific
Chinese News Translation Text Part 1 | LDC2005T06 | | Chinese, English | general
Chinese-English News Magazine Parallel Text | LDC2005T10 | | Chinese, English | general
Czech Broadcast News Transcripts | LDC2004T01 | | Czech | general
Prague Dependency Treebank 1.0 | LDC2001T10 | | Czech, English | general
Prague Czech-English Dependency Treebank Version 1.0 | LDC2004T25 | | Czech, English | general
Grassfields Bantu Fieldwork: Dschang Lexicon | LDC2003L01 | | Dschang | general
Grassfields Bantu Fieldwork: Dschang Tone Paradigms | LDC2003S02 | | Dschang | general
CELEX 2 | LDC96L14 | | Dutch, German, English | specific
Santa Barbara Corpus of Spoken American English Part-I | LDC2000S85 | | English | general
BLLIP 1987-89 WSJ Corpus Release 1 | LDC2000T43 | | English | specific
MUC 7 | LDC2001T02 | | English | general
Temporal Evaluation Examples | LDC2002E05 | | English | general
RST Discourse Treebank | LDC2002T07 | | English | general
The AQUAINT Corpus of English News Text | LDC2002T31 | | English | general
Santa Barbara Corpus of Spoken American English Part-II | LDC2003S06 | | English | general
ACE-2 Version 1.0 | LDC2003T11 | | English | general
MUC 6 | LDC2003T13 | | English | general
SLX Corpus of Classic Sociolinguistic Interviews | LDC2003T15 | | English | general
ANC First Release | LDC2003T20 | Restricted access | English | specific
Santa Barbara Corpus of Spoken American English III | LDC2004S10 | | English | general
Proposition Bank I | LDC2004T14 | | English | general
ACE Time Normalization (TERN) 2004 English Training Data v1.0 | LDC2005T07 | | English | general
English Gigaword Second Edition | LDC2005T12 | | English | general
CCGbank | LDC2005T13 | | English | general
HCRC Map Task Corpus | LDC93S12 | | English | general
ACL/DCI | LDC93T1 | | English | specific
North American News Text Corpus | LDC95T21 | Restricted access | English | specific
English Treebank 2 | LDC95T7 | | English | general
COMLEX Syntax Text Corpus Version 2.0 | LDC96T11 | Restricted access | English | specific
DSO Corpus of Sense-Tagged English | LDC97T12 | | English | general
CALLHOME American English Transcripts | LDC97T14 | | English | general
COMLEX English syntax Lexicon | LDC98L21 | Restricted access | English | specific
North American News Text Supplement | LDC98T30 | Restricted access | English | specific
Treebank-3 | LDC99T42 | | English | general
Hansard French/English | LDC95T20 | | French, English | general
European Language Newspaper Text | LDC95T11 | Restricted access | French, German, Portuguese | specific
UN Parallel Text (Complete) | LDC94T4A | | French, Spanish, English | specific
CALLHOME German Transcripts | LDC97T15 | | German | general
Japanese Business News Text | LDC95T8 | Restricted access | Japanese | specific
CALLHOME Japanese Transcripts | LDC96T18 | | Japanese | general
Japanese Business News Text Supplement | LDC99T34 | Restricted access | Japanese | specific
Korean Newswire | LDC2000T45 | | Korean | general
Korean Telephone Conversations Transcripts | LDC2003T08 | | Korean | general
Klex: Finite-State Lexical Transducer for Korean | LDC2004L01 | | Korean | general
Morphologically Annotated Korean Text | LDC2004T03 | | Korean | general
Korean English Treebank Annotations | LDC2002T26 | | Korean, English | specific
ECI Multilingual Text | LDC94T5 | | Multi | specific
Grassfields Bantu Fieldwork: Ngomba Tone Paradigms | LDC2001S16 | | Ngomba | general
Cetempublico | LDC2001T62 | | Portuguese | specific
Portuguese Newswire Text | LDC99T40 | | Portuguese | general
CALLHOME Spanish Dialogue Act Annotation | LDC2001T61 | | Spanish | general
Spanish Newswire Text, Volume 2 | LDC99T41 | | Spanish | general