Corpus resources:
Corpora and electronic text databases

This page contains links to lists of available corpora and descriptions of individual corpus projects. Because of the nature of the WWW, there is considerable overlap between some of the lists. Some of the corpora linked to here are freely available; others are available only for a fee.

NOTE: This page is not actively maintained. For more up-to-date information, you might try the ACL wiki page on resources by language.

Jump to:

Lists of corpora

LDC catalogue
W3C list of corpora
ELRA catalogue, written
ELRA catalogue, spoken
Tractor collection of texts
Linguistic exploration list of resources
ICAME International Computer Archive of Modern and Medieval English
Prof. Matsumoto's list of language resources (This page is in Japanese and is a good resource for information on Japanese corpora.)

Pages for specific corpora, by language:

Multilingual
Modern English
Earlier English
Basque
Catalan
Czech
French
Galician
German
Hebrew
Italian
Japanese
Norwegian
Portuguese
Russian
Serbo-Croatian
Spanish
Slovene
Turkish

Multilingual

CHILDES: Child Language Data Exchange System
Silfide wordlists Word frequency lists for English, French and German based on the Silfide corpora.
Silfide Texts in French, English, German, Danish, Italian, Spanish and Portuguese. Some of these texts appear to be translations, and many appear in more than one language.
UPF corpus Texts (law, environment, medicine, economy and IT) in Catalan, Spanish, French, English and German. (Online tools and demos) (Information in English)
Corpus Linguistico da Universidade de Vigo (Parallel Corpora for Galician and English/French/Spanish; also Spanish/Basque, English/Portuguese, and English/Spanish) (English page)
EuroWordNet A multilingual database with wordnets for several European languages.

Modern English

ICAME International Computer Archive of Modern and Medieval English. (Includes the Brown and LOB corpora, as well as the Frown and FLOB corpora, and many others.)
BNC-WORLD The British National Corpus is finally available outside the UK. ... and it is indexed.
The COBUILD Bank of English
American National Corpus (The ANC is in the process of being built. This link gives information on the project.)
Information about the Santa Barbara Corpus of Spoken American English Part I is now available from the LDC
ACE: Australian Corpus of English
CSPAE: Corpus of Spoken Professional American English
The Bergen Corpus of London Teenage Language (COLT)
LLC: London-Lund Corpus of spoken British English
WSC: Wellington Corpus of Spoken New Zealand English
WC: Wellington Corpus of Written New Zealand English
Kolhapur Corpus of Indian English
The CMU Pronouncing Dictionary
The Michigan Corpus of Academic Spoken English
International Corpus of English (ICE)
Computational Linguistics Group, University of Wolverhampton 30,000 word corpus of English technical manuals, annotated for coreference and anaphora.
HCRC Map Task Corpus XML annotations
Electronic text databases:
- Project Gutenberg
- Oxford Text Archive
- University of Virginia's Electronic Text Center
- European Manuscript Server Initiative
- UConn's list of electronic text centers
- The Internet Archive (As of 5/2001, the Internet Archive has over 43 Terabytes of data, primarily web pages.)

Earlier English

The Penn-Helsinki Parsed Corpus of Middle English
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
Helsinki Corpus of English (Diachronic)
Lampeter Corpus of Early Modern English Tracts
Some of the electronic text databases listed above also contain texts from earlier periods.

Basque

Corpus Linguistico da Universidade de Vigo Includes a Basque/Spanish parallel corpus. (English page)
BasWN Basque WordNet
EDBL Lexical database of Basque

Catalan

CUB: El Corpus de Català Contemporani de la Universitat de Barcelona
Institut d'Estudis Catalans (IEC) 52 million word corpus of contemporary Catalan. (IEC also has a 21 million word PAROLE corpus.)

Czech

Czech National Corpus

French

ARTFL: Project for American and French Research on the Treasury of the French Language
Frida Corpus of learner French.

Galician

Corpus Linguistico da Universidade de Vigo (Parallel Corpora for Galician and English/French/Spanish) (English page)
Centro Ramón Piñeiro A corpus and other resources on Galician.
CORPORA DE LÍNGUA GALEGO-PORTUGUESA (Information about Galician and Portuguese corpora)
Corpus de referencia do galego actual
Corpus documentale latinum Gallaeciae

German

COSMAS (A large, searchable corpus of German, including some diachronic and spoken collections, as well as some matched East German/West German collections. It's possible to use part of this corpus for free, in sessions that are limited to 60 minutes.)
COSMAS wordlist 30,000 most frequent forms in the COSMAS corpus.
Project Gutenberg included 84 German texts as of 12/5/2000.
LAPT & DA Word and morpheme frequency lists for German, based on 7 corpora.

Italian

Badip Banco dati dell'Italiano parlato (databank of spoken Italian)

Hebrew

Japanese

EDR Electronic Dictionary of Japanese, organized into 11 sub-dictionaries.
IPAL Information-technology Promotion Agency Lexicon of the Japanese Language. (The link given here is for a page with Japanese on top and English further down.)
Electronic text databases:
Morphological analyzer (tokenizer and POS tagger):
- ChaSen

Norwegian

The English-Norwegian Parallel Corpus

Portuguese

European Portuguese news text
CETEMPúblico A large corpus of Portuguese newspaper language.
COMPARA Portuguese-English Parallel Translation Corpus (approx 65,000 words, web-searchable)
CORPORA DE LÍNGUA GALEGO-PORTUGUESA (Information about Galician and Portuguese corpora)

Slovene

The Slovene/English parallel corpus 1 million words of parallel Slovene-English / English-Slovene texts.

Russian

Tuebinger russische Korpora Russian Corpora, searchable via the web. Currently (5/2001) consists of the Uppsala corpus and a corpus of interviews from Russian online newspapers.
Subscribe.Ru Links to a subscription list about a collection of various dictionaries. The subscription list is searchable.
ssu.komi.com A collection of dictionaries on all sorts of subjects. Online search of a collection of science fiction stories.
vault.agava.ru Another collection of dictionaries (work in progress).

Serbo-Croatian

Yugoslav Corpus ~700,000 words of modern Yugoslav fiction, representing all of the Serbo-Croatian speaking areas of the former Yugoslavia. (Archived at CRL.)

Spanish

Corpus of Historical Spanish Texts (1200s-1900s, web-searchable)
Parallel Text in English and Spanish Texts from the Pan American Health Organization (Archived at CRL.)

Turkish

TS corpus
Turkish NLP Initiative
Turkish Text News feed from the Anatolian News Agency from September of 1992 plus other texts from popular sources. (Archived at CRL.)

Other Databases

US Census list of names with frequencies.

Emily M. Bender (bender at csli dot stanford dot edu)
Last modified: June 17, 2004
