Corpus resources:
Corpora and electronic text databases
This page contains links to lists of available corpora and
descriptions of individual corpus projects. Because of the
nature of the WWW, there is considerable overlap between some
of the lists. Some of the corpora linked to here are freely
available, others only for a fee.
NOTE: This page is not actively maintained. For
more up-to-date information, you might try the ACL wiki page
on resources by language.
- CHILDES: Child Language Data Exchange System
- Silfide wordlists Word frequency lists for English, French and German based on the Silfide corpora.
- Silfide
Texts in French, English, German, Danish, Italian, Spanish and
Portuguese. Some of these texts appear to be translations, and many
appear in more than one language.
- UPF
corpus Texts (law, environment, medicine, economy and IT)
in Catalan, Spanish, French, English and German. (Online tools and demos) (Information in English)
- Corpus Linguistico da Universidade
de Vigo (Parallel Corpora for Galician and English/French/Spanish; also Spanish/Basque, English/Portuguese, and English/Spanish) (English page)
- EuroWordNet A multilingual database with wordnets for several European languages
- COSMAS (A large, searchable corpus of German, including
some diachronic and spoken collections, as well as some matched
East German/West German collections. It's possible to use part of
this corpus for free, in sessions that are limited to 60 minutes.)
- COSMAS wordlist 30,000 most frequent forms in the COSMAS corpus.
- Project Gutenberg included 84 German texts as of 12/5/2000.
- LAPT & DA Word and morpheme frequency lists for German, based on 7 corpora.
- Badip Banco dati dell'Italiano parlato (databank of spoken Italian)
- EDR Electronic Dictionary of Japanese, organized into 11 sub-dictionaries.
- IPAL Information-technology Promotion Agency
Lexicon of the Japanese Language. (The link given here is for a page with Japanese on top and English further down.)
- Electronic text databases:
- Morphological Analyzer (tokenizer and POS tagger):
- Tuebinger
russische Korpora Russian Corpora, searchable via the web.
Currently (5/2001) consists of the Uppsala corpus and a
corpus of interviews from Russian online newspapers.
- Subscribe.Ru
Links to a subscription list about a collection of various
dictionaries. The subscription list is searchable.
- ssu.komi.com A collection of
dictionaries on all sorts of subjects. Online search of a
collection of science fiction stories.
- vault.agava.ru Another collection of dictionaries (work
in progress).
- Yugoslav
Corpus ~700,000 words of modern Yugoslav fiction, representing all
of the Serbo-Croatian speaking areas of the former Yugoslavia.
(Archived at CRL.)
Emily M. Bender (bender at csli dot stanford dot edu)
Last modified: June 17, 2004