Corpus Methods for Sociolinguistics
Workshop at NWAV 31 - October 10, 2002
Workshop handout (4 up)
I intend to continue to maintain and update this page.
If you find software, corpora, or anything else of
interest to sociolinguists working with corpora, please
email me (bender at csli dot stanford dot edu).
New corpora become available all the time. The
following is a small sampling of those that might
be of particular interest to sociolinguists:
- DASL: Data and
Annotations for SocioLinguistics
- BNC: The British National
Corpus - 100 million words of written (90%) and spoken (10%) British
English. The spoken part includes informal, unscripted conversation
by speakers of different ages, regions, and social classes, as well as
spoken language from formal meetings, radio shows, phone-ins, and
other situations.
- ANC: The American
National Corpus (in progress) - modeled on the BNC, for American
English. The first installment of 10 million words is due to be
released before the end of the year (2002) through the LDC.
- ICE: International Corpus of English. Parallel
corpora of spoken and written English from 20 sites around the world.
- Switchboard (LDC):
Strangers speaking to each other over the telephone on randomly
selected topics (speech files & transcripts, American English)
- CallHome (LDC):
Telephone conversations between close friends & family members.
(speech files & transcripts, many languages)
- CallFriend (LDC):
Like CallHome, but in more languages; not (yet?) transcribed
- LIPPS
(TalkBank): Language Interaction
in Plurilingual and Plurilectal Speakers (codeswitching data)
- CHILDES
(TalkBank): Language
acquisition data (child and adult, first and second language).
- COLT: The Bergen Corpus
Of London Teenage Language
Places to look for corpora
For large quantities of English text, there is
also Project Gutenberg.
Taggers and tokenizers
Searching
- BNCweb A beautiful
search interface for the BNC (World Edition). In principle, it should
be possible to use it with other corpora as well, provided that they
are formatted and marked up properly.
- TIGERSearch: Software for searching treebanks
Coding
- Goldsearch: Software for producing VARBRUL input files
Transcribing
- TalkBank software: CLAN, Transana,
Transcriber
If your data is stored in text files, and you have
access to Unix/Linux, there are some powerful and relatively
simple tools for manipulating those text files.
This page links to a page of
labs with answers that I made for a course on corpus linguistics at
UC Berkeley in 2000.
grep and other unix commands
- tr 'translates' a string by replacing characters
with other characters that you specify
- sort sorts the lines in a file, according to
parameters that you specify
- uniq removes duplicate lines, if they are
adjacent
Click here for a tutorial on using unix
and in particular these commands.
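As a rough illustration of how tr, sort, and uniq fit together, here is a minimal sketch of a word-frequency count. The function name word_freq and the sample sentences are made up for this example, and the pipeline treats anything that is not an ASCII letter as a word boundary, which is an assumption that will not suit every corpus.

```shell
# word_freq (hypothetical helper): read text on stdin, print each word with its count.
word_freq() {
    # tr -cs: turn every run of non-letters into one newline (one word per line)
    # tr:     fold uppercase to lowercase so "The" and "the" count together
    # sort:   bring identical words together, since uniq only sees adjacent lines
    # uniq -c: collapse adjacent duplicates and prefix each word with its count
    # sort -rn: order numerically by count, most frequent first
    tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn
}

# Made-up sample input; with real data you would run: word_freq < yourfile.txt
printf 'the cat saw the dog\nThe dog ran\n' | word_freq
```

The first sort matters: without it, uniq would only merge duplicates that happen to be next to each other in the text.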
The star of all unix commands, for linguistic purposes
anyway, is grep. Grep allows you to search a file
for any lines that match a regular expression which you specify.
Click here for a tutorial on using
grep, and its cousin, egrep.
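To give a flavor of what a grep search looks like, here is a small sketch that pulls out lines containing a word ending in -in or -ing. The function name find_ing, the pattern, and the three sample lines are all invented for illustration. The pattern is deliberately crude: it will also match words such as "thing" or "within", so the output would still need hand-checking.

```shell
# find_ing (hypothetical helper): print stdin lines with a word ending in -in or -ing.
find_ing() {
    # -E uses extended regular expressions (the syntax egrep provides)
    # -i ignores case
    # [a-z]+in   one or more letters followed by "in"
    # g?         an optional "g"
    # ([^a-z]|$) then a non-letter or the end of the line
    grep -E -i '[a-z]+ing?([^a-z]|$)'
}

# Made-up sample lines; with real data you would run: find_ing < transcript.txt
printf 'I was walkin home\nshe is walking to work\nthey walked back\n' | find_ing
```

Adding the -n option to grep prints the line number of each match, which helps when going back to the transcript to code tokens.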
Perl
Perl is a general-purpose programming language which
is particularly well-suited to text manipulation.
Click here for a tutorial on using
perl.
I also strongly recommend Learning Perl (3rd edition) by
Randal Schwartz and Tom Phoenix (2001).
Emily M. Bender (bender at csli dot stanford dot edu)