Corpus Methods for Sociolinguistics

Emily M. Bender

Workshop at NWAV 31 - October 10, 2002


Workshop handout (4 up)

I intend to continue to maintain and update this page. If you find software, corpora, or anything else of interest to sociolinguists working with corpora, please email me (bender at csli dot stanford dot edu).



Corpora of interest

New corpora become available all the time. The following is a small sampling of those that might be of particular interest to sociolinguists:

Places to look for corpora

For large quantities of English text, there is also Project Gutenberg.


Software for accessing and analyzing corpora

Taggers and tokenizers

Searching

Coding

Transcribing


Basic programming tools

If your data is stored in text files and you have access to Unix/Linux, several powerful and relatively simple tools are available for manipulating those files.
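As a rough illustration of what such tools can do, here is a minimal sketch of a word-frequency count built from standard commands. The choice of commands (tr, sort, uniq), the function name word_freq, and the sample sentences are assumptions made for this example, not taken from the labs or tutorials linked on this page.

```shell
# word_freq FILE: print each word in FILE with its count, most frequent first.
# (word_freq is a hypothetical helper name used only in this sketch.)
word_freq() {
  # 1. tr -cs: replace every run of non-letters with one newline,
  #    giving one word per line
  # 2. tr: fold upper case to lower case so "The" and "the" count together
  # 3. sort | uniq -c: group identical words and count them
  # 4. sort -rn: put the highest counts first
  tr -cs 'A-Za-z' '\n' < "$1" |
    tr 'A-Z' 'a-z' |
    sort | uniq -c | sort -rn
}

# Build a tiny sample "corpus" in a temporary file (invented data).
corpus=$(mktemp)
printf '%s\n' \
  'I was walking to the store' \
  'She was talking and I was listening' \
  'We walked to the park' > "$corpus"

# Show the five most frequent words.
word_freq "$corpus" | head -n 5

rm -f "$corpus"
```

Note that this simple tokenization splits on apostrophes and hyphens, so "don't" is counted as "don" and "t". For real data you would probably want to adjust the character set passed to tr.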

This page links to a set of labs, with answers, that I wrote for a course on corpus linguistics at UC Berkeley in 2000.

grep and other Unix commands

Click here for a tutorial on using Unix, and in particular these commands.

The star of all Unix commands, for linguistic purposes anyway, is grep. It searches a file and prints every line that matches a regular expression you specify.
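To make this concrete, here is a small sketch of a grep search for words ending in "ing". The sample transcript lines, the variable name ing_pattern, and the pattern itself are invented for illustration. The -E option selects extended regular expressions, the syntax that egrep uses.

```shell
# Sample transcript, one utterance per line (invented data).
transcript=$(mktemp)
printf '%s\n' \
  'I was walkin to the store' \
  'She was talking and I was listening' \
  'We kept runnin and jumpin' \
  'They were singing.' > "$transcript"

# A word ending in "ing": one or more letters, then "ing",
# then either a non-letter or the end of the line.
ing_pattern='[[:alpha:]]+ing([^[:alpha:]]|$)'

# Print every line that contains a match.
grep -E "$ing_pattern" "$transcript"

# -c prints only the number of matching lines.
grep -c -E "$ing_pattern" "$transcript"

# -n prefixes each matching line with its line number; -i ignores case.
grep -n -i -E "$ing_pattern" "$transcript"

rm -f "$transcript"
```

With this sample, the first command prints the second and fourth lines. Keep in mind that grep works line by line: -c reports two matching lines here, even though the second line contains two matching words. The pattern is also purely orthographic, so it will match words like "thing" as well as verb forms.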

Click here for a tutorial on using grep and its cousin, egrep.

Perl

Perl is a general-purpose programming language which is particularly well-suited to text manipulation.

Click here for a tutorial on using Perl.

I also strongly recommend Learning Perl (3rd edition) by Randal Schwartz and Tom Phoenix (2001).


Creating and publishing corpora

-----
Emily M. Bender (bender at csli dot stanford dot edu)