Sociolinguistics Laboratory Corpus Holdings:
Column groups: Type of Data (Speech through Transcribed), Register (RP through None), Demographic Data (Gender through Education).

| Name (* = requires special software, provided) | # CD-ROMs | Language(s) | Speech | Synt Phrase | Broadcast News Speech | Soundfile only | Soundfile (Waveform) | Textual | Transcribed | RP | WL (carrier) | Sent | IV | Cas | None | Gender | Age | Region | Education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CSR-IV Hub 4 | 3 | English | √ | | √ | | √ | | OR | √ | | | √ | | | | | | |
| CELEX2* | 1 | Dutch, Eng, Ger | | √ | | | | √ | PH | | | | | | √ | | | | |
| CallFriend-English* | 3 | AmEng | √ | | | √ | | | | | | | | √ | | √ | √ | | √ |
| CallFriend-Farsi* | 3 | Farsi | √ | | | √ | | | | | | | | √ | | √ | √ | | √ |
| CallFriend-Hindi* | 3 | Hindi | √ | | | √ | | | | | | | | √ | | √ | √ | | √ |
| CallFriend-Japanese* | 3 | Japanese | √ | | | √ | | | | | | | | √ | | √ | √ | | √ |
| CallFriend-Korean* | 3 | Korean | √ | | | √ | | | OR | | | | | √ | | √ | √ | | √ |
| CallFriend-Tamil* | 3 | Tamil | √ | | | √ | | | | | | | | √ | | √ | √ | | √ |
| Switchboard-2 Phase II | 32 | AmEng (InldN) | √ | | | √ | | | | | | | | √ | | | | | |
| TIMIT Acoustic-Phonetic Continuous Speech Corpus | 1 | Am Eng | √ | | | √ | √ | | OR/PH | | | √ | | | | √ | √ | √ | |
| Treebank-2 | 1 | Eng | | √ | | | | √ | SC/AN | √ | | | | | √ | | | | |
| Urban Voices | 1 | Brit Eng | √ | | | √ | | | | | √ | | | √ | | √ | √ | √ | √ |

RP: Reading passage style
WL: Wordlist style (words in carrier frames)
Sent: Sentence (a reading style)
IV: Interview style
Cas: Casual speech style
OR: Orthographic transcription
PH: Phonetic transcription
SC: Syntactic Constituents
AN: NLP Annotations
Notes regarding table:
1. Format of soundfiles is typically .wav or proprietary
2. CallFriend recordings are of speakers in dyads
Other places you can find corpora:
1. EE Library
2. Online: the Linguistic Data Consortium (LDC) now makes a few of its most popular corpora available online (in searchable form) to guests who have no LDC subscription.
a. A great place to begin is Emily Bender’s NWAVE31 webpage, "Corpus Methods for Sociolinguists": http://faculty.washington.edu/ebender/links.shtml
Typical Sociolinguistic Practice
1. Sociolinguist acquires her own or someone else’s data (large texts or soundfiles)
2. Transcribes or obtains a transcription of the data (e.g., entire corpus or extracted words)
3. Manually codes these data (phones, phonemes, part of speech, constituent tag) so they may be linked to a timestamp on a recording and to a category of interest (vowel identity, F1, F2) following the researcher’s dependent and independent variables
4. Sets up a database program to contain the measures taken from these data (phonetic, sociolinguistic, syntactic, phonological, etc.) as well as codes for speaker variables (age, gender, class, network score, etc.)
5. Sorts the data so particular forms of interest may be tallied
6. Creates summary tallies of the forms of interest (average frequencies of each speaker’s usage of (-t/d) deletion, for example)
7. Compares tallies obtained for each speaker or group of speakers
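The tallying and comparison steps at the end of this list can be sketched in a few lines of Python. This is only an illustration, not a prescribed workflow: the token list, the speaker labels, and the function name `deletion_rates` are all invented here, and a real project would read its coded tokens from a database or spreadsheet.

```python
from collections import defaultdict

# Hypothetical coded tokens: (speaker, gender, deleted), where deleted is True
# when the final /t/ or /d/ was not realized. Values are invented for illustration.
tokens = [
    ("S1", "F", True),
    ("S1", "F", False),
    ("S1", "F", False),
    ("S1", "F", True),
    ("S2", "M", True),
    ("S2", "M", True),
    ("S2", "M", False),
    ("S2", "M", True),
]

def deletion_rates(rows, key_index):
    """Return {group: proportion of deleted tokens}, grouping on row[key_index]."""
    counts = defaultdict(lambda: [0, 0])  # group -> [deleted count, total count]
    for row in rows:
        group = row[key_index]
        counts[group][0] += int(row[2])
        counts[group][1] += 1
    return {group: d / n for group, (d, n) in sorted(counts.items())}

by_speaker = deletion_rates(tokens, 0)  # tally per speaker
by_gender = deletion_rates(tokens, 1)   # tally per speaker group
print(by_speaker)
print(by_gender)
```

Grouping on a different column (age band, class, network score) only requires adding that code to each token and changing the index passed to the function.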
Corpora can provide the data...
Special software included on the CDs allows for querying, which can help you
...identify forms in the data
...find out the frequencies of forms in the data
...check preceding and following contexts for forms
...and much more!
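To give a feel for what this kind of querying does, here is a minimal keyword-in-context sketch in Python. It is not the software supplied on the CDs; the function name `concordance`, the sample sentence, and the search pattern are assumptions made up for this example.

```python
import re

def concordance(text, pattern, width=2):
    """Return (preceding words, form, following words) for each word matching pattern."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation before testing the word against the pattern
        bare = word.strip(".,;:!?\"'()")
        if re.fullmatch(pattern, bare, flags=re.IGNORECASE):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, bare, right))
    return hits

# Invented sample text; a real query would run over a corpus transcript.
sample = "I was walking home and she was talking about going out."
hits = concordance(sample, r"\w+ing")

for left, form, right in hits:
    print(f"{left:>20} [{form}] {right}")
print("frequency:", len(hits))
```

The number of hits gives the frequency of the form, and the left and right columns show its preceding and following contexts.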
Demonstrations
So, what’s on the corpus CDs that we have?
CallFriend
Q: Why would I use CallFriend?
A: CallFriend is useful if you want unscripted conversational data. Each recording has two channels, one for each speaker in the conversation. To separate these channels, you need speech analysis software, such as Praat, that can read the data from multiple channels separately.
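If you would rather script the separation, the sketch below shows the idea with Python's standard `wave` module. It assumes the recording has already been converted to an ordinary two-channel PCM .wav file; the files as they come on the CDs may be in a different or proprietary format and may need conversion first. The function name `split_channels` and any file names are hypothetical.

```python
import wave

def split_channels(stereo_path, first_path, second_path):
    """Write each channel of a two-channel PCM WAV file to its own mono file."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel file")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()    # samples per second
        frames = src.readframes(src.getnframes())

    # Samples are interleaved: channel 0, channel 1, channel 0, channel 1, ...
    frame_size = 2 * width
    first = bytearray()
    second = bytearray()
    for start in range(0, len(frames), frame_size):
        first += frames[start:start + width]
        second += frames[start + width:start + frame_size]

    # Write one mono file per speaker, keeping the original sample format
    for path, data in ((first_path, first), (second_path, second)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
```

A call such as `split_channels("conversation.wav", "speaker_a.wav", "speaker_b.wav")` would then leave one file per speaker, each of which can be opened in Praat on its own.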
Q: Where do the CallFriend data come from?
A: They are conversations between native speakers of the relevant language, recorded over telephones in the USA or overseas.
1. The CD is in ISO 9660 format (accessible on both Mac and PC)
2. Go to a folder on the CD. There are three choices: “training”, “test”, and “evaluation”. For most purposes it doesn’t really matter which you use, as all three contain unscripted conversation. Note: the conversations are NOT transcribed.
3. Open the DOC folder to learn about the contents of the CD
4. Opening the files in the file folder list:
.tbl files may be opened in Excel