Sociolinguistics Laboratory Corpus Holdings:

Column groups, left to right: Type of Data, Register, Demographic Data.

Name (* = requires special software, provided) | # CD-ROMs | Language(s) | Speech | Synt Phrase | Broadcast News Speech | Soundfile only | Soundfile (Waveform) | Textual | Transcribed | RP | WL (carrier) | Sent | IV | Cas | None | Gender | Age | Region | Education
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
CSR-IV Hub 4 | 3 | English | √ | | √ | | √ | | OR | √ | | | √ | | | | | |
CELEX2* | 1 | Dutch, Eng, Ger | | √ | | | | √ | PH | | | | | | √ | | | |
CallFriend-English* | 3 | AmEng | √ | | | √ | | | | | | | | √ | | √ | √ | | √
CallFriend-Farsi* | 3 | Farsi | √ | | | √ | | | | | | | | √ | | √ | √ | | √
CallFriend-Hindi* | 3 | Hindi | √ | | | √ | | | | | | | | √ | | √ | √ | | √
CallFriend-Japanese* | 3 | Japanese | √ | | | √ | | | | | | | | √ | | √ | √ | | √
CallFriend-Korean* | 3 | Korean | √ | | | √ | | | OR | | | | | √ | | √ | √ | | √
CallFriend-Tamil* | 3 | Tamil | √ | | | √ | | | | | | | | √ | | √ | √ | | √
Switchboard-2 Phase II | 32 | AmEng (InldN) | √ | | | √ | | | | | | | | √ | | | | |
TIMIT Acoustic-Phonetic Continuous Speech Corpus | 1 | Am Eng | √ | | | √ | √ | | OR/PH | | | √ | | | | √ | √ | √ |
Treebank-2 | 1 | Eng | | √ | | | | √ | SC/AN | √ | | | | | √ | | | |
Urban Voices | 1 | Brit Eng | √ | | | √ | | | | | √ | | | √ | | √ | √ | √ | √

RP: Reading passage style
WL: Wordlist style (words in carrier frames)
Sent: Sentence (a reading style)
IV: Interview style
Cas: Casual speech style
OR: Orthographic transcription
PH: Phonetic transcription
SC: Syntactic constituents
AN: NLP annotations

Notes regarding table:
1. Format of soundfiles is typically .wav or proprietary
2. CallFriend recordings are of speakers in dyads

Other places you can find corpora:
1. EE Library
2. Online: the Linguistic Data Consortium (LDC) now makes a few of its most popular corpora available online (in searchable form) to guests who have no LDC subscription.
   a. A great place to begin is Emily Bender’s NWAVE31 webpage, "Corpus Methods for Sociolinguists": http://faculty.washington.edu/ebender/links.shtml

Typical Sociolinguistic Practice
1. Sociolinguist acquires her own or someone else’s data (large texts or soundfiles)
2. Transcribes or obtains a transcription of the data (e.g., entire corpus or extracted words)
3. Manually codes these data (phones, phonemes, part of speech, constituent tag) so they may be linked to a timestamp on a recording and to a category of interest (vowel identity, F1, F2) following the researcher’s dependent and independent variables
4. Sets up a database program to contain the measures taken from these data (phonetic, sociolinguistic, syntactic, phonological, etc.) as well as codes for speaker variables (age, gender, class, network score, etc.)
5. Sorts the data so particular forms of interest may be tallied
6. Creates summary tallies of the forms of interest (average frequencies of each speaker’s usage of (-t/d) deletion, for example)
7. Compares tallies obtained for each speaker or group of speakers
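
The sorting, tallying, and comparing steps at the end of this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coded tokens, speaker labels, and age groups below are invented, and a real study would read them from the database described above.

```python
from collections import defaultdict

# Hypothetical coded tokens for the (-t/d) variable:
# (speaker, age group, whether the final stop was deleted).
tokens = [
    ("S1", "younger", True),
    ("S1", "younger", False),
    ("S1", "younger", True),
    ("S2", "older", False),
    ("S2", "older", False),
    ("S2", "older", True),
    ("S3", "older", False),
]


def deletion_rates(rows, key_index):
    """Return {key: proportion of tokens showing deletion} for the chosen column."""
    counts = defaultdict(lambda: [0, 0])  # key -> [deleted, total]
    for row in rows:
        key = row[key_index]
        counts[key][0] += int(row[2])
        counts[key][1] += 1
    return {key: deleted / total for key, (deleted, total) in counts.items()}


by_speaker = deletion_rates(tokens, 0)  # tally per speaker
by_group = deletion_rates(tokens, 1)    # tally per group of speakers
```

Comparing `by_speaker` and `by_group` corresponds to the final step: the same tokens are summarized once per speaker and once per group.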

Corpora can provide the data...
Special software included on the CDs can allow for querying, which can help you
...identify forms in the data
...find out the frequencies of forms in the data
...check preceding and following contexts for forms
...and much more!
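
For text that is already transcribed, the same kinds of queries can be approximated without the bundled software. The sketch below is not the software shipped on the CDs; it is a small stand-in that counts forms and lists their preceding and following contexts. The sample sentence and the function names are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def contexts(tokens, form, width=1):
    """Return (preceding, following) token lists for each occurrence of form."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == form:
            before = tokens[max(0, i - width):i]
            after = tokens[i + 1:i + 1 + width]
            hits.append((before, after))
    return hits


# Invented sample text standing in for a transcript.
sample = "I went west and then I just went home, and she went too."
toks = tokenize(sample)
freqs = Counter(toks)                   # frequency of every form
went_contexts = contexts(toks, "went")  # neighbours of one form of interest
```

Raising `width` widens the window of context returned for each hit.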

Demonstrations
So, what’s on the corpus CDs that we have?

CallFriend
Q: Why would I use CallFriend? A: CallFriend is useful if you want unscripted conversational data. Each recording has two channels, one for each speaker in the conversation. To separate these channels, you need speech analysis software, such as Praat, that allows the data from multiple channels to be read separately.
Q: Where do the CallFriend data come from? A: Conversations between native speakers of the relevant language, recorded over telephones in the USA or overseas.
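
To make the idea of one channel per speaker concrete, here is a minimal Python sketch that splits a two-channel recording into two mono signals. It rests on an assumption that does not hold for the CD files as shipped: the recording has first been copied to the local disk and converted to a 16-bit PCM WAV file, since Python's standard `wave` module does not read Sphere files. The function names are hypothetical.

```python
import sys
import wave
from array import array


def split_channels(path):
    """Read a 16-bit two-channel WAV file and return (left, right, sample_rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 2 or wf.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    # Samples are interleaved: left, right, left, right, ...
    return samples[0::2], samples[1::2], rate


def write_mono(path, samples, rate):
    """Write one channel back out as a mono 16-bit WAV file."""
    data = array("h", samples)
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(data.tobytes())
```

Each returned channel then holds one speaker's side of the call and can be saved with `write_mono` for separate analysis.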
1. The CD is in ISO 9660 format (accessible on both Mac and PC).
2. Go to a folder on the CD. There are three choices: “training”, “test”, and “evaluation”. It doesn’t really matter which you use for most purposes, as all three contain unscripted conversation. Note: the conversations are NOT transcribed.
3. Open the DOC folder to learn about the contents of the CD.
4. Opening the files in the file folder list:
   .tbl files may be opened in Excel
   -- .tbl files contain demographic information on calls and speakers
   .doc files may be opened in MS Word
   -- .doc files contain explanations of the codes used in .tbl files
5. Example:
   a. Create a COPY of the file you wish to open ON THE LOCAL HARD DISK. This is important because the files on the CD are “read only” and Excel will not allow you to manipulate data in a read-only file. Likewise, if you wish to add a transcription tier in Praat, you will need to open a local copy of the soundfile.
   b. Open Excel.
   c. Open the copy of the .tbl file you created in (a.) (SpeakerInfo.tbl):
      1. (Mac) Use “Import...” under the “File” menu to open and format the file so that each field occurs in a separate column.
      2. (PC) Use “Get External Data” under the “Data” menu to do this.
      3. In both import menus, it will be necessary to tell the “text import wizard” that the columns are delimited with a pipe (“|”).
   d. Add the column headers from the speakerinfo.doc or callinfo.doc files.
   e. Create the desired column headers for the measurements that you will take from the recordings on the CD.
   f. Open a soundfile from the CD_<lang> folder.
      -- Using Sphere files:
      -- Sphere files can be opened in Praat. It will be necessary for you to tell Praat to “Read two sounds from stereo file”.
   g. Since the files are not transcribed, you may need to devise some means of obtaining a transcription.
   h. Note: in Praat, you can devise a script (macro) for automatic extraction of data, and create a log file that you can append to the Excel file.

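
The import steps above (split on the pipe character, then add column headers) can also be done outside Excel. The following Python sketch shows one way, under clearly labeled assumptions: the column names and the two sample rows are invented, and the real field names must be taken from speakerinfo.doc or callinfo.doc.

```python
import csv
import io

# Hypothetical column names; the real ones are documented in
# speakerinfo.doc / callinfo.doc on the CD.
HEADERS = ["call_id", "sex", "age", "education"]


def read_tbl(handle, headers):
    """Parse pipe-delimited rows into dicts keyed by the supplied headers."""
    reader = csv.reader(handle, delimiter="|")
    return [
        dict(zip(headers, (field.strip() for field in row)))
        for row in reader
        if row  # skip blank lines
    ]


# Made-up rows standing in for a local copy of a .tbl file.
sample = io.StringIO("4001|F|27|16\n4002|M|34|12\n")
rows = read_tbl(sample, HEADERS)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")` on the local copy made in step (a.).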
CELEX
Q: Why would I use CELEX? A: CELEX is useful if you want access to lexical information on Dutch, German and English.
Q: Where do the data in CELEX come from? A: Typically, dictionaries of the various languages. For example, data for Dutch come from a dictionary of contemporary Dutch, a widely-used wordlist, and a large text corpus.
1. The CD is in ISO 9660 format (accessible on both Mac and PC).
2. Go to a language folder on the CD.
3. Open readme.doc to learn about the contents of that folder.
4. Opening the files in the file folder list:
   The files were created for use on the UNIX platform. Conveniently, however, they are in ASCII format and can be opened with software such as Excel by the non-tech user, or with SQL database software by the techie (this is the software in which the databases were created). Data may be queried using a tool called “AWK.” AWK allows the appending of files for combined querying of multiple sources.
   e.g., DOW: orthographic information, including standard and non-standard spellings. There are syllable frequency counts, phone frequencies, etc.
   ** What’s a “lemma”? (Note: mathematicians, logicians and linguists each use this term in a different way.) General definitions are “a premise”; “a secondary proposition”; “a theme or subject.” For the linguist, though, it is “a lexical form having the same stem, part of speech and word sense.”
   File folders:
   dol   Dutch Orthography, Lemmas
   dpl   Dutch Phonology, Lemmas
   dml   Dutch Morphology, Lemmas
   dsl   Dutch Syntax, Lemmas
   dfl   Dutch Frequency, Lemmas
   dow   Dutch Orthography, Wordforms
   dpw   Dutch Phonology, Wordforms
   dmw   Dutch Morphology, Wordforms
   dfw   Dutch Frequency, Wordforms
   dct   Dutch INL Corpus Types
   dab   Dutch Abbreviations
   dfs   Dutch Frequency, Syllables
5. Example:
   a. Create a COPY of the file you wish to open ON THE LOCAL HARD DISK. This is important because the files on the CD are “read only” and Excel will not allow you to manipulate data in a read-only file.
   b. Open Excel.
   c. Open the copy of the .cd file you created in (a.) (EPL.CD):
      1. (Mac) Use “Import...” under the “File” menu to open and format the file so that each field occurs in a separate column.
      2. (PC) Use “Get External Data” under the “Data” menu to do this.
      3. In both import menus, it will be necessary to tell the “text import wizard” that the columns are delimited with a backslash (“\”).
   d. Add the column headers (from the readme.doc file).
   e. Continue with your data processing.
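
As an alternative to the Excel import, a backslash-delimited file can be read with a short Python script. This is a sketch only: the column names and sample lines below are invented for illustration, and the actual field layout of each file must be taken from the readme file in its folder.

```python
# Hypothetical column names; take the real ones from the readme file
# for the folder you are working in.
HEADERS = ["id_num", "head", "frequency"]


def read_celex(lines, headers):
    """Parse backslash-delimited lines into dicts keyed by the supplied headers."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line:  # skip blank lines
            records.append(dict(zip(headers, line.split("\\"))))
    return records


# Made-up lines standing in for a local copy of a .cd file.
sample_lines = ["1\\a\\1000\n", "2\\abandon\\50\n"]
records = read_celex(sample_lines, HEADERS)

# Example query: entries whose (invented) frequency is at least 100.
frequent = [r["head"] for r in records if int(r["frequency"]) >= 100]
```

With a real file, `sample_lines` would be replaced by an open file handle on the local copy. Note that `zip` silently drops any fields beyond the number of headers supplied, so the header list should cover every column you need.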