Sociolinguistics Laboratory Corpus Holdings:

*=requires special software (provided)

Name                                              # CD-ROMs  Language(s)      Transcribed
CSR-IV Hub 4                                      3          English          OR
CELEX2*                                           1          Dutch, Eng, Ger  PH
CallFriend-English*                               3          AmEng
CallFriend-Farsi*                                 3          Farsi
CallFriend-Hindi*                                 3          Hindi
CallFriend-Japanese*                              3          Japanese
CallFriend-Korean*                                3          Korean           OR
CallFriend-Tamil*                                 3          Tamil
Switchboard-2 Phase II                            32         AmEng (InldN)
TIMIT Acoustic-Phonetic Continuous Speech Corpus  1          AmEng            OR/PH
Treebank-2                                        1          Eng              SC/AN
Urban Voices                                      1          Brit Eng

Other column groups in the table (cell entries not recoverable here):

Type of Data: Speech, Synt Phrase, Broadcast News Speech, Soundfile only, Soundfile (Waveform), Textual, Transcribed

Register: RP, WL (carrier), Sent, IV, Cas

Demographic Data: None, Gender, Age, Region, Education

RP: Reading passage style

WL: Wordlist style (words in carrier frames)

Sent: Sentence (a reading style)

IV: Interview style

Cas: Casual speech style

OR: Orthographic transcription

PH: Phonetic transcription

SC: Syntactic Constituents

AN: NLP Annotations


 

Notes regarding table:

1. Sound files are typically in .wav or a proprietary format

2. CallFriend recordings are two-party (dyadic) conversations

Other places you can find corpora:

1. EE Library

2. Online: the Linguistic Data Consortium (LDC) now makes a few of its most popular corpora available online, in searchable form, to guests who do not have an LDC subscription.

            a. A great place to begin is Emily Bender’s NWAVE31 webpage,
            "Corpus Methods for Sociolinguists": http://faculty.washington.edu/ebender/links.shtml

 

Typical Sociolinguistic Practice

1. Sociolinguist acquires her own or someone else’s data (large texts or soundfiles)

2. Transcribes or obtains a transcription of the data (e.g., entire corpus or extracted words)

3. Manually codes these data (phones, phonemes, part of speech, constituent tags) so that each token is linked to a timestamp in the recording and to a category of interest (vowel identity, F1, F2), according to the researcher’s dependent and independent variables

4. Sets up a database program to hold the measures taken from these data (phonetic, sociolinguistic, syntactic, phonological, etc.) as well as codes for speaker variables (age, gender, class, network score, etc.)

5. Sorts the data so that particular forms of interest may be tallied

6. Creates summary tallies of the forms of interest (for example, each speaker’s average frequency of (-t/d) deletion)

7. Compares the tallies obtained for each speaker or group of speakers
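
The sorting, tallying, and comparing steps in this list can be sketched in a few lines of Python. This is only an illustration, not a prescribed workflow: the Token record, its field names, and the eight coded tokens are all invented for the example, and a real project would load its codes from whatever database program it uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Token:
    """One hand-coded (-t/d) token. All field names are hypothetical."""
    speaker: str   # speaker code
    group: str     # a speaker variable, here an invented age group
    time_s: float  # timestamp in the recording, in seconds
    deleted: bool  # True if the final t/d was deleted


# Invented data for illustration only.
tokens = [
    Token("S1", "younger", 12.4, True),
    Token("S1", "younger", 30.1, False),
    Token("S1", "younger", 45.9, True),
    Token("S1", "younger", 51.0, True),
    Token("S2", "older", 8.7, False),
    Token("S2", "older", 19.2, False),
    Token("S2", "older", 27.5, True),
    Token("S2", "older", 40.3, False),
]


def rates_by(tokens, field):
    """Return the deletion rate for each value of the given Token field."""
    counts = defaultdict(lambda: [0, 0])  # value -> [deleted, total]
    for tok in tokens:
        key = getattr(tok, field)
        counts[key][0] += tok.deleted
        counts[key][1] += 1
    return {key: deleted / total for key, (deleted, total) in counts.items()}


speaker_rates = rates_by(tokens, "speaker")
group_rates = rates_by(tokens, "group")

# Compare the tallies speaker by speaker.
for speaker in sorted(speaker_rates):
    print(speaker, speaker_rates[speaker])
```

Grouping on a field name keeps the same function usable for any coded speaker variable (age, gender, class, and so on) without rewriting the tally.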

 

Corpora can provide the data...

The special software included on the CDs supports querying, which can help you

...identify forms in the data

...find out the frequencies of forms in the data

...check preceding and following contexts for forms

...and much more!
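
To make these query types concrete, here is a rough Python sketch of what such a query does. It is not the software shipped on the CDs: the sample sentence, the tokenizer, and the function names are all assumptions made for the illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, form, window=2):
    """Return (preceding context, form, following context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == form:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


# Invented sample text standing in for a transcribed corpus.
sample = "I was walking and she was talking so we kept walking"
tokens = tokenize(sample)

freqs = Counter(tokens)                # frequency of every form
hits = concordance(tokens, "walking")  # contexts for one form

print(freqs["walking"])
for left, form, right in hits:
    print(" ".join(left), "[" + form + "]", " ".join(right))
```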

 

 

Demonstrations

So, what’s on the corpus CDs that we have?

 

CallFriend

Q: Why would I use CallFriend? A: CallFriend is useful if you want unscripted conversational data. Each recording has two channels, one for each speaker in the conversation. To separate these channels, you need speech analysis software, such as Praat, that can read the data from multiple channels separately.
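
If Praat is not at hand, the channel split can also be approximated in Python. The sketch below is an alternative to the Praat route, not part of the CallFriend distribution. It assumes the recording has already been converted to an uncompressed two-channel PCM .wav file (files in a proprietary format would need converting first), and the function name is made up for the example.

```python
import wave


def split_stereo(src_path, left_path, right_path):
    """Write each channel of a two-channel PCM WAV file to its own mono file."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel file")
        width = src.getsampwidth()  # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Samples are interleaved: channel 1, channel 2, channel 1, channel 2, ...
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(bytes(data))
```

A call such as split_stereo("call.wav", "speaker_a.wav", "speaker_b.wav"), with placeholder file names, would then give one file per speaker.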

 

Q: Where do the CallFriend data come from? A: They are telephone conversations between native speakers of the relevant language, recorded in the USA or overseas.

 

1. The CD is in ISO 9660 format (readable on both Mac and PC)

 

2. Go to a folder on the CD. There are three choices: “training”, “test”, and “evaluation”. For most purposes it does not matter which you use, as all three contain unscripted conversation. Note: the conversations are NOT transcribed.

 

3. Open the DOC folder to learn about the contents of the CD

 

4. Opening the files in the file folder list:

            .tbl files may be opened in Excel
