Sociolinguistics Laboratory Corpus Holdings:

Name                                               # CD-ROMs   Language(s)       Code (see key)
CSR-IV Hub 4                                       3           English           OR
CELEX2*                                            1           Dutch, Eng, Ger   PH
CallFriend-English*                                3           AmEng
CallFriend-Farsi*                                  3           Farsi
CallFriend-Hindi*                                  3           Hindi
CallFriend-Japanese*                               3           Japanese
CallFriend-Korean*                                 3           Korean            OR
CallFriend-Tamil*                                  3           Tamil
Switchboard-2 Phase II                             32          AmEng (InldN)
TIMIT Acoustic-Phonetic Continuous Speech Corpus   1           Am Eng            OR/PH
Treebank-2                                         1           Eng               SC/AN
Urban Voices                                       1           Brit Eng

*=requires special software (provided)

Column groups (per-corpus marks not shown):

Type of Data: Speech, Synt Phrase, Broadcast News Speech, Soundfile only, Soundfile (Waveform), Textual, Transcribed

Register: RP, WL (carrier), Sent, IV, Cas, None

Demographic Data: Gender, Age, Region, Education

RP: Reading passage style

WL: Wordlist style (words in carrier frames)

Sent: Sentence (a reading style)

IV: Interview style

Cas: Casual speech style

OR: Orthographic transcription

PH: Phonetic transcription

SC: Syntactic Constituents

AN: NLP Annotations


 

Notes regarding table:

1. Soundfiles are typically in .wav or a proprietary format

2. CallFriend recordings are of speakers in dyads

 

 

 

 

Other places you can find corpora:

1. EE Library

2. Online: the Linguistic Data Consortium (LDC) now makes a few of its most popular corpora available online, in searchable form, to guests who do not have an LDC subscription.

            a. a great place to begin is with Emily Bender’s NWAVE31 webpage:

            "Corpus Methods for Sociolinguists" http://faculty.washington.edu/ebender/links.shtml

 

Typical Sociolinguistic Practice

1. Sociolinguist acquires her own or someone else’s data (large texts or soundfiles)

2. Transcribes or obtains a transcription of the data (e.g., entire corpus or extracted words)

3. Manually codes these data (phones, phonemes, part of speech, constituent tag) so they may be linked to a timestamp on a recording and to a category of interest (vowel identity, F1, F2) following the researcher’s dependent and independent variables

4. Sets up a database program to contain the measures taken from these data (phonetic, sociolinguistic, syntactic, phonological, etc.) as well as codes for speaker variables (age, gender, class, network score, etc.)

5. Sorts the data so particular forms of interest may be tallied

6. Creates summary tallies of the forms of interest (average frequencies of each speaker’s usage of (-t/d) deletion, for example)

7. Compares tallies obtained for each speaker or group of speakers
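
The tallying and comparison steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the speaker codes, the gender codes and the token values are invented, and a real project would read its coded tokens from the researcher's own database or spreadsheet.

```python
from collections import defaultdict

# Hypothetical coded tokens for the (-t/d) variable:
# (speaker code, gender code, was the stop deleted?)
tokens = [
    ("S01", "F", True),
    ("S01", "F", False),
    ("S01", "F", True),
    ("S02", "M", False),
    ("S02", "M", False),
    ("S02", "M", True),
    ("S02", "M", False),
]

def deletion_rates(rows):
    """Return {speaker: proportion of that speaker's tokens coded as deleted}."""
    counts = defaultdict(lambda: [0, 0])  # speaker -> [deleted, total]
    for speaker, _gender, deleted in rows:
        counts[speaker][0] += int(deleted)
        counts[speaker][1] += 1
    return {spk: d / n for spk, (d, n) in counts.items()}

rates = deletion_rates(tokens)
for speaker in sorted(rates):
    print(speaker, round(rates[speaker], 2))
```

The same per-speaker proportions could then be grouped by gender, age or any other speaker code before comparison.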

 

Corpora can provide the data...

Special software included on the CDs allows querying, which can help you

...identify forms in the data

...find out the frequencies of forms in the data

...check preceding and following contexts for forms

...and much more!
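
To give a feel for what such queries return, here is a minimal Python sketch of a frequency count and a context lookup. It is not the software supplied on the CDs; the function name `concordance` and the sample words are made up for illustration.

```python
from collections import Counter

def concordance(words, target, width=2):
    """Yield (preceding words, match, following words) for each hit on target."""
    for i, w in enumerate(words):
        if w.lower() == target:
            yield (words[max(0, i - width):i], w, words[i + 1:i + 1 + width])

# Hypothetical transcript fragment, already split into words.
words = "I just went west and he kept the best part".split()

freq = Counter(w.lower() for w in words)    # frequency of every form
hits = list(concordance(words, "west"))     # contexts for one form
print(freq["west"])
print(hits)
```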

 

 

Demonstrations

So, what’s on the corpus CDs that we have?

 

CallFriend

Q: Why would I use CallFriend? A: You would find CallFriend useful if you want unscripted conversational data. Each recording has two channels, one for each speaker in the conversation. To separate these channels, you need to use speech analysis software, such as Praat, that allows the data from multiple channels to be read separately.
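
As a rough picture of what channel separation involves, the sketch below splits a sequence of samples into one list per speaker. It assumes the samples have already been decoded (for instance by Praat) and are stored interleaved, i.e. alternating between the two channels; both the function name and the sample values are hypothetical.

```python
def split_channels(samples, n_channels=2):
    """Split an interleaved sample sequence into one list per channel."""
    return [list(samples[ch::n_channels]) for ch in range(n_channels)]

# Hypothetical interleaved samples: A0, B0, A1, B1, A2, B2
interleaved = [10, -3, 12, -4, 11, -2]
speaker_a, speaker_b = split_channels(interleaved)
print(speaker_a)
print(speaker_b)
```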

 

Q: Where do the CallFriend data come from? A: They are conversations between native speakers of the relevant language, recorded over telephones in the USA or overseas.

 

1. The CD is in ISO 9660 format (readable on both Mac and PC)

 

2. Go to a folder on the CD. There are three choices: “training”, “test”, and “evaluation”. It doesn’t really matter which you use for most purposes, as all three contain unscripted conversation.  Note:  the conversations are NOT transcribed.

 

3. Open the DOC folder to learn about the contents of the CD

 

4. Opening the files in the file folder list:

            .tbl files may be opened in Excel

                        -- .tbl files contain demographic information on calls and speakers

            .doc files may be opened in MS Word

                        -- .doc files contain explanation of the codes used in .tbl files

 

5. example:

            a. create a COPY of the file you wish to open ON THE LOCAL HARD DISK. This is important because the files on the CD are “read only” and Excel will not allow you to manipulate data in a read-only file. Likewise, if you wish to add a transcription tier in Praat, you will need to open a local copy of the soundfile.

            b. open Excel

            c. open the copy of the .tbl file you created in (a) (e.g., SpeakerInfo.tbl)

                        1. (MAC) Use “Import...” under the “File” menu to open and format the file so that each field occurs in a separate column

                        2. (PC) Use “Get External Data” under the “Data” menu to do this.

                        3. In both import menus, it will be necessary to tell the “text import wizard” that the columns are delimited with a pipe (“|”).

            d. add the column headers from the speakerinfo.doc or callinfo.doc files

            e. create the desired column headers for the measurements that you will take from the recordings on the CD

            f. open a soundfile from the CD_<lang> folder

                        -- the soundfiles are Sphere files

                        -- Sphere files can be opened in Praat. It will be necessary for you to tell Praat to “Read two sounds from stereo file”.

            g. since the files are not transcribed, you may need to devise some means of obtaining a transcription

            h. Note: in Praat, you can devise a macro for automatic extraction of data, and create a log file that you can append to the Excel file.
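
For those who would rather script the import than use Excel, the pipe-delimited layout described in the steps above can also be read with Python's standard `csv` module. This is a minimal sketch, not part of the CallFriend distribution: the column names in `HEADERS` and the sample values are hypothetical, and the real field names should be taken from the files in the DOC folder.

```python
import csv
import io

# Hypothetical column names; the real ones are documented in the DOC folder.
HEADERS = ["call_id", "gender", "age", "education"]

def read_tbl(handle, headers):
    """Read pipe-delimited rows into a list of dicts keyed by header name."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    return [
        dict(zip(headers, (field.strip() for field in row)))
        for row in reader
        if row  # skip empty lines
    ]

# Stand-in for a local copy of a .tbl file (invented values).
sample = io.StringIO("4065|F|27|16\n4092|M|41|12\n")
rows = read_tbl(sample, HEADERS)
print(rows[0]["gender"], rows[1]["age"])
```

With a real file, the `io.StringIO` stand-in would be replaced by the local copy opened with `open(path, newline="")`.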

 

 

 

 

 

 

 

CELEX

Q: Why would I use CELEX? A: You would find CELEX useful if you want lexical information on Dutch, German and English.

 

Q: Where do the data in CELEX come from? A: Typically, dictionaries of the various languages. For example, data for Dutch come from a dictionary of contemporary Dutch, a widely-used wordlist, and a large text corpus.

 

1. The CD is in ISO 9660 format (readable on both Mac and PC)

 

2. Go to a language folder on the CD.

 

3. Open readme.doc to learn about the contents of that folder

 

4. Opening the files in the file folder list:

The files were created for use on the UNIX platform. Conveniently, however, they are in ASCII format and can be opened by software such as Excel for the non-tech user, or with SQL database software for the techie (this is the software in which the databases were created). Data may be queried using AWK, a text-processing tool. AWK allows the appending of files for combined querying of multiple sources.

e.g., DOW: orthographic information, including standard and non-standard spellings. There are also syllable frequency counts, phone frequencies, etc.
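
The kind of field-based query that AWK performs can be approximated in Python, which may be easier to come by on a Mac or PC. The sketch below assumes backslash-delimited records, as described for the Excel import later in this section; the three-field layout (id, wordform, frequency) and the values are invented for illustration and do not reproduce any actual CELEX file.

```python
def select(lines, field_index, value, sep="\\"):
    """Return the split records whose field at field_index equals value."""
    records = (line.rstrip("\n").split(sep) for line in lines)
    return [
        rec for rec in records
        if len(rec) > field_index and rec[field_index] == value
    ]

# Hypothetical records: id \ wordform \ frequency (invented values).
lines = [
    "1\\walk\\120\n",
    "2\\walked\\85\n",
    "3\\walk\\7\n",
]
matches = select(lines, 1, "walk")
print(len(matches))
print([rec[0] for rec in matches])
```

Lines from several files can be chained together (for example with `itertools.chain`) before calling `select`, which is roughly what appending files for a combined query amounts to.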

 

** What’s a “lemma”? Note that mathematicians, logicians and linguists use this term in different ways. Elsewhere it can mean “a premise”, “a secondary proposition”, or “a theme or subject.” For the linguist, it is “a set of lexical forms having the same stem, part of speech and word sense.”

 

File folders:

           dol     Dutch Orthography, Lemmas

           dpl     Dutch Phonology, Lemmas

           dml     Dutch Morphology, Lemmas

           dsl     Dutch Syntax, Lemmas

           dfl     Dutch Frequency, Lemmas

           dow     Dutch Orthography, Wordforms

           dpw     Dutch Phonology, Wordforms

           dmw     Dutch Morphology, Wordforms

           dfw     Dutch Frequency, Wordforms

           dct     Dutch INL Corpus Types

           dab     Dutch Abbreviations

           dfs     Dutch Frequency, Syllables

 

5. example:

            a. create a COPY of the file you wish to open ON THE LOCAL HARD DISK. This is important because the files on the CD are “read only” and Excel will not allow you to manipulate data in a read-only file.

            b. open Excel

            c. open the copy of the .cd file you created in (a) (e.g., EPL.CD)

                        1. (MAC) Use “Import...” under the “File” menu to open and format the file so that each field occurs in a separate column

                        2. (PC) Use “Get External Data” under the “Data” menu to do this.

                        3. In both import menus, it will be necessary to tell the “text import wizard” that the columns are delimited with a backslash (“\”).

            d. add the column headers (from the readme.doc file)

            e. continue with your data processing
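
As an alternative to the import wizard, a local copy of a backslash-delimited file can be converted to an ordinary comma-separated file with a header row, which Excel opens directly. The following is a sketch only: the column names in `HEADERS` and the sample records are hypothetical, and the real headers must be taken from readme.doc.

```python
import csv
import io

# Hypothetical column names; take the real ones from readme.doc.
HEADERS = ["id_num", "head", "cob"]

def celex_to_csv(src, dst, headers):
    """Copy backslash-delimited records to comma-separated output with a header row."""
    reader = csv.reader(src, delimiter="\\", quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(headers)
    n = 0
    for row in reader:
        if row:  # skip empty lines
            writer.writerow(row)
            n += 1
    return n  # number of records copied

# Stand-ins for the local copy and the output file (invented values).
src = io.StringIO("1\\a\\100\n2\\abandon\\25\n")
dst = io.StringIO()
count = celex_to_csv(src, dst, HEADERS)
print(count)
print(dst.getvalue())
```

With real files, `src` and `dst` would be file objects opened with `newline=""`, the source for reading and the destination for writing.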