Resources for Studying English Words and Usage, Semantics, and Textual Structure

A. Corpora of Contemporary English Usage

British National Corpus. The BNC, a very large corpus of contemporary written and spoken English, has only recently been released for distribution to North America. It can be accessed in three ways: on purchased CD-ROMs (not what we are talking about here); on a trial basis using the downloadable interface SARA; and, in a simplified way, online, where you get the first 50 lines using the queried word or phrase (or all of them, if the corpus has fewer than 50). Each word of the corpus is tagged for part of speech by the CLAWS automated tagger. The "parts of speech" used are several times more numerous than the schoolbook 8 or 9, and the largest set of tags (144) gives the fewest ambiguous taggings. The entire corpus is tagged with TEI markup and comes with a very thorough and useful guide to corpus analysis written by xx and Lou Burnard. The guide is online and can also be downloaded for study or purchased as a book from Oxford University Press. (And check

Comparable to the BNC in size, the HarperCollins Cobuild Direct accesses 50 million words from the Bank of English. It can be accessed on a demo basis via a Java applet. Here too you get the "first 50" citations, but you can select the subcorpus you want to search and restrict and filter the output in other ways. The words are also tagged automatically for part of speech, in this case with the ENGCG tagging program developed in Finland (see GramResources). Its set of parts of speech is much smaller than the BNC's and closer to the set from traditional grammar.
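
The "first 50 lines" style of output can be imitated on any plain-text sample you have on hand. What follows is only a minimal sketch of a keyword-in-context listing, not the code behind the BNC or Cobuild interfaces. The function name kwic, its width and limit parameters, and the sample text are all invented for illustration, and no part-of-speech tagging is attempted.

```python
import re

def kwic(text, query, width=30, limit=50):
    """Return up to `limit` keyword-in-context lines for `query` (case-insensitive)."""
    # Match the query as a whole word or phrase; re.escape keeps punctuation literal.
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    lines = []
    for match in pattern.finditer(flat):
        if len(lines) == limit:
            break  # stop at the first `limit` hits, as the online demos do
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Hypothetical sample text, standing in for a real corpus.
sample = (
    "The heart pumps blood through the arteries. "
    "She knew the poem by heart. "
    "At the heart of the argument lies a metaphor."
)

for line in kwic(sample, "heart"):
    print(line)
```

If there are fewer than 50 hits you simply get them all, which is the behavior described for the online BNC query.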


C. Latent Semantic Analysis

LSA computes lexical spaces for different types of texts and reduces those spaces statistically to locate the key vocabulary in a space of (usually) several hundred dimensions. It follows that what is a related word in one domain of discourse (say, Cardiology) will not necessarily be related in another domain (MesoAmerican History or Literary Criticism). Try, for example, heart. (Databases for only a few domains are available online.)
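
The idea of a reduced lexical space, and of heart keeping different company in different domains, can be sketched on a toy scale. The code below is an assumption-laden illustration, not the Colorado group's implementation: it uses raw word counts and a truncated singular value decomposition from NumPy, keeps only 2 dimensions rather than several hundred, and skips the term weighting that a real LSA system would normally apply. The function names and the two tiny "domains" are made up.

```python
import numpy as np

def lsa_term_vectors(docs, k=2):
    """Map each term to a k-dimensional vector via truncated SVD of a term-document count matrix."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.split():
            counts[index[word], j] += 1
    # Decompose, then keep only the k largest singular values (the "reduction").
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    reduced = u[:, :k] * s[:k]
    return {word: reduced[i] for word, i in index.items()}

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

def neighbors(word, vectors):
    """Other terms in the space, most similar to `word` first."""
    scored = [(cosine(vectors[word], v), w) for w, v in vectors.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)]

# Two invented mini-domains; each "document" is a bag of words.
cardiology = [
    "heart artery blood pressure",
    "heart valve surgery",
    "artery blood clot",
    "valve surgery heart",
]
criticism = [
    "heart poem metaphor",
    "poem stanza metaphor",
    "heart desire narrative",
    "narrative desire plot",
]

cardiology_space = lsa_term_vectors(cardiology)
criticism_space = lsa_term_vectors(criticism)
print("cardiology:", neighbors("heart", cardiology_space)[:3])
print("criticism: ", neighbors("heart", criticism_space)[:3])
```

Because each space is built from its own texts, the words nearest to heart come out of entirely different vocabularies in the two domains.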

Any sample of text can then be matched against the normal vector space and the degree of its "standardness" can be computed. LSA's developers, headed by Walter Kintsch and Thomas K. Landauer at the University of Colorado, claim that it is able to "recognise" student papers as proper to a particular disciplinary domain, and hence can function as an automatic paper grader. LSA goes far beyond a simple measure of the different lexical frequencies of words in different domains (i.e. a simple jargon-matcher) to measure the dependencies of word sets and chains. The developers claim that it grades essays in the disciplines studied as reliably as ETS-trained graders do. You can submit a paragraph and have it analyzed and graded. LSA inspires extreme fear and loathing in many who teach writing in the disciplines.

LSA will also compute the degree of lexical cohesion between pairs of sentences in connected text. Again, this is relative to a discursive domain, so that a text that may be quite cohesive in one domain, or at a general college-freshman reading level, may be less or more so in other domains.
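
One common way to picture sentence-pair cohesion is to add up the vectors of the words in each sentence and take the cosine between neighboring sentences. The sketch below assumes that approach; whether the LSA site computes cohesion in exactly this way is not something this page can confirm. The two-dimensional term vectors are invented by hand purely to show the domain effect, and the names medical_space and literary_space are hypothetical.

```python
import math

def sentence_vector(sentence, term_vectors):
    """Sum the vectors of the known words in a sentence; unknown words are skipped."""
    dims = len(next(iter(term_vectors.values())))
    total = [0.0] * dims
    for word in sentence.lower().split():
        vec = term_vectors.get(word.strip(".,;:!?\"'"))
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def cohesion(sentences, term_vectors):
    """Cosine between each adjacent pair of sentences, in reading order."""
    vecs = [sentence_vector(s, term_vectors) for s in sentences]
    return [cosine(a, b) for a, b in zip(vecs, vecs[1:])]

# Invented term vectors: the same words sit in different places in each domain.
medical_space = {"heart": [1.0, 0.0], "blood": [0.9, 0.1], "artery": [0.8, 0.2]}
literary_space = {"heart": [0.1, 1.0], "blood": [0.3, 0.6], "artery": [1.0, 0.0]}

text = ["The heart pumps blood.", "A blocked artery starves it."]
print("medical: ", cohesion(text, medical_space))
print("literary:", cohesion(text, literary_space))
```

The same two sentences score as tightly linked in the invented medical space and only loosely linked in the literary one, which is the sense in which cohesion is relative to a domain.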

D. Rhetorical Structure Theory

RST, as presented online by William Mann of the Summer Institute of Linguistics, focuses on connections between clauses (one aspect of cohesion), both marked and unmarked (inferred). The site provides a list of about 50 clause connections, a way of graphing the connections in a text, a text editor, and numerous examples. The connection graphing program can be downloaded and runs on most desktop platforms, but decisions about the connections are still left to human interpretation.


Work in Progress
George L. Dillon
May 30, 2001