The BNC's very large corpus of contemporary written and spoken English has only recently been released for distribution in North America. It can be accessed three ways: on purchased CD-ROMs (not what we are talking about here), on a trial basis through the downloadable interface SARA, and, in a simplified way, online, where a query returns the first 50 lines containing the word or phrase (or all of them, if the whole corpus contains fewer). Each word of the corpus is tagged for part of speech by the CLAWS automated tagger; the "parts of speech" used are several times more numerous than the schoolbook 8 or 9, and the largest tagset (144 tags) gives the fewest ambiguous taggings. The entire corpus is marked up in TEI, and a very thorough and useful guide to corpus analysis, written by xx and Lou Burnard, is available online; it can also be downloaded for study or purchased as a book from Oxford University Press. (And check http://corp.hum.ou.dk/corpustop.html)
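The "first 50 lines" behavior of the online interface is essentially a capped keyword-in-context (KWIC) search. A minimal sketch in Python (the sample text, context window, and helper name are illustrative, not the BNC's actual implementation):

```python
def kwic(text, keyword, width=30, limit=50):
    """Return up to `limit` keyword-in-context lines for `keyword`."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower().strip('.,;:"') == keyword.lower():
            left = " ".join(words[max(0, i - 5):i])
            right = " ".join(words[i + 1:i + 6])
            hits.append(f"{left:>{width}}  {w}  {right}")
            if len(hits) == limit:
                break  # the online interface stops at 50 citations
    return hits

sample = ("The heart has its reasons. Reason knows nothing of the heart. "
          "Still the heart beats on.")
for line in kwic(sample, "heart"):
    print(line)
```

Real concordancers work over indexed, POS-tagged corpora rather than raw strings, but the capped-hits idea is the same.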
Comparable to the BNC in size, HarperCollins' Cobuild Direct accesses 50 million words of the Bank of English. It can be searched on a demo basis via a Java applet. Here too you get the "first 50" citations, but you can select the subcorpus you want to search and restrict and filter the output in other ways. The words are tagged for part of speech, also automatically, with the ENGCG tagging program developed in Finland (see GramResources). Its tagset is much smaller and closer to that of traditional grammar.
"WordNet® is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets." It is a "thesaurus" built on psycholinguistic principles, developed by George A. Miller and others at Princeton. The on-line version takes a word as input and returns its synonyms and hypernyms; for nouns it also returns coordinate terms (words sharing a hypernym), and for verbs it returns troponyms (the verb counterpart of hyponyms) and entailments.
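The relational structure WordNet navigates can be sketched as a small graph in Python. The mini-lexicon below is invented for illustration (the real database holds over 100,000 synsets), but the hypernym-chain and coordinate-term lookups work the same way:

```python
# Toy WordNet-style lexicon: each word maps to its hypernym (is-a parent).
# These four entries are invented for illustration only.
hypernym = {
    "oak": "tree", "maple": "tree", "tree": "plant", "plant": "organism",
}

def hypernym_chain(word):
    """Walk up the is-a hierarchy, as WordNet does for nouns."""
    chain = []
    while word in hypernym:
        word = hypernym[word]
        chain.append(word)
    return chain

def coordinates(word):
    """Coordinate terms: other words sharing this word's hypernym."""
    parent = hypernym.get(word)
    return sorted(w for w, p in hypernym.items() if p == parent and w != word)

print(hypernym_chain("oak"))   # ['tree', 'plant', 'organism']
print(coordinates("oak"))      # ['maple']
```

Verb relations (troponymy, entailment) would simply be further labeled edge types in the same kind of graph.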
The celebrated Thinkmap (formerly Plumbdesign) Visual Thesaurus is a gorgeous front end for WordNet. It takes a word as input and returns a rotating figure of connected synonyms, each of which can itself be clicked to branch into further synonyms. Read through the information screens.
Lexical FreeNet: "This program allows you to search for relationships between words, concepts, and people. It is a combination thesaurus, rhyming dictionary, pun generator, and concept navigator." LFN enhances the purely lexical associations of the WordNet base by adding databases of people, things-with-their-parts, rhyming words, and more. It allows you to search for related words across a set number of links, and it computes the relatedness of two words via the linking words in the WordNet graph.
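Searching for related words "across a set number of links" amounts to breadth-first search over a word graph. A minimal sketch (the links below are invented stand-ins for LFN's WordNet, rhyme, and part-of databases):

```python
from collections import deque

# Toy undirected word graph; edges stand in for lexical, rhyme, and
# part-of links. All of these connections are invented for illustration.
links = {
    "heart": {"blood", "love", "art"},
    "blood": {"heart", "vein"},
    "love": {"heart", "poetry"},
    "art": {"heart", "poetry"},
    "vein": {"blood"},
    "poetry": {"love", "art"},
}

def path(a, b, max_links=3):
    """Shortest chain of linking words from a to b, within max_links hops."""
    queue = deque([[a]])
    seen = {a}
    while queue:
        chain = queue.popleft()
        if chain[-1] == b:
            return chain
        if len(chain) > max_links:
            continue
        for nxt in links.get(chain[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(chain + [nxt])
    return None

print(path("vein", "love"))   # ['vein', 'blood', 'heart', 'love']
```

The length of the returned chain is one crude measure of relatedness; two words with no chain inside the link limit come back as unrelated.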
LSA computes lexical spaces for different types of texts and reduces those spaces statistically to locate the key vocabulary in a space of (usually) several hundred dimensions. It follows that a word related to another in one domain of discourse (say, Cardiology) will not necessarily be related in another (MesoAmerican History or Literary Criticism). Try, for example, heart. (Databases for only a few domains are available online.)
Any sample of text can then be matched against the normal vector space, and the degree of its "standardness" computed. LSA's developers, headed by Walter Kintsch and Thomas K. Landauer at the University of Colorado, claim that it can "recognise" student papers as proper to a particular disciplinary domain and hence can function as an automatic paper grader. LSA goes far beyond a simple measure of the differing lexical frequencies of words across domains (i.e., a simple jargon-matcher) to measure the dependencies of word sets and chains. Its developers claim to grade essays in the disciplines studied as reliably as ETS-trained graders. You can submit a paragraph and have it analyzed and graded. LSA inspires many who teach writing in the disciplines with extreme fear and loathing.
LSA will also compute the degree of lexical cohesion between pairs of sentences in connected text. Again, this is relative to a discursive domain, so a text that is quite cohesive in one domain, or at a general college-freshman reading level, may be more or less so in others.
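The core of LSA is reducing a word-by-document count matrix with singular value decomposition and comparing words (or sentences, as sums of their word vectors) by cosine in the reduced space. A NumPy sketch, with an invented five-word, three-document matrix standing in for a real corpus and a two-dimension reduction standing in for the usual several hundred:

```python
import numpy as np

# Rows = words, columns = tiny "documents"; counts invented for illustration.
words = ["heart", "attack", "artery", "poem", "metaphor"]
counts = np.array([
    [4, 3, 0],   # heart: frequent in the two cardiology docs
    [2, 3, 0],   # attack
    [3, 2, 0],   # artery
    [0, 0, 4],   # poem: only in the literary-criticism doc
    [0, 1, 3],   # metaphor
], dtype=float)

# SVD, truncated to k dimensions: each word becomes a k-vector.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]
vec = dict(zip(words, word_vecs))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vec["heart"], vec["artery"]))   # high: same domain
print(cosine(vec["heart"], vec["poem"]))     # low: different domains
```

Cohesion between adjacent sentences would be measured the same way: sum each sentence's word vectors and take the cosine of the pair, so the score depends on which domain's matrix was decomposed.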
RST, as presented online by William Mann of the Summer Institute of Linguistics, focuses on connections between clauses (one aspect of cohesion), both marked and unmarked (inferred). The site provides a list of about 50 clause connections, a way of graphing the connections in a text, a text editor, and numerous examples. The connection-graphing program can be downloaded and runs on most desktop platforms, but decisions about the connections are still left to human interpretation.
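An RST analysis is essentially a tree in which each relation links a nucleus span to a satellite span. A minimal representation in Python (the relation names follow RST usage, but the sample text and its analysis are invented for illustration; the human analyst still chooses the relations):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Span:
    text: str

@dataclass
class Relation:
    name: str                             # e.g. "evidence", "elaboration"
    nucleus: Union["Span", "Relation"]
    satellite: Union["Span", "Relation"]

# An invented two-level analysis of a three-clause text.
analysis = Relation(
    "evidence",
    nucleus=Span("The lake is polluted."),
    satellite=Relation(
        "elaboration",
        nucleus=Span("Fish have been dying for years,"),
        satellite=Span("especially trout and perch."),
    ),
)

def leaves(node):
    """Flatten an analysis back to its clauses."""
    if isinstance(node, Span):
        return [node.text]
    return leaves(node.nucleus) + leaves(node.satellite)

print(leaves(analysis))
```

A graphing tool like Mann's draws exactly this structure; what no program supplies is the judgment that "Fish have been dying" is *evidence* for, rather than, say, a *cause* of, the main claim.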