Online Resources for Studying English Syntax: Comparisons of Taggers and Parsers

Several interactive and downloadable frontends, part-of-speech (POS) taggers, syntactic parsers, and tree drawers and displayers are now available online.

A. Morphosyntactic taggers

These are programs that tag words in sentences with a grammatical category or Part of Speech. They differ according to the set of tags they use:

  1. CLAWS tag sets 5 (60 tags, used for BNC and COCA) and 7 (159 tags, used for COCA and BNC2)
  2. ENGCG (Constraint Grammar) POS tags at Lingsoft.
  3. Penn Treebank tags

and the type of program ("AI") used:

  1. Rule-Based
  2. Probability-Based
    (Bi- and tri-gram, Markov, maximum entropy)
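
To make the probability-based category concrete, here is a minimal sketch of a bigram (first-order Markov) tagger with Viterbi decoding. The toy training sentences, the add-one smoothing, and all function names are assumptions made for illustration; none of the taggers listed on this page is claimed to work exactly this way.

```python
import math
from collections import Counter, defaultdict

# Toy training data (invented for this sketch), with Penn Treebank-style tags.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

def train_bigram_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)   # trans[previous_tag][tag]
    emit = defaultdict(Counter)    # emit[tag][word]
    vocab = set()
    for sent in sentences:
        prev = "<s>"               # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under counter."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi_tag(words, trans, emit, vocab):
    """Return the most probable tag sequence for words."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags, n_words = len(tags), len(vocab) + 1   # +1 slot for unseen words
    # best[i][t] = (log probability of the best path ending in t at i, backpointer)
    best = [{}]
    for t in tags:
        score = log_prob(trans["<s>"], t, n_tags) + log_prob(emit[t], words[0], n_words)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

trans, emit, vocab = train_bigram_hmm(TRAIN)
print(viterbi_tag(["a", "cat", "barks"], trans, emit, vocab))
```

A rule-based tagger would instead start from a lexicon of candidate tags for each word and discard candidates with hand-written contextual constraints.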

Here is Wikipedia's excellent overview of Part-of-speech tagging.

A. 1 CLAWS tag set(s)

The University Centre for Computer Corpus Research on Language (UCREL) at the University of Lancaster developed the CLAWS (Constituent Likelihood Automatic Word-tagging System) tagger program with several levels of delicacy. You can submit a paragraph of up to 300 words to the tagger and it will return a tagged version fairly quickly. You can choose a coarser or finer tag set: CLAWS 5 (60 parts of speech, used for the bulk of the BNC) or CLAWS 7 (over 160 parts, used for the BNC sampler, BNC World, and BNC XML). The online BNC2 Manual: Guidelines to Wordclass Tagging is very useful, especially for its criteria for hard cases.

A. 2 English Constraint Grammar tag set(s)

ENGCG Documentation.

The VISL project tags with Constraint Grammar tags, along with tense, case, and number information and grammatical function in the sentence. The "Flat Structure" option actually returns a dependency parse. For VISL parsing, see below.

Connexor's Machinese Phrase Tagger also uses a Constraint Grammar tagger. Its tags are spelled out as words, but the full strings of symbols can be found in the Machinese Syntax parser-grapher. Refer to the table of Tag Descriptions.

A. 3 PennTree tags

The complete, detailed PennTree Guide to Part of Speech Tagging is here (31 pages).

TreeTagger produces tagging in vertical POS format only, with an enhanced PennTree tag set. (The tagger is a trainable HMM type and works for other languages too.) Version 3.1 is quite impressive and is an entry in the Great PennTree Tagger Contest (below). Here is an on-line interface. The downloaded version also has a chunker. Binaries are available for Linux and Windows, along with a GUI for Windows and a GUI for all platforms by Laurence Anthony.
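
Vertical format means one token per line. A minimal sketch of reading such output, assuming a three-column, tab-separated layout (token, tag, lemma); the sample text is hand-written for illustration and is not real tagger output, and the names `Token` and `parse_vertical` are hypothetical:

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str
    tag: str
    lemma: str

def parse_vertical(text: str) -> List[Token]:
    """Read one-token-per-line output into a list of Token records."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        tokens.append(Token(*fields))
    return tokens

# Hand-written sample in the assumed layout (not real tagger output).
SAMPLE = "The\tDT\tthe\ndogs\tNNS\tdog\nbark\tVVP\tbark\n.\tSENT\t.\n"

tokens = parse_vertical(SAMPLE)
# Convert to the horizontal word/TAG style that other taggers emit.
print(" ".join(f"{t.word}/{t.tag}" for t in tokens))
```

Converting to the horizontal word/TAG style makes it easier to line the output up against taggers that write running text.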

FreeLing has been developed by the TALP Research Center at the Polytechnic University of Catalonia. It includes a tagger with a (limited) on-line demo and is downloadable for Linux/Unix.

LingPipe uses the large Brown tag set (82 lexical tags plus punctuation) and is trained on the Brown corpus. Two other demos are trained on bio-medical corpora. [Online demos seem to be down 03/2015].

SVMTool is a recent tagger using Support Vector Machines that claims very good accuracy. It is trained on the WSJ corpus.

The Cognitive Computation Group at UIUC offers a demo tagger with color coding in its suite of NLP tools (which include Semantic Role labeling, like FrameNet, and Shallow Parsing into main phrases).

The SMILE Text Analyzer is in fact a POS tagger. It uses HMM and is very fast.

The Stanford NLP Group has put up a Java-based maximum entropy POS tagger that can tag large amounts of text. It tags each word of continuous text with a PennTree POS. Special feature: it has a much slower bidirectional mode as well as a "left three words" mode of operation. Bidirectional scored very well on the Tagger Contest. No demo, but a cross-platform Java program. The package also has a chunker and a good parser (below). It can also be selected in the Antelope suite (below) for MS Windows.

OpenNLP, a set of Java-based tools now housed at Apache, provides a tagger, chunker, and parser (below). They test out very well. No demos.

Great PennTree Tagger Contest: Results of the second heat: Here are slightly edited taggings of Lincoln's Gettysburg Address by SVMTool, OpenNLP, TreeTagger, Stanford Tagger, CCG, and FreeLing tagger, with CLAWS added for comparison, though with a different tag set. (See Scoring Protocol and Parallel Results.)
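
For readers who want to run a comparison of their own, here is a minimal sketch of token-level scoring against a reference tagging. It does not reproduce the contest's Scoring Protocol. The reference tags and the two "tagger" outputs below are invented for illustration and are not contest results; `tagging_accuracy` is a hypothetical helper.

```python
def tagging_accuracy(reference, candidate):
    """Fraction of tokens whose tag matches the reference tagging."""
    if len(reference) != len(candidate):
        raise ValueError("taggings must be aligned token for token")
    if not reference:
        return 0.0
    hits = 0
    for (ref_word, ref_tag), (cand_word, cand_tag) in zip(reference, candidate):
        if ref_word != cand_word:
            raise ValueError(f"token mismatch: {ref_word!r} vs {cand_word!r}")
        hits += ref_tag == cand_tag
    return hits / len(reference)

# Invented reference and outputs (assumptions, not real results).
REFERENCE = [("Four", "CD"), ("score", "NN"), ("and", "CC"),
             ("seven", "CD"), ("years", "NNS"), ("ago", "RB")]
TAGGER_A = [("Four", "CD"), ("score", "VB"), ("and", "CC"),
            ("seven", "CD"), ("years", "NNS"), ("ago", "RB")]
TAGGER_B = [("Four", "CD"), ("score", "NN"), ("and", "CC"),
            ("seven", "CD"), ("years", "NNS"), ("ago", "IN")]

for name, output in [("tagger_a", TAGGER_A), ("tagger_b", TAGGER_B)]:
    print(name, round(tagging_accuracy(REFERENCE, output), 3))
```

Scoring of this kind only makes sense between taggers that share a tag set and a tokenization, which is why CLAWS can be shown alongside the others but not scored against them directly.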

B. Syntactic Parsers and Tree Diagrammers

Here again there are major differences in the kinds of grammars:

  1. those highlighting grammatical relations (dependency grammars)
  2. those highlighting phrase structure trees
  3. and those that output both, facilitating comparisons

B. 1 dependency grammars

Connexor was founded by some of the Helsinki group and offers for sale parsing tools for several languages. For English, it uses an improved version of ENGCG POS tagging with a nifty little Java applet for a dependency tree display of grammatical relations (some of which look more like semantic relations). The key to the grammatical relations annotating the edges is under Dependency functions.

Noah Smith's Ark Research Group, now at the University of Washington, has a demo of TurboParser, which parses and graphs sentences in Stanford Dependency relations and, along with that, provides a FrameNet semantic parse. The parse appears to be done directly into grammatical relations and not by conversion from a phrase structure parse (as with the Stanford Parser core engine). The site gives good explanations of the symbols and diagrams.

The Institute for Natural Language Processing at Uni Stuttgart demonstrates a Semantic Role Labeler.
Dependency relation labels for it, Stanford, and other dependency parsers, and the usefulness of an intermediate parse into constituent structures, are discussed in:
Lingpeng Kong and Noah A. Smith, An Empirical Comparison of Parsing Methods for Stanford Dependencies;
Jinho D. Choi and Martha Palmer, Guidelines for the Clear Style Constituent to Dependency Conversion;
Marie-Catherine de Marneffe and Christopher D. Manning, Stanford typed dependencies manual.

DGA, the Dependency Grammar Annotator, is a little Java-based tool for drawing dependency trees with labels. The online demo offers one set of labels and relations; to customize the list, you have to download the DGA and change the Configuration file.

C&C Tools by James Curran and Stephen Clark can also be had as a downloadable package for Windows, Linux and Mac. It too will produce analysis in terms of grammatical dependency relations (in the RASP set of relations). Try the demo with "GR" ticked.

VISL under Eckhard Bick has extensive tools for tagging, parsing, and graphing, and not just for English. It produces graphs like those on the right, which have double labelling of each node as POS or "g" (group, or phrase) and as core function (S, P, Od, etc.) or as H(ead) or Dep(endent). (A node with a torn edge is one half of a discontinuous one, as P is here.) It gives multiple parsings.

In addition, it now can graph dependency relations and is a good way to learn to read the ENGCG function labels. Also has POS-tagged corpora.

B. 2 phrase structure grammars

The Penn Treebank is a large corpus of articles from the Wall Street Journal that have been tagged with Penn Treebank tags and then parsed into properly bracketed trees according to a simple set of phrase structure rules conforming to Chomsky's Government and Binding syntax. See Building a large annotated corpus..., especially the latter part. For more extensive description, see Annotating Predicate Argument Structure. The full (318 page) manual for PennTreebank II markup is available as a PDF.

The first 10% of the Penn Treebank sentences are available, with both standard PennTree and dependency parsing, as part of the free dataset for the Python-based Natural Language Toolkit (NLTK).
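
A bracketed tree is just nested, labeled parentheses. The following minimal sketch reads one into nested tuples using only the Python standard library. It assumes every bracket carries a label (distributed Treebank files may wrap each tree in an extra unlabeled pair, which this sketch does not handle); the example sentence and the function names are invented for illustration.

```python
import re

def parse_bracketed(text):
    """Parse a labeled bracketing into nested (label, children) tuples.

    Leaves (the words) are plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                               # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree

def leaves_with_tags(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(leaves_with_tags(child))
    return pairs

tree = parse_bracketed("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")
print(leaves_with_tags(tree))
```

The same nested structure is what tree drawers render graphically: each tuple is a node, and its list holds the daughters.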

Penn Treebank Online allows searching the WSJ Treebank (47K sentences) and two other corpora of machine-tagged sentences, 500K and 5M sentences from Wikipedia. These can be searched by word, phrase, or subtree phrasal configurations.

The latest incarnation of the Memory Based Tagger and Timbl learning software provides a shallow parser demo trained on either the Wall Street Journal corpus (for general English) or on a bio-medical corpus. It returns sentences marked up for POS, phrase, and some grammatical relations. Input is limited to 200 characters per pasting.

B. 3 sites with both kinds of outputs

The Stanford NLP Group's Java-based Parser can compute and report a dependency equivalent of its constituent-structure-based parses. Their set of dependency relations is becoming widely known and is described here. The Stanford CoreNLP demo will take pasted-in text and return POS tagging, dependency graphs, coreference links, and Named Entity Recognition.
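
One common way to derive dependencies from a constituent tree is head percolation: pick a head daughter for each phrase, pass its head word upward, and attach the head words of the other daughters to it. The sketch below illustrates that general idea only. It is not the Stanford converter; the tiny head table is an invented simplification, the example tree is hand-built, and the output is unlabeled (no relation names).

```python
# Hypothetical, heavily simplified head table: for each phrase label,
# the daughter labels to look for, in priority order.
HEAD_RULES = {
    "S": ["VP", "NP"],
    "VP": ["VBZ", "VBD", "VBP", "VB", "VP"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
}

def head_and_deps(tree, deps):
    """Return the head word of tree, appending (head, dependent) pairs to deps.

    A phrase is (label, [daughters]); a preterminal is (tag, word).
    Words stand in for tokens here; real systems use token positions
    so that repeated words stay distinct.
    """
    label, children = tree
    if isinstance(children, str):          # preterminal: (tag, word)
        return children
    child_heads = [head_and_deps(child, deps) for child in children]
    head_index = len(children) - 1         # fallback: rightmost daughter
    for wanted in HEAD_RULES.get(label, []):
        matches = [i for i, child in enumerate(children) if child[0] == wanted]
        if matches:
            head_index = matches[0]
            break
    head = child_heads[head_index]
    for i, dependent in enumerate(child_heads):
        if i != head_index:
            deps.append((head, dependent))
    return head

# Hand-built example tree for "the dog chased a cat".
TREE = ("S", [
    ("NP", [("DT", "the"), ("NN", "dog")]),
    ("VP", [("VBZ", "chased"),
            ("NP", [("DT", "a"), ("NN", "cat")])]),
])

deps = []
root = head_and_deps(TREE, deps)
print("root:", root)
for head, dependent in deps:
    print(f"{head} -> {dependent}")
```

A full converter also has to name each arc (subject, object, determiner, and so on), which is where relation inventories such as the Stanford set come in.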

Also, Bernard Bou's Java-based GrammarScope uses the Stanford package and can display both Phrase Structure trees and grammatical relations as colors. VERY nice once you get the hang of it (you can paste sentences in from the clipboard as well as feed it text files). You can even edit the list and definition of the grammatical relations. Version 2 draws dependency arcs and handles other CoreNLP tasks (NER, coreference).

Proxem Antelope is a package of taggers, chunkers, parsers, and graphers that can draw trees that are both PennTree constituent style and marked for grammatical relations (using the Stanford parser). It integrates much lexical info (WordNet, VerbNet, etc.) and provides some predicate/argument and semantic role analysis as well. It is written in C# for Windows and is free with a harmless registration. It is RAM hungry, but a nice piece of work. It provides multiple parses. No new versions since 2008.

The venerable Survey of English Usage at University College London weighs in with its contribution to the International Corpus of English, namely the British part of it. ICE-GB is a 1 million word corpus of contemporary English written and spoken in Britain, some of which can be downloaded for free and accessed with the free ICECUP tool. This corpus is not only marked up for part of speech; each part is also assigned a syntactic function following the Quirk et al. scheme of SVOA etc. So it displays its texts in trees (oriented side-, top-, or bottom-up as you please) with dual labelling of each node (see sample). This links up very well with the Oxford English Grammar, which is based on the ICE-GB corpus for British English and a Wall Street Journal corpus for American English. In fact, if you have the ICE-GB corpus installed, you can check the diagram for any sentence in the Oxford English Grammar.

In addition, the Survey of English Usage offers the Internet Grammar of English, an online tutorial in English syntax of the double-layered kind used in ICE. It has self-correcting check-off quizzes and animations of syntactic movements.

Old Time Religion: Gene Moutoux, retired from Eastern High School in Louisville, KY, has put up extensive tutorial examples of sentences diagrammed according to Reed-Kellogg principles (1877 et seq.). For more than a century, this was sentence diagramming in America.

This is one of four sites of (on-line) Resources for English language study maintained by George Dillon, University of Washington. The others are:

Phonetics Resources, Corpus Resources, Semantics Resources

George L. Dillon
University of Washington
4 November 1999
Revised 10 March 2000
Again, 16 April 2000 Again, 12 February 2001 Again, 18 November 2001, January 2003, 2004, March 2005, May 2007, December 2009, November 2010, March 2015