This is Syntext--the SGML Grammar Grapher!

SYNTEXT is an SGML DTD providing elements and attributes to mark up text in English for

Any text marked up for these features and identifying itself as DOCTYPE SYNTEXT is an SGML document and can be browsed in an SGML browser or viewer such as SoftQuad's free Windows browser Panorama or the costwish viewer for X Windows being developed by Peter Murray-Rust. SYNTEXT is an SGML application, the purpose of which is to provide markup for the analysis of syntactic and textual structure; a marked up text can be viewed as a tree and in other modes and can be searched with context-sensitive and contingent scans, making it very powerful for stylistic analysis (once a passage is marked up!). The DTD is easily modifiable and can be very useful as a test bed for rule writing in syntax courses. The X Window version can use Nick Ing-Simmons' Text To Speech synthesizer rsynth to speak any displayed word, phrase, or sentence when you click on it.

Viewer snapshots

Syntext in Panorama

This, for example, is a reduced snapshot of a sample paragraph in Panorama. On the left, you can see the default tree navigator display, a sort of sideways-oriented constituent structure tree diagram. Clicking on a node in the diagram will highlight the string of words in the text contained in that node. The right side has a special coreference display switched on (the default is plain ASCII text) that shows pronouns and other anaphoric NPPs and their antecedents (flagged here with the little blue target). Clicking on one of the underlined anaphoric phrases will highlight the antecedent NPP (which immediately follows its "target").

Syntext in costwish

This is the same sentence displayed in costwish. Here SYNTEXT has its own menu that lets you choose to see or hear text and to "say" it. The nodes in the tree are all active. The tree can be customized via the GIs button. Coreference is indicated by means of arrows. In the Table of Contents display to the left, nodes can be expanded or contracted by clicking "+" or "-".

The complete package includes HTML-based Help documentation of the DTD, sample passages, and style sheets (and navigators) to control the display in Panorama.

Some directions, notes and shoptalk follow.

Using SYNTEXT to parse and mark up passages (assuming you know at least a little about X-bar syntax).

It is very difficult to do SYNTEXT markup without an SGML-aware editor such as psgml, Lennart Staflin's lisp package for emacs/xemacs. It and emacs now run under Windows 95. If you use psgml, you will want James Clark's sgmls parser as well (aka "nsgmls", SP).

How to create a marked up text using psgml

First, load psgml via "M-x sgml-mode". (Note well: there is another sgml-mode packaged with emacs, so be sure you have installed psgml over it, along with some initializing lines in .emacs, or _emacs in Windows 95.) Now put the document identifier on the top line: <!DOCTYPE SYNTEXT SYSTEM "/yourpath/syntext.dtd"> . Clicking on this should cause the DTD to load. If not (and you are using xemacs), pull down the DTD menu and click "Parse DTD". You should now be fully enabled with smart markup--only the valid tags at any given point will be offered to you. Using psgml takes some practice, but it is very powerful and worth the effort. Use the "validate" command on the SGML menu after every sentence to get sgmls to clear you before going on.

Notes and protocols

1 Syntactic markup

The basic syntactic markup is an X-bar system drawn from current texts of Generative Syntax. The rules are not designed to filter out ungrammatical strings but to assign an at least plausible surface structure parsing to sentences of printed text. They include HEAD and HEADP and HEADPP projections of HEAD where HEAD can be N, V, C, ADV, ADJ, PREP, and INFL. This generates many non-branching nodes. To make a viewable display, non-branching nodes are pruned so that the lexical item appears attached to the higher node (e.g. [NPP]-[NP]-[N]-people becomes [NPP]-people).
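The pruning of non-branching nodes described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the way Panorama or costwish actually builds its display; the tuple representation of a tree and the function name prune are assumptions made for the example.

```python
# Hypothetical tree representation: a node is (label, children),
# and a leaf is a plain string holding the lexical item.

def prune(node):
    """Collapse chains of single-child nodes, keeping the topmost label."""
    if isinstance(node, str):
        return node
    label, children = node
    children = [prune(child) for child in children]
    # A non-branching node has exactly one child. If that child is itself
    # a node, skip it and attach its content directly to this label.
    if len(children) == 1 and not isinstance(children[0], str):
        return (label, children[0][1])
    return (label, children)


# [NPP]-[NP]-[N]-people becomes [NPP]-people
chain = ("NPP", [("NP", [("N", ["people"])])])
pruned_chain = prune(chain)

# A branching node is kept, but the chains below it are still collapsed.
branching = ("VP", [("V", ["see"]), ("NPP", [("NP", [("N", ["people"])])])])
pruned_branching = prune(branching)
```

Only the display is simplified this way; the marked-up document itself still carries the full HEAD, HEADP, and HEADPP projections.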

The HEAD categories are all obligatory but may be empty, as in the case of INFL (and C) and ellipsed (or conjunction reduced) categories. The two exceptions to this are the rules allowing NPP and ADVPP to be expanded as CPP (i.e., to allow a clause as the exponent of noun and adverb double bar phrases).

1.1 referential indices

NPP and CPP have referential indices ("id" and "ana" attributes); these are optional, but good practice is to assign every NPP either an "id" index or an "ana" to another indexed NPP or CPP.

1.2 prepositional phrases

Prepositional phrases are assigned the grammatical relations of prepcomp or comps (Subject Complement) or one of adjunct/conjunct/disjunct. Prep(ositional) comp(lements) are the lexically subcategorized ones whose semantic role is set by the head of the phrase; PrepPhrases should only be assigned the prepcomp function when in a HEADP constituent; adjuncts can appear in either HEADP or HEADPP nodes.

1.3 small categories

Three categories have a single standard fixed exponent (INF=to, PRO=PRO, TRA=t); this is indicated in the attributes of these elements and should be entered as character data. These will be invisible in Panorama except under the Coreference and Grammatical Relations style sheets.

1.4 required attributes

Markup specifying the grammatical relation of the categories NPP, ADVPP, and PPP is required. (Markup for the grammatical relation of a CPP is available with the optional values CPPComp, CPPAdjunct, and CPPMain.) A good SGML editor like psgml will not complete the markup until a required attribute has been specified. The floating tag FIG for rhetorical figure requires the type of the figure to be specified. In this case, as with the small categories, the selected type can be entered as character data. This is done for figures so that they can be extracted by a navigator sheet and displayed in a left-hand frame. FIG is an experiment with "inclusions" and is not to be interpreted as a constituent of the sentence as the other categories are.

1.5 complementizers

There are two "comp" nodes, SPEC (of the CPP) and C (head of the CP); both are filled in a sentence such as "What can he be thinking?" The form "that" at the beginning of a relative clause is treated as a C exponent and not a pronoun; hence it has no index. When there is no true relative pronoun at the beginning of the clause (i.e. with "that" relatives), the trace is coreferenced to the head of the relative clause construction.

2 Cohesion

2.1 Coreference

As noted, either a referential index or a link to such an index on another element (an anaphoric link--IDREF) may be assigned to each NPP (and may be assigned to CPP, for sentential antecedents). The "expletives" "it" and "there" are treated as NPP without reference and should be assigned dummy indices or none at all. Each index assigned to a (non-coreferential) NPP must be unique in the document.
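The two constraints on indices (every "id" unique, every "ana" pointing at an existing "id") can be illustrated with a short Python sketch. If the DTD declares these attributes as ID and IDREF, as the mention of IDREF suggests, a validating parser such as sgmls should already report such errors; the sketch only makes the rule concrete. The list-of-dictionaries input, the sample index values, and the function name check_indices are assumptions made for the example.

```python
from collections import Counter


def check_indices(elements):
    """Return a list of problem descriptions for id/ana attributes."""
    # Count how often each id value is used across all elements.
    ids = Counter(e["id"] for e in elements if e.get("id"))
    problems = [f"duplicate id: {i}" for i, n in ids.items() if n > 1]
    # Every anaphoric link must point at an id that exists somewhere.
    for e in elements:
        ana = e.get("ana")
        if ana and ana not in ids:
            problems.append(f"ana points to unknown id: {ana}")
    return problems


# Hypothetical attribute values pulled from a marked up passage.
sample = [
    {"gi": "NPP", "id": "n1"},
    {"gi": "NPP", "ana": "n1"},   # fine: n1 exists
    {"gi": "NPP", "id": "n1"},    # problem: n1 used twice
    {"gi": "NPP", "ana": "n9"},   # problem: n9 is never defined
    {"gi": "NPP"},                # expletive with no index at all
]
problems = check_indices(sample)
```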

2.2 Conjunction

Markup in the form of a conjrel attribute is provided in CPP SPECs. It is used to mark the relation of the CPP clause to the preceding one and should not be used with truly subordinate CPPs. The relations can be either expressed (i.e. with a conjunct) or implied, and in the latter case are tagged with an initial I. So there are eight possible choices: Addit(ive), Adver(sative), Caus(al), and Temp(oral) and the parallel inferred set IAddit, etc. Marking inferred relations goes beyond Halliday and Hasan's treatment of conjunction.

2.3 Lexical Cohesion

Markup for indicating lexical cohesion is provided as the attribute lexcoh for all major categories. Markup can co-mark pairs of identical words, or synonyms, or members of a lexical field, and can then look to merge pairs into threads and families of related words. NPPs containing nouns marked to link cohesively may not be coreferential: lexical cohesion is a matter of words, not things.
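Merging co-marked pairs into threads amounts to grouping words that are linked directly or through a chain of other pairs. The following Python sketch shows one way to do that with a small union-find structure. It is not part of the SYNTEXT package; the function name merge_threads and the sample word pairs are assumptions made for the example.

```python
def merge_threads(pairs):
    """Group words linked by pairwise cohesion marks into threads."""
    parent = {}

    def find(word):
        # Follow parent links to the representative of the word's thread.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # shorten the path as we go
            word = parent[word]
        return word

    # Each marked pair joins the two threads its words belong to.
    for a, b in pairs:
        parent[find(a)] = find(b)

    threads = {}
    for word in parent:
        threads.setdefault(find(word), set()).add(word)
    return sorted(sorted(thread) for thread in threads.values())


# Hypothetical pairs marked in a passage.
pairs = [("ship", "vessel"), ("vessel", "boat"), ("storm", "gale")]
threads = merge_threads(pairs)
```

Because "ship" and "boat" are both paired with "vessel", all three end up in one thread even though "ship" and "boat" were never marked as a pair.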

3 Figures

The traditional figures of words and thought are more maneuvers (or tropes) than elements of textual or grammatical structure. They are not anchored to any particular category and so are represented as "inclusions"--tags which can appear anywhere without limitation. The tag FIG has traditional names of figures listed as values for the attribute type, and the correct value should also be typed in as PCDATA in the tag, so that it can be tabulated. At present, the figures supplied are anaphora (in the sense of repetition), antithesis, asyndeton, correctio, diacope, hyperbole, ironia, isocolon, metonymy, polyptoton, sermocinatio, simile, and synecdoche.

George L. Dillon
12 May 1996; revised 25 Jan 1999