SYNTEXT is an SGML DTD providing elements and attributes to mark up text in English for
Any text marked up for these features and identifying itself as DOCTYPE SYNTEXT is an SGML document and can be browsed in an SGML browser
or viewer such as SoftQuad's free Windows browser Panorama or the
costwish viewer for X Windows being developed by Peter Murray-Rust. SYNTEXT is an SGML application whose purpose
is to provide markup for the analysis of syntactic and textual structure. A marked-up text can be viewed
as a tree and in other modes, and can be searched with context-sensitive and contingent scans, which makes it very powerful for stylistic analysis (once a passage is marked up!). The DTD is easily
modifiable and can be very useful as a test bed for rule writing in syntax courses. The X Window version can use Nick Ing-Simmons' text-to-speech synthesizer
rsynth to speak any word, phrase, or sentence displayed by clicking on it.
This, for example, is a reduced snapshot of a sample paragraph in Panorama. On the left, you can see the default tree navigator display, a sort of sideways-oriented constituent structure tree diagram. Clicking on
a node in the diagram will highlight the string of words in the text contained in that node. The right side has a special coreference display switched on (the default is plain ASCII text) that shows pronouns and other anaphoric NPPs and their antecedents (flagged here with the little blue target). Clicking on one of the underlined anaphoric phrases will highlight the antecedent NPP (which immediately follows its "target").
This is the same sentence displayed in costwish. Here SYNTEXT has its own menu allowing you to choose whether to see or hear text and to "say" it. The nodes in the tree are all active. The tree can be customized via the GIs button. Coreference is indicated by means of arrows. In the Table of Contents display to the left, nodes can be expanded or contracted by clicking "+" or "-".
The complete package includes HTML-based Help documentation of the DTD, sample passages, and style sheets (and navigators) to control the display in Panorama.
Some directions, notes and shoptalk follow. But click, if you would rather,
psgml, Lennart Staflin's Lisp package for emacs/xemacs. It and emacs now run under Windows 95. If you use psgml, you will want James Clark's sgmls parser as well (aka "nsgmls", SP).
psgml via "M-x sgml-mode" (Note well: there is another sgml-mode packaged with emacs, so be sure you have installed psgml over it, along with some initializing lines in .emacs (_emacs in Windows 95)). Now put the document identifier on the top line: <!DOCTYPE SYNTEXT SYSTEM "/yourpath/syntext.dtd">. Clicking on this should cause the DTD to load. If not (and you are using xemacs), pull down the DTD menu and click "Parse DTD". You should now be fully enabled with smart markup: only the tags valid at any given point will be offered to you. Using psgml takes some practice, but it is very powerful and worth the effort. Use the "validate" command on the SGML menu after every sentence to get sgmls to clear you before going on.
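If you would rather run the same check outside the editor, here is a minimal Python sketch. It assumes that James Clark's nsgmls is installed and on your PATH, and that the file begins with the DOCTYPE line shown above. The function name validate and its runner parameter are inventions of this sketch; the -s option tells nsgmls to suppress its normal output so that only error messages are reported.

```python
import subprocess


def validate(path, runner=subprocess.run):
    """Parse `path` with nsgmls and return its error lines.

    An empty list means the parser reported nothing. `runner` is a
    hypothetical hook (defaulting to subprocess.run) so the function can
    be exercised without nsgmls installed.
    """
    # -s: suppress normal output; errors still go to standard error.
    result = runner(["nsgmls", "-s", path], capture_output=True, text=True)
    return [line for line in result.stderr.splitlines() if line.strip()]
```

Calling validate("passage.sgm") after each sentence mirrors the routine described above; the file name here is only a placeholder.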
The basic syntactic markup is an X-bar system drawn from current texts of Generative Syntax. The rules are not designed to filter out ungrammatical strings but to assign an at least plausible surface structure parsing to sentences of printed text. They include HEAD and HEADP and HEADPP projections of HEAD where HEAD can be N, V, C, ADV, ADJ, PREP, and INFL. This generates many non-branching nodes. To make a viewable display, non-branching nodes are pruned so that the lexical item appears attached to the higher node (e.g. [NPP]-[NP]-[N]-people becomes [NPP]-people).
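The pruning step can be sketched in a few lines of Python. This is an illustration only, not the code the viewers use: the tree representation (a node is a (label, children) pair and a word is a plain string) and the function name prune are assumptions of this sketch.

```python
def prune(node):
    """Collapse non-branching chains so the lexical item hangs from the highest node.

    A node is a (label, children) tuple; a lexical item is a plain string.
    """
    if isinstance(node, str):
        return node
    label, children = node
    # Skip every intermediate node that has exactly one non-lexical child.
    while len(children) == 1 and not isinstance(children[0], str):
        _, children = children[0]
    return (label, [prune(child) for child in children])


chain = ("NPP", [("NP", [("N", ["people"])])])
print(prune(chain))  # ('NPP', ['people'])
```

Branching nodes are kept, and pruning continues inside each of their children, so only the unary chains disappear from the display.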
The HEAD categories are all obligatory but may be empty, as in the case of INFL (and C) and ellipsed (or conjunction-reduced) categories. The two exceptions to this are the rules allowing NPP and ADVPP to be expanded as CPP (i.e., to allow a clause as the exponent of noun and adverb double-bar phrases).
NPP and CPP have referential indices ("id" and "ana" attributes); these are optional, but good practice is to assign every NPP either an "id" index or an "ana" to another indexed NPP or CPP.
Prepositional phrases are assigned the grammatical relations of prepcomp or comps (Subject Complement) or one of adjunct/conjunct/disjunct. Prep(ositional) comp(lements) are the lexically subcategorized ones whose semantic role is set by the head of the phrase; PrepPhrases should only be assigned the prepcomp function when in a HEADP constituent; adjuncts can appear in either HEADP or HEADPP nodes.
Three categories have a single standard fixed exponent (INF=to, PRO=PRO, TRA=t); this is indicated in the attributes of these elements and should be entered as character data. These will be invisible in Panorama except under the Coreference and Grammatical Relations style sheets.
Markup specifying the grammatical relation of the categories NPP, ADVPP, and PPP is required. (Markup for the grammatical relation of a CPP is available with the optional values CPPComp, CPPAdjunct, and CPPMain.) A good SGML editor like PSGML will not complete the markup until a required attribute has been specified. The floating tag FIG for rhetorical figure requires the type of the figure to be specified. In this case, as with the small categories, the selected type can be entered as character data. This is done for figures so that they can be extracted by a navigator sheet and displayed in a left-hand frame. FIG is an experiment with "inclusions" and is not to be interpreted as a constituent of the sentence as the other categories are.
There are two "comp" nodes, SPEC (of the CPP) and C (head of the CP); both are filled in a sentence such as What can he be thinking? The form that at the beginning of a relative clause is treated as a C exponent and not a pronoun; hence it has no index. When there is no true relative pronoun at the beginning of the clause (i.e. with that relatives), the trace is coreferenced to the head of the relative clause construction.
As noted, either a referential index or a link to such an index on another element (an anaphoric link-- IDREF) may be assigned to each NPP (and may be assigned to CPP, for sentential antecedents). The "expletives" it and there are treated as NPP without reference and should be assigned dummy indices or none at all. Each index assigned to a (non-coreferential) NPP must be unique in the document.
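A quick way to check the two indexing rules (every "id" unique, every "ana" pointing at an existing "id") is sketched below in Python. The attribute names id and ana come from the DTD as described here; everything else is an assumption of the sketch, including the function name check_indices, the reliance on quoted attribute values, and the simplified sample string, which leaves out the inner NP and N nodes and the required grammatical relation attribute.

```python
import re
from collections import Counter

# Start tags of the two indexed categories, with their attribute text captured.
START_TAG = re.compile(r"<(NPP|CPP)\b([^>]*)>", re.IGNORECASE)
# id="..." or ana="..." inside a start tag (quoted values assumed).
ATTR = re.compile(r'\b(id|ana)\s*=\s*"([^"]*)"', re.IGNORECASE)


def check_indices(markup):
    """Return (duplicate ids, ana values with no matching id), both sorted."""
    ids, anas = [], []
    for tag in START_TAG.finditer(markup):
        for name, value in ATTR.findall(tag.group(2)):
            (ids if name.lower() == "id" else anas).append(value)
    counts = Counter(ids)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    dangling = sorted({value for value in anas if value not in counts})
    return duplicates, dangling


sample = (
    '<NPP id="n1">the committee</NPP> met and '
    '<NPP ana="n1">it</NPP> adjourned; <NPP ana="n7">they</NPP> left'
)
print(check_indices(sample))  # ([], ['n7'])
```

A validating parser will normally catch the same problems if the DTD declares these attributes as ID and IDREF; the sketch is only a convenience for scanning a passage by hand.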
Markup in the form of a conjrel attribute is provided in CPP SPECs. It is used to mark the relation of the CPP clause to the preceding one. It should not be used with truly subordinate CPP. The relations can be either expressed (i.e. with a conjunct) or implied, and in the latter case are tagged with an initial I. So there are eight possible choices: Addit(ive), Adver(sative), Caus(al), and Temp(oral), and the parallel inferred set IAddit, etc. Marking inferred relations goes beyond Halliday and Hasan's treatment of conjunction.
Markup for indicating lexical cohesion is provided as the attribute lexcoh for all major categories. Markup can co-mark pairs of identical words, or synonyms, or members of a lexical field, and can then look to merge pairs into threads and families of related words. NPPs containing nouns marked to link cohesively may not be coreferential: lexical cohesion is a matter of words, not things.
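Merging co-marked pairs into threads amounts to finding connected groups of words, which can be sketched in Python with a small union-find. How lexcoh values encode the links is not specified here, so the sketch simply assumes the pairs have already been extracted as (word, word) tuples; the function name merge_threads is likewise hypothetical.

```python
def merge_threads(pairs):
    """Group cohesively linked words into threads (connected components)."""
    parent = {}

    def find(word):
        # Follow parent links to the representative, shortening the path as we go.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    threads = {}
    for word in parent:
        threads.setdefault(find(word), set()).add(word)
    return sorted(sorted(thread) for thread in threads.values())


links = [("storm", "gale"), ("gale", "wind"), ("ship", "vessel")]
print(merge_threads(links))  # [['gale', 'storm', 'wind'], ['ship', 'vessel']]
```

Because the grouping works on words rather than referents, two NPPs can land in the same thread without being coreferential, which is the distinction drawn above.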
The traditional figures of words and thought are more maneuvers (or tropes) than elements of textual or grammatical structure. They are not anchored to any particular category and so are represented as "inclusions": tags which can appear anywhere without limitation. The tag FIG has traditional names of figures listed as values for the attribute type, and the correct value should also be typed in as PCDATA in the tag, so that it can be tabulated. At present, the figures supplied are anaphora (in the sense of repetition), antithesis, asyndeton, correctio, diacope, hyperbole, ironia, isocolon, metonymy, polyptoton, sermocinatio, simile, and synecdoche.
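The tabulation itself can be sketched in Python as a count of FIG start tags by their type attribute. The tag FIG and the attribute type are as described above; the regular expression, which accepts quoted or unquoted values, and the function name tabulate_figures are assumptions of this sketch rather than part of the package.

```python
import re
from collections import Counter

# <FIG ... type="simile"> or <FIG type=simile>; the value is a plain word.
FIG_TYPE = re.compile(r'<FIG\b[^>]*?\btype\s*=\s*"?([A-Za-z]+)', re.IGNORECASE)


def tabulate_figures(markup):
    """Return (figure type, count) pairs, most frequent first."""
    counts = Counter(value.lower() for value in FIG_TYPE.findall(markup))
    return counts.most_common()


passage = (
    '<FIG type="simile">simile</FIG> like a cloud '
    '<FIG type=antithesis>antithesis</FIG> not this but that '
    '<FIG type="simile">simile</FIG> as the sea'
)
print(tabulate_figures(passage))  # [('simile', 2), ('antithesis', 1)]
```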