Semantically Indexed Hypermedia:
Linking Information Disciplines

Douglas Tudhope and Daniel Cunliffe

University of Glamorgan Web: http://www.glam.ac.uk/
Hypermedia Research Unit Web: http://www.comp.glam.ac.uk/pages/research/hypermedia/
School of Computing Web: http://www.comp.glam.ac.uk/
Pontypridd, CF37 1DL, Wales, UK

Email: dstudhope@glam.ac.uk, djcunlif@glam.ac.uk
Web: http://www.comp.glam.ac.uk/people/staff/dstudhope/ http://www.comp.glam.ac.uk/people/staff/djcunlif/

Categories and Subject Descriptors: H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia; H.3.1 [Information Storage and Retrieval]: Thesauruses; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.5 [Information Storage and Retrieval]: Web-based services; H.3.7 [Information Storage and Retrieval]: Digital Libraries
Additional Key Words and Phrases: Semantic index, Semantic distance measures, Metadata, Dublin Core

Semantic linking has always been a strand of hypermedia research and is becoming central to current attempts to facilitate access to information in large hypertexts and the emerging 'semantic web' [Berners-Lee 1998a]. Because explicitly authored links between information items scale poorly, future large-scale hypertexts are likely to employ a mixture of authored links and indirect, computed links via some form of indexing system. Problems of information access are heightened by the lack of precision of current WWW retrieval technology and by users' unfamiliarity with indexing conventions. There is a critical need for tools that will assist users to formulate and refine queries, and to navigate through information spaces. Recent years have seen the growth of metadata and Digital Libraries, and of interest in applying traditional information science and library cataloguing techniques to the new environment of hypertext and the WWW. Semantic indexing provides a bridge between the various information disciplines. With the growing influence of the Resource Description Framework [Lassila 1999], semantic tagging and cataloguing of information are likely to become a key component of the information architecture of intranet hypertexts and the WWW.

1 Semantic indexing

Representing semantic knowledge about a domain or application area in order to facilitate access to information has been a major focus in hypermedia since the early days (e.g. [Collier 1987], [Trigg 1986]). One approach has been to assign semantic labels or more formal typing to authored hypertext links [Nanard 1991], [Schnase 1993]. Another approach, the one followed here, includes a semantic index layer in its model of hypermedia architecture. In addition to explicitly authored links, each information item is indexed with descriptor terms; frequently more than one term will be required. Frisse and Cousins [Frisse 1989] first introduced the notion of separate index and document spaces to hypertext, observing that different conformations of those spaces allow for different possibilities in automated reasoning. Different types of indexing system are possible, and it is useful to categorise them according to three dimensions [van Rijsbergen 1979]:

whether index terms are automatically derived or manually assigned.
whether index terms belong to a controlled vocabulary or are uncontrolled ('free').
whether terms can be combined as ordered strings representing a single concept when indexing (pre-coordinated terms), e.g. "Association for Computing Machinery", or must be post-coordinated on retrieval. The latter allows the possibility of 'false positives', where an item is returned because it carries all the query terms even though those terms are unconnected within the item.
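
The difference between pre- and post-coordination can be illustrated with a minimal sketch. The three items, their index strings, and the function names below are hypothetical and chosen only to show how a post-coordinated match can return a false positive.

```python
# Hypothetical index: each item carries one or more ordered term strings.
INDEX = {
    "item_a": [("library", "school")],      # a school of librarianship
    "item_b": [("school", "library")],      # a library within a school
    "item_c": [("library",), ("school",)],  # two unrelated single concepts
}


def post_coordinated_search(index, query_terms):
    """Match items that carry every query term, ignoring order and grouping."""
    wanted = set(query_terms)
    hits = []
    for item, strings in index.items():
        item_terms = {term for string in strings for term in string}
        if wanted <= item_terms:
            hits.append(item)
    return sorted(hits)


def pre_coordinated_search(index, query_string):
    """Match items indexed with the whole ordered string as a single concept."""
    wanted = tuple(query_string)
    return sorted(item for item, strings in index.items() if wanted in strings)


print(post_coordinated_search(INDEX, ["library", "school"]))
print(pre_coordinated_search(INDEX, ["library", "school"]))
```

In this toy collection the post-coordinated search returns all three items, although only item_a concerns a library school; the pre-coordinated search returns item_a alone, at the price of requiring the indexer and searcher to agree on the string.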

Information Retrieval (IR) has tended towards automatically generated free text index terms (post-coordinated), weighted by statistical frequency of terms in documents and collections. On the other hand, distinguishing features of a semantic index are that semantic relationships exist between controlled index terms, usually (but not necessarily) the result of manual cataloguing. Semantically indexed hypermedia links are, by definition, computed, corresponding to Intensional-Retrieval links [DeRose 1989]. This allows the possibility of flexible query-based navigation tools.

2 Thesauri and Classification Systems

The semantic index approach employs a set of semantic relationships between index terms, following the well-established thesaurus tradition in information science (ISO 2788, ISO 5964). A large number of thesauri exist, covering a variety of subject domains, for example the Medical Subject Headings [MeSH 1999] and the Art and Architecture Thesaurus [AAT 1999]. Classification systems, such as Dewey Decimal or Library of Congress, focus on hierarchical relationships. These controlled vocabularies are part of standard cataloguing practice in libraries and museums and are now being applied to digital hypertexts via thematic keywords in metadata resource descriptors. For example, the Dublin Core [DC 1999] standard metadata set includes elements for Title, Creator, Date, Format, etc., in addition to the more complex notion of the Subject (or theme) of a resource. Guidelines recommend that, where possible, the Subject element be taken from a relevant controlled vocabulary. Links between concepts in the subject domain can be expressed by the semantic relationships in a thesaurus. The three main thesaurus relationships are Equivalence (equivalent terms), Hierarchical (broader/narrower terms), and Associative (more loosely Related Terms). Sometimes specialisations of the three main relationships are included (for example, distinguishing taxonomic and instance hierarchical relationships). Restricting the set of relationships in this way, a minimalist approach to semantic modelling, permits interoperability of cataloguing/retrieval tools and techniques. It also facilitates automated reasoning over this core set of relationships.
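
As an illustration of how small the core relationship set is, the following sketch represents a thesaurus with only the three relationship types named above. The class and method names and the furniture terms are assumptions made for this example; they are not taken from any of the thesauri cited.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    use_for: set = field(default_factory=set)   # Equivalence: non-preferred synonyms
    broader: set = field(default_factory=set)   # Hierarchical: broader terms
    narrower: set = field(default_factory=set)  # Hierarchical: narrower terms
    related: set = field(default_factory=set)   # Associative: related terms


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def term(self, label):
        """Return the Term for a label, creating it on first use."""
        if label not in self.terms:
            self.terms[label] = Term(label)
        return self.terms[label]

    def add_equivalent(self, preferred, synonym):
        self.term(preferred).use_for.add(synonym)

    def add_hierarchy(self, broader, narrower):
        # Hierarchical links are stored in both directions.
        self.term(broader).narrower.add(narrower)
        self.term(narrower).broader.add(broader)

    def add_related(self, a, b):
        # Associative links are treated as symmetric here.
        self.term(a).related.add(b)
        self.term(b).related.add(a)

    def ancestors(self, label):
        """All broader terms reachable by following the hierarchy transitively."""
        seen, stack = set(), [label]
        while stack:
            for parent in self.term(stack.pop()).broader:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


def build_example():
    """Hypothetical furniture fragment used only for illustration."""
    t = Thesaurus()
    t.add_hierarchy("furniture", "seating")
    t.add_hierarchy("seating", "chairs")
    t.add_hierarchy("chairs", "armchairs")
    t.add_related("chairs", "tables")
    t.add_equivalent("armchairs", "easy chairs")
    return t
```

Because the hierarchical relationship is transitive, a tool can reason that an item indexed with "armchairs" is also, more loosely, about "seating" and "furniture" without those terms having been assigned by the cataloguer.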

3 Using semantic index links

Navigation is provided indirectly by queries to the semantic index space, as opposed to directly following explicit links between information items. The queries can be simple or complex. The conventional hypermedia navigation techniques may be implemented by relatively simple queries [Tudhope 1994], although there would be no particular reason to use a semantic index to achieve that functionality. One additional possibility provided by a semantic index space is an organised set of browsable concept descriptors, as a means of comprehending the associated layer of media items [Bruza 1990], [Pollard 1993]. The user can browse the index space, 'beam down' to view media items of interest, and conversely 'beam up' to the index space from media items. Additionally, when index terms are combined, the user may browse around each term, broadening and narrowing the specificity of description and seeing the effect on likely 'hits' [Pollitt 1997]. Alternatively, the combined terms can be considered as locating a position in a 'hyperindex', permitting a string of terms to be broadened or narrowed in one navigation action [Bruza 1990]. If a user enters a set of query terms as opposed to browsing the index space, equivalence relationships permit a broad entry vocabulary of synonyms to be tied together for retrieval purposes, without the user having to specify the exact term employed for indexing. As a simple example, this document is indexed by a set of controlled vocabulary terms from the ACM Computing Classification [ACM 1998] (see Categories and Subject Descriptors above). In the ACM Digital Library pages, explicit hypertext links can be navigated. In addition, controlled vocabulary index terms can be combined with free text terms when searching the library and the hypertext version of the classification can be browsed as a subject index in order to select terms for searching.
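
The role of equivalence relationships as an entry vocabulary can be sketched as follows. The two preferred terms are taken from the descriptors of this document; the synonym table, the paper identifiers, and the function name are hypothetical.

```python
# Hypothetical entry vocabulary: lower-cased entry term -> preferred index term.
ENTRY_VOCABULARY = {
    "hypertext": "Hypertext/Hypermedia",
    "hypermedia": "Hypertext/Hypermedia",
    "hypertext/hypermedia": "Hypertext/Hypermedia",
    "thesauri": "Thesauruses",
    "thesauruses": "Thesauruses",
}

# Hypothetical index: preferred term -> items indexed with that term.
INDEX = {
    "Hypertext/Hypermedia": {"paper1", "paper2"},
    "Thesauruses": {"paper2", "paper3"},
}


def search(query_terms):
    """Resolve each entry term to its preferred term, then intersect the hits.

    Returns (hits, unknown), where unknown lists the entry terms that have
    no equivalent in the controlled vocabulary.
    """
    hits, unknown = None, []
    for entry in query_terms:
        preferred = ENTRY_VOCABULARY.get(entry.strip().lower())
        if preferred is None:
            unknown.append(entry)
            continue
        items = INDEX.get(preferred, set())
        hits = set(items) if hits is None else hits & items
    return sorted(hits or []), unknown


print(search(["Hypermedia", "thesauri"]))
```

The user types "Hypermedia" and "thesauri" but the match is made on the controlled terms, so the searcher does not need to know which form the indexer used. Reporting unmatched entry terms, rather than silently dropping them, is a design choice of this sketch.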

Beyond this, the inclusion of semantic information in the index space provides the opportunity for knowledge-based hypermedia systems that provide intelligent navigation support and retrieval, with the system taking a more active role in the navigation process than relying on manual browsing alone. For example, rules governing permitted combinations of terms can filter a user's possible navigation options [Arents 1993], [Rada 1993]. Work at the University of Glamorgan explores the potential of reasoning over the semantic relationships in the index space. Traversal of transitive relationships makes possible imprecise matching between query and media item, or between two media items, rather than relying on an exact match of controlled vocabulary terms [Tudhope 1997]. Expanding terms offers an augmented browsing capacity based on measures of distance in the semantic index space. Results can be post-processed for expression in a particular retrieval tool. Various possibilities exist for indirect computed links with such hybrid query/navigation tools [Cunliffe 1997]. For example, information items with semantically close terms can be ranked in the result or destination set, or the system might automatically suggest terms to be considered for inclusion in a query. If facets exist for time and place in the index space, then a result set can be returned as a dynamic guided tour based on temporal or spatial relationships (or indeed other orderings). Alternatively, the focus of a user's navigation can remain in the document (media) space, typically requiring less cognitive overhead than constructing a formal query [Marchionini 1995]. In this case, having found an information item of interest, the navigation action consists of requesting "More items like this one", with the system responsible for a (best-match) similarity measure of the item's index terms. At the cost of greater cognitive demand on the user, the source context for the navigation may be modified and particular media items or terms (de)emphasised (cf. relevance feedback techniques in IR).
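
One way to picture distance-based matching and a "More items like this one" action is the sketch below. It is an assumption-laden illustration, not the measure used in the cited work: the traversal costs, the cut-off, the closeness formula, and the toy terms and items are all hypothetical.

```python
import heapq

# Hypothetical cost of traversing each relationship type.
COST = {"broader": 1.0, "narrower": 1.0, "related": 2.0}

# Hypothetical thesaurus fragment: term -> [(neighbouring term, relationship)].
EDGES = {
    "seating": [("chairs", "narrower"), ("stools", "narrower")],
    "chairs": [("seating", "broader"), ("armchairs", "narrower"), ("tables", "related")],
    "stools": [("seating", "broader")],
    "armchairs": [("chairs", "broader")],
    "tables": [("chairs", "related")],
}

# Hypothetical media items and their index terms.
ITEMS = {
    "photo1": ["armchairs"],
    "photo2": ["chairs"],
    "photo3": ["stools"],
    "photo4": ["lamps"],
}


def semantic_distances(start, max_cost=3.0):
    """Cheapest traversal cost from start to every term within max_cost."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, term = heapq.heappop(queue)
        if cost > best[term]:
            continue  # stale queue entry
        for neighbour, relation in EDGES.get(term, []):
            new_cost = cost + COST[relation]
            if new_cost <= max_cost and new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return best


def closeness(term_a, term_b, max_cost=3.0):
    """1.0 for identical terms, falling towards 0.0; 0.0 beyond the cut-off."""
    distance = semantic_distances(term_a, max_cost).get(term_b)
    return 0.0 if distance is None else 1.0 - distance / (max_cost + 1.0)


def more_like_this(source_item):
    """Rank other items by the average best-match closeness of their terms."""
    source_terms = ITEMS[source_item]
    ranked = []
    for item, terms in ITEMS.items():
        if item == source_item:
            continue
        score = sum(max(closeness(s, t) for t in terms) for s in source_terms)
        score /= len(source_terms)
        if score > 0.0:
            ranked.append((item, score))
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


print(more_like_this("photo1"))
```

Starting from photo1 (indexed "armchairs"), photo2 ("chairs") ranks above photo3 ("stools"), and photo4 is omitted because its term cannot be reached within the cut-off. No item shares an exact term with the source, so an exact-match retrieval would have returned nothing.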

4 Key application to RDF and the WWW

Semantically based retrieval underpins diverse efforts to provide access to distributed multimedia resources, such as the many projects involving SGML (XML) and Z39.50 for networked access to cross-platform information. Major efforts are underway to create subject-based gateways to Internet resources, sometimes combining manually indexed and robot harvested metadata. The W3C Recommendation for a 'machine-understandable' Resource Description Framework supports the thrust of this research [Lassila 1999]. An RDF descriptor might include the Dublin Core element, Subject, specifying a classification or thesaurus to which keywords belong. Precise semantic index retrieval tools will be required to provide a manageable set of results to requests that may span several collections [Doerr 1997], and may involve networked terminology servers and more than one thesaurus or classification. One point worth emphasising is the social dimension to access and the link with existing cataloguing practice. Controlled vocabularies are often the result of standards efforts in subject domains, continue to evolve, and are part of a network of practice and education/training in the information science community. They have the potential to act as a bridge between information provider and seeker, "a semantic road map for searchers and indexers" [Soergel 1995], if tools can be devised that visualise their structure and how they may be used.
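
A descriptor of the kind mentioned above might be generated as in the following sketch. The RDF namespace is that of the W3C Recommendation; the Dublin Core namespace used is the DCMI element set version 1.1, which is an assumption about what a present-day reader would use rather than what was current when this paper was written. The ex:scheme property, its namespace, the function names, and the example resource are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
EX_NS = "http://example.org/terms#"  # hypothetical namespace for the scheme property

for prefix, uri in (("rdf", RDF_NS), ("dc", DC_NS), ("ex", EX_NS)):
    ET.register_namespace(prefix, uri)


def q(namespace, name):
    """Build a namespace-qualified tag in ElementTree's {uri}name notation."""
    return "{%s}%s" % (namespace, name)


def describe(resource_url, title, creator, subjects):
    """Serialise a descriptor; subjects is a list of (value, scheme) pairs."""
    root = ET.Element(q(RDF_NS, "RDF"))
    description = ET.SubElement(
        root, q(RDF_NS, "Description"), {q(RDF_NS, "about"): resource_url}
    )
    ET.SubElement(description, q(DC_NS, "title")).text = title
    ET.SubElement(description, q(DC_NS, "creator")).text = creator
    for value, scheme in subjects:
        # Each subject is a small structured node: the keyword itself plus
        # the controlled vocabulary it was drawn from.
        subject = ET.SubElement(description, q(DC_NS, "subject"))
        node = ET.SubElement(subject, q(RDF_NS, "Description"))
        ET.SubElement(node, q(RDF_NS, "value")).text = value
        ET.SubElement(node, q(EX_NS, "scheme")).text = scheme
    return ET.tostring(root, encoding="unicode")


print(describe(
    "http://example.org/papers/semantic-index",
    "Semantically Indexed Hypermedia",
    "Douglas Tudhope",
    [("H.5.4 Hypertext/Hypermedia", "ACM Computing Classification 1998")],
))
```

Recording the scheme alongside each keyword is what allows a retrieval tool to choose the right thesaurus or classification when expanding or matching the term.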

5 Research issues

A number of key issues for research remain if the potential of significant gains in precision of information access is to be realised.

An advantage of building query functionality into hypertext navigation is a smooth transition between querying and browsing. Can we identify the appropriate extent of cognitive effort demanded by interfaces to navigation tools? How far should the internal workings of matching functions or the detail of the underlying semantic network be brought to the surface?
Some applications may lend themselves to the specialisation of the standard thesaurus relationships into richer sets, particularly the associative relationship. For example, in some situations it may be useful to distinguish various kinds of causal relationships from the generic associative relationship.
The problem of expressing similarity between pre-coordinated strings of semantic index terms needs further investigation. How much should be pre-computed and what can be left to dynamic computation? How best can we express syntax or structure in such strings? This effort converges with work on description logic ontologies [Bullock 1998], [Weinstein 1998].
Various efforts attempt to combine statistical IR and semantic controlled vocabulary approaches. For example, Agosti et al [Agosti 1995] propose a three layer architecture for Hypermedia IR systems combining a statistical index layer and a semantic (thesaurus) layer (see also [Aslandogan 1997], [Chiaramella 1996]). Studies of online searching behaviour have investigated conditions influencing choice of free text or controlled vocabulary terms (e.g. [Fidel 1991]). How should the two approaches be best integrated - should they be seen as different components of a toolkit, or should a matching function incorporate both statistical weighting and semantic measures? In addition, indirect semantic links and explicit authored links will soon be combined in link/search engines. What principles should guide this integration?
The semantic interoperability of overlapping but different thesauri is an important issue for remote access to distributed sets of resources employing controlled vocabularies in metadata. A concept may exist in one vocabulary but not another, or may map (partially) to various concepts.
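
The interoperability issue in the last item can be made concrete with a small sketch of a mapping table between two vocabularies. The terms, the mapping strengths, the threshold, and the function name are hypothetical; how such mappings should be established and weighted is precisely the open question.

```python
# Hypothetical mappings from concepts in vocabulary A to concepts in
# vocabulary B, each with an assumed strength of correspondence (0 to 1).
MAPPINGS = {
    "chairs": [("chairs", 1.0)],                           # exact equivalent
    "dwellings": [("houses", 0.8), ("apartments", 0.6)],   # partial, one-to-many
    "outbuildings": [("sheds", 0.4)],                      # weak partial match
    "stupas": [],                                          # no counterpart in B
}


def translate_query(terms, min_strength=0.5):
    """Translate vocabulary A terms into vocabulary B terms.

    Returns (translated, unmapped): translated maps each source term to its
    target terms, strongest first; unmapped lists terms with no target at or
    above min_strength.
    """
    translated, unmapped = {}, []
    for term in terms:
        targets = [(t, s) for t, s in MAPPINGS.get(term, []) if s >= min_strength]
        if targets:
            translated[term] = sorted(targets, key=lambda pair: -pair[1])
        else:
            unmapped.append(term)
    return translated, unmapped


print(translate_query(["chairs", "dwellings", "outbuildings", "stupas"]))
```

Returning the unmapped terms explicitly lets an interface tell the searcher which parts of a query could not be carried across to the second collection.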

References

[AAT 1999] Art and Architecture Thesaurus Browser, [Online: http://shiva.pub.getty.edu/aat_browser/], 1999.

[ACM 1998] ACM Computing Classification. http://www.acm.org/class/1998/

[Agosti 1995] Maristella Agosti, Massimo Melucci, and Fabio Crestani. "Automatic Authoring and Construction of Hypermedia for Information Retrieval" in ACM Multimedia Systems, 3(1), 15-24, 1995.

[Arents 1993] Hans C. Arents and Walter F. L. Bogaerts. "Navigation without Links and Nodes without Contents: Intensional Navigation in a Third-Order Hypermedia System" in Hypermedia, 5(3), 187-204, 1993.

[Aslandogan 1997] Y. Alp Aslandogan, Chuck Thier, Clement T. Yu, Jon Zou, and Naphtali Rishe. "Using Semantic Contents and WordNet in Image Retrieval" in Proceedings of ACM SIGIR '97, 286-295, 1997.

[Berners-Lee 1998a] Tim Berners-Lee. World Wide Web Design Issues: A Roadmap to the Semantic Web, [Online: http://www.w3.org/DesignIssues/Semantic.html], 1998.

[Bruza 1990] Peter Bruza. "Hyperindices: A Novel Aid for Searching in Hypermedia" in Proceedings of the ACM European Conference on Hypertext '90 (ECHT '90), Versailles, France, 109-122, November 1990.

[Bullock 1998] Joseph Bullock and Carole Goble. "TourisT: The Application of a Description Logic based Semantic Hypermedia System for Tourism" in Proceedings of ACM Hypertext '98, Pittsburgh PA, 132-141, June 1998.

[Chiaramella 1996] Yves Chiaramella and Ammar Kheirbek. "An Integrated Model for Hypermedia and Information Retrieval" in Information Retrieval and Hypertext, Maristella Agosti and Alan Smeaton (editors), Kluwer, 139-178, 1996.

[Collier 1987] George Collier. "Thoth-II: Hypertext with Explicit Semantics" in Proceedings of ACM Hypertext '87, Chapel Hill, NC, 269-289, November 1987.

[Cunliffe 1997] Daniel Cunliffe, Carl Taylor, and Douglas Tudhope. "Query-based Navigation in Semantically Indexed Hypermedia" in Proceedings of ACM Hypertext '97, Southampton, UK, 87-95, April 1997.

[DC 1999] Dublin Core. [Online: http://purl.org/metadata/dublin_core], 1999.

[DeRose 1989] Steven J. DeRose. "Expanding the Notion of Links" in Proceedings of ACM Hypertext '89, Pittsburgh, PA, 249-257, November 1989.

[Doerr 1997] Martin Doerr, Irene Fundulaki and Vassilis Christophidis. "The Specialist Seeks Expert Views: Managing Digital Folders in the AQUARELLE Project" in Proceedings of Museums and the Web, David Bearman and Jennifer Trant (editors), 261-270, 1997.

[Fidel 1991] Raya Fidel. "Searchers' Selection of Search Keys (I-III)" in Journal of the American Society for Information Science, 42(7), 490-527, 1991.

[Frisse 1989] Mark E. Frisse and Steven B. Cousins. "Information retrieval from hypertext: Update on the Dynamic Medical Handbook" in Proceedings of ACM Hypertext '89, Pittsburgh, PA, 199-211, November 1989.

[Lassila 1999] Ora Lassila and Ralph Swick (editors). "Resource Description Framework (RDF) Model and Syntax Specification", World Wide Web Consortium Recommendation, [Online: http://www.w3.org/TR/REC-rdf-syntax/], February 22 1999.

[Marchionini 1995] Gary Marchionini. Information Seeking in Electronic Environments. Cambridge University Press, 1995.

[MeSH 1999] MeSH 1999. Medical Subject Headings homepage. http://www.nlm.nih.gov/mesh/meshhome.html

[Nanard 1991] Jocelyne Nanard and Marc Nanard. "Using structured types to incorporate knowledge in hypertext" in Proceedings of ACM Hypertext '91, San Antonio, TX, 329-344, December 1991.

[Pollard 1993] Richard Pollard. "A hypertext-based thesaurus as a subject browsing aid for bibliographic databases" in Information Processing and Management, 29(3), 345-357, 1993.

[Pollitt 1997] Steven Pollitt, Martin P. Smith, and Patrick A. J. Braekevelt. "View-based Searching Systems" in Proceedings of Joint Workshop of BCS IR and HCI Specialist Groups, Johnson and Dunlop (editors), 73-77, 1997.

[Rada 1993] Roy Rada, Weigang Wang, and Alex Birchall. "Retrieval hierarchies in hypertext" in Information Processing and Management, 29(3), 359-371, 1993.

[Schnase 1993] John L. Schnase, John J. Leggett, David L. Hicks, and Ron L. Szabo. "Semantic Data Modeling of Hypermedia Associations" in ACM Transactions on Information Systems (TOIS), 11(1), 27-49, January 1993.

[Soergel 1995] Dagobert Soergel. "The Art and Architecture Thesaurus (AAT): a critical appraisal" in Visual Resources, 10(4), 369-400, 1995.

[Trigg 1986] Randall H. Trigg and Mark Weiser. "Textnet: A Network-based Approach to Text Handling" in ACM Transactions on Office Information Systems (TOIS), 4(1), 1-23, January 1986.

[Tudhope 1994] Douglas Tudhope, Paul Beynon-Davies, Carl Taylor, and Chris B. Jones. "Virtual Architecture Based on a Binary Relational Model: A Museum Hypermedia Application" in Hypermedia, 6(3), 174-192, 1994.

[Tudhope 1997] Douglas Tudhope and Carl Taylor. "Navigation via Similarity: Automatic Linking Based on Semantic Closeness" in Information Processing and Management, 33(2), 233-242, 1997.

[van Rijsbergen 1979] C. J. "Keith" van Rijsbergen. Information Retrieval. Butterworth, 1979.

[Weinstein 1998] Peter C. Weinstein. "Ontology-based metadata: transforming the MARC legacy" in Proceedings of ACM Digital Libraries '98, 254-263, 1998.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.