ACM Computing Surveys 31(4),
December 1999, http://www.acm.org/surveys/Formatting.html. Copyright ©
1999 by the Association for Computing Machinery, Inc. See the permissions
statement below.
Semantically Indexed Hypermedia: Linking Information Disciplines
Douglas Tudhope and Daniel Cunliffe
University of Glamorgan (http://www.glam.ac.uk/)
Hypermedia Research Unit (http://www.comp.glam.ac.uk/pages/research/hypermedia/)
School of Computing (http://www.comp.glam.ac.uk/)
Pontypridd, CF37 1DL, Wales, UK
Email: dstudhope@glam.ac.uk, djcunlif@glam.ac.uk
Web: http://www.comp.glam.ac.uk/people/staff/dstudhope/, http://www.comp.glam.ac.uk/people/staff/djcunlif/
Categories and Subject Descriptors: H.5.4 [Information
Interfaces and Presentation]: Hypertext/Hypermedia; H.3.1
[Information Storage and Retrieval]: Thesauruses; H.3.3
[Information Storage and Retrieval]: Information Search and
Retrieval; H.3.5 [Information Storage and Retrieval]: web-based
services; H.3.7 [Information Storage and Retrieval]: Digital
Libraries
Additional Key Words and Phrases: Semantic index, Semantic
distance measures, Metadata, Dublin Core
Semantic linking has always been a strand of hypermedia research and is
becoming central to current attempts to facilitate access to information
in large hypertexts and the emerging 'semantic web' [Berners-Lee
1998a]. Due to the scaling problems with explicitly authored links
between information items, it is likely that future large scale hypertexts
will employ a mixture of authored links and indirect, computed links via
some form of indexing system. Problems of information access are
heightened by the lack of precision of current WWW retrieval technology
and by users' unfamiliarity with indexing conventions. There is a critical need
for tools that will assist users to formulate and refine queries, and
navigate through information spaces. Recent years have seen the growth of
metadata, Digital Libraries, and interest in the application of
traditional information science and library cataloguing techniques to the
new environment of hypertext and the WWW. Semantic indexing provides a
bridge between the various information disciplines. With the growing
influence of the Resource Description Framework [Lassila
1999], semantic tagging and cataloguing of information is likely to
become a key component of the information architecture of intranet
hypertexts and the WWW.
Representing semantic knowledge about a domain or application area in
order to facilitate access to information has been a major focus in
hypermedia since the early days (e.g. [Collier
1987], [Trigg
1986]). One approach has been to assign semantic labels or more formal
typing to authored hypertext links [Nanard
1991], [Schnase
1993]. Another approach, the one followed here, includes a semantic
index layer in its model of hypermedia architecture. In addition to
explicitly authored links, each information item is indexed with
descriptor terms - frequently more than one term will be required. Frisse
and Cousins [Frisse
1989] first introduced the notion of separate index and document
spaces to hypertext, observing that different conformations of those
spaces allow for different possibilities in automated reasoning. Different
types of indexing system are possible. It is useful to categorise indexing
systems according to three dimensions [van
Rijsbergen 1979]:
- whether index terms are automatically derived or manually assigned.
- whether index terms belong to a controlled vocabulary or are
uncontrolled ('free').
- whether terms can be combined as ordered strings representing a
single concept when indexing (pre-coordinated terms), e.g. "Association
for Computing Machinery", or must be post-coordinated on retrieval. The
latter allows the possibility of 'false positives', where an item is
returned although the matched terms are unconnected in the source
string.
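The false-positive effect of post-coordination can be illustrated with a minimal Python sketch. The two documents, their descriptor strings, and the function names are hypothetical and chosen only for illustration; a real index would be far larger and would normally apply stop-word and stemming rules that are omitted here.

```python
# Hypothetical index: each document carries pre-coordinated descriptor strings.
PRE_INDEX = {
    "doc_a": {"association for computing machinery"},
    "doc_b": {"housing association", "computing machinery"},
}


def post_coordinate(pre_index):
    """Flatten each descriptor string into a bag of single terms."""
    return {
        doc: {word for phrase in phrases for word in phrase.split()}
        for doc, phrases in pre_index.items()
    }


def search_pre(phrase, pre_index):
    """Match the whole ordered string as a single concept."""
    return sorted(doc for doc, phrases in pre_index.items() if phrase in phrases)


def search_post(terms, post_index):
    """Boolean AND over single terms, combined only at retrieval time."""
    wanted = set(terms)
    return sorted(doc for doc, words in post_index.items() if wanted <= words)


POST_INDEX = post_coordinate(PRE_INDEX)
pre_hits = search_pre("association for computing machinery", PRE_INDEX)
post_hits = search_post(["association", "computing", "machinery"], POST_INDEX)
print(pre_hits)   # ['doc_a']
print(post_hits)  # ['doc_a', 'doc_b']  (doc_b is a false positive)
```

In this sketch doc_b is returned by the post-coordinated search because all three terms occur somewhere in its index, even though they come from two unrelated descriptors.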
Information Retrieval (IR) has tended towards automatically generated
free text index terms (post-coordinated), weighted by statistical
frequency of terms in documents and collections. By contrast, the
distinguishing feature of a semantic index is that semantic
relationships exist between controlled index terms, usually (but not
necessarily) the result of manual cataloguing. Semantically indexed
hypermedia links are, by definition, computed, corresponding to
Intensional-Retrieval links [DeRose
1989]. This allows the possibility of flexible query-based navigation
tools.
The semantic index approach employs a set of semantic relationships
between index terms, following the well established thesaurus tradition in
information science (ISO 2788, ISO 5964). A large number of thesauri
exist, covering a variety of subject domains, for example the Medical
Subject Headings [MeSH
1999] and the Art and Architecture Thesaurus [AAT
1999]. Classification systems, such as Dewey Decimal or Library of
Congress, focus on hierarchical relationships. These controlled
vocabularies are part of standard cataloguing practice in libraries and
museums and are now being applied to digital hypertexts via thematic
keywords in metadata resource descriptors. For example, the Dublin Core [DC
1999] standard metadata set includes elements for Title, Creator,
Date, Format, etc. in addition to the more complex notion of the Subject
(or theme) of a resource. Guidelines recommend that, where possible, the
Subject element be taken from a relevant controlled vocabulary. Links
between concepts in the subject domain can be expressed by the semantic
relationships in a thesaurus. The three main thesaurus relationships are
Equivalence (equivalent terms), Hierarchical (broader/narrower terms), and
Associative (more loosely Related Terms). Sometimes specialisations of the
three main relationships are included (for example distinguishing
taxonomic and instance hierarchical relationships). Following a minimalist
approach to semantic modelling by restricting the set of relationships
permits interoperability of cataloguing/retrieval tools and techniques. It
also facilitates automated reasoning over this core set of relationships.
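As a rough illustration of how small the core relationship set is, the following Python sketch holds the three relationship types in one structure and reasons over the hierarchical relationship by transitive closure. The class, its method names, and the furniture terms are assumptions made for this example; they are not taken from any of the thesauri cited above.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus holding the three core relationship types."""

    def __init__(self):
        self.preferred = {}               # Equivalence: entry term -> preferred term
        self.broader = defaultdict(set)   # Hierarchical: term -> broader terms (BT)
        self.narrower = defaultdict(set)  # Hierarchical: term -> narrower terms (NT)
        self.related = defaultdict(set)   # Associative: term <-> related terms (RT)

    def add_equivalent(self, preferred, synonym):
        self.preferred[preferred] = preferred
        self.preferred[synonym] = preferred

    def add_hierarchy(self, broader, narrower):
        # BT and NT are inverses, so both directions are recorded together.
        self.broader[narrower].add(broader)
        self.narrower[broader].add(narrower)

    def add_related(self, a, b):
        # The associative relationship is treated as symmetric.
        self.related[a].add(b)
        self.related[b].add(a)

    def all_narrower(self, term):
        """Transitive closure of NT: every term below `term` in the hierarchy."""
        found, stack = set(), [term]
        while stack:
            for child in self.narrower.get(stack.pop(), ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


# Invented terms, for illustration only.
t = Thesaurus()
t.add_hierarchy("seating", "chairs")
t.add_hierarchy("seating", "sofas")
t.add_hierarchy("chairs", "armchairs")
t.add_equivalent("sofas", "couches")
t.add_related("chairs", "upholstery")

print(sorted(t.all_narrower("seating")))  # ['armchairs', 'chairs', 'sofas']
print(t.preferred["couches"])             # sofas
```

Because only three relationship types are involved, a traversal routine such as all_narrower can be written once and reused with any vocabulary that follows the same conventions.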
Navigation is provided indirectly by queries to the semantic index
space, as opposed to directly following explicit links between information
items. The queries can be simple or complex. The conventional hypermedia
navigation techniques may be implemented by relatively simple queries [Tudhope
1994], although there would be no particular reason to use a semantic
index to achieve that functionality. One additional possibility provided
by a semantic index space is an organised set of browsable concept
descriptors, as a means of comprehending the associated layer of media
items [Bruza
1990], [Pollard
1993]. The user can browse the index space, 'beam down' to view media
items of interest, and conversely 'beam up' to the index space from media
items. Additionally, when index terms are combined, the user may browse
around each term, broadening and narrowing the specificity of description
and seeing the effect on likely 'hits' [Pollitt
1997]. Alternatively, the combined terms can be considered as locating
a position in a 'hyperindex', permitting a string of terms to be broadened
or narrowed in one navigation action [Bruza
1990]. If a user enters a set of query terms as opposed to browsing
the index space, equivalence relationships permit a broad entry vocabulary
of synonyms to be tied together for retrieval purposes, without the user
having to specify the exact term employed for indexing. As a simple
example, this document is indexed by a set of controlled vocabulary terms
from the ACM Computing Classification [ACM
1998] (see Categories and Subject Descriptors above). In the ACM
Digital Library pages, explicit hypertext links can be navigated. In
addition, controlled vocabulary index terms can be combined with free text
terms when searching the library and the hypertext version of the
classification can be browsed as a subject index in order to select terms
for searching.
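The entry vocabulary and the broadening of a term can be sketched in a few lines of Python. All term labels, item identifiers, and function names below are hypothetical, and the sketch assumes that a search on a term also retrieves items indexed by its narrower terms, which is one possible design rather than a rule stated in the work cited above.

```python
# Hypothetical data: term labels and item identifiers are invented.
ENTRY_VOCABULARY = {"couches": "sofas", "settees": "sofas"}  # synonym -> preferred
NARROWER = {"seating": ["chairs", "sofas"], "chairs": ["armchairs"]}
BROADER = {"chairs": "seating", "sofas": "seating", "armchairs": "chairs"}
ITEM_INDEX = {
    "item1": {"sofas"},
    "item2": {"armchairs"},
    "item3": {"chairs"},
}


def to_preferred(term):
    """Map an entry term (possibly a synonym) to the preferred indexing term."""
    return ENTRY_VOCABULARY.get(term, term)


def expand_down(term):
    """The term plus everything narrower than it."""
    result = {term}
    for child in NARROWER.get(term, []):
        result |= expand_down(child)
    return result


def hits(term):
    """Items indexed by the term or by any narrower term (an assumed policy)."""
    wanted = expand_down(to_preferred(term))
    return sorted(item for item, terms in ITEM_INDEX.items() if terms & wanted)


def broaden(term):
    """One navigation step up the hierarchy; a top term is returned unchanged."""
    preferred = to_preferred(term)
    return BROADER.get(preferred, preferred)


print(hits("couches"))           # ['item1']
print(hits(broaden("couches")))  # ['item1', 'item2', 'item3']
```

The user's entry term "couches" is never used for indexing in this sketch, yet it retrieves item1 through the equivalence relationship, and a single broadening step shows how the likely hits grow.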
Beyond this, the inclusion of semantic information in the index space
provides the opportunity for knowledge-based hypermedia systems that
provide intelligent navigation support and retrieval, with the system
taking a more active role in the navigation process than relying on manual
browsing alone. For example, rules governing permitted combinations of
terms can filter a user's possible navigation options [Arents
1993], [Rada
1993]. Work at the University of Glamorgan explores the potential of
reasoning over the semantic relationships in the index space. Traversal of
transitive relationships makes possible imprecise matching between query
and media item, or between two media items, rather than relying on an
exact match of controlled vocabulary terms [Tudhope
1997]. Expanding terms offers an augmented browsing capacity based on
measures of distance in the semantic index space. Results can be
post-processed for expression in a particular retrieval tool. Various
possibilities exist for indirect computed links with such hybrid
query/navigation tools [Cunliffe
1997]. For example, information items with semantically close terms
can be ranked in the result or destination set, or the system might
automatically suggest terms to be considered for inclusion in a query. If
facets exist for time and place in the index space, then a result set can
be returned as a dynamic guided tour based on temporal or spatial
relationships (or indeed other orderings). Alternatively, the focus of a
user's navigation can remain in the document (media) space, typically
requiring less cognitive overhead than constructing a formal query [Marchionini
1995]. In this case, having found an information item of interest, the
navigation action consists of requesting "More items like this one", with
the system responsible for a (best-match) similarity measure of the item's
index terms. At the cost of greater cognitive demand on the user, the
source context for the navigation may be modified and particular media
items or terms (de)emphasised (cf. relevance feedback techniques in IR).
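A "More items like this one" action of this kind can be sketched as follows. The sketch takes semantic distance to be the cheapest traversal cost between two terms and uses a simple best-match average; the relationship costs, the cut-off, the scoring formula, and all terms and items are assumptions made for illustration, and the sketch is not intended to reproduce any published measure.

```python
import heapq

# Assumed traversal costs per relationship type (illustrative values only).
COST = {"BT": 1.0, "NT": 1.0, "RT": 2.0}

# Hypothetical thesaurus fragment: term -> list of (relationship, neighbour).
GRAPH = {
    "seating": [("NT", "chairs"), ("NT", "sofas")],
    "chairs": [("BT", "seating"), ("NT", "armchairs"), ("RT", "upholstery")],
    "sofas": [("BT", "seating")],
    "armchairs": [("BT", "chairs")],
    "upholstery": [("RT", "chairs")],
    "ceramics": [],
}

# Hypothetical media items and their index terms.
ITEMS = {
    "item1": ["armchairs", "upholstery"],
    "item2": ["sofas"],
    "item3": ["ceramics"],
    "item4": ["chairs"],
}


def distance(source, target, limit=4.0):
    """Cheapest traversal cost from source to target, or None beyond the limit."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, term = heapq.heappop(queue)
        if term == target:
            return cost
        if cost > best.get(term, float("inf")):
            continue  # stale queue entry
        for relation, neighbour in GRAPH.get(term, []):
            new_cost = cost + COST[relation]
            if new_cost <= limit and new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return None


def closeness(term_a, term_b, limit=4.0):
    """Score in [0, 1]: 1 for an exact match, 0 when the terms are out of range."""
    d = distance(term_a, term_b, limit)
    return 0.0 if d is None else 1.0 - d / (limit + 1.0)


def similarity(source_terms, item_terms):
    """Best match: average, over source terms, of the closest item term."""
    total = sum(max(closeness(s, t) for t in item_terms) for s in source_terms)
    return total / len(source_terms)


def more_like_this(item_id):
    """Rank the other items by similarity, dropping those with no close term."""
    source_terms = ITEMS[item_id]
    scores = {
        other: similarity(source_terms, terms)
        for other, terms in ITEMS.items()
        if other != item_id
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [other for other in ranked if scores[other] > 0]


print(more_like_this("item1"))  # ['item4', 'item2']
```

None of the returned items shares an index term with item1, so an exact match on controlled terms would have found nothing; the ranking arises entirely from traversing the relationships. Changing the assumed costs or the cut-off is one way the source context could be (de)emphasised.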
Semantically based retrieval underpins diverse efforts to provide
access to distributed multimedia resources, such as the many projects
involving SGML (XML) and Z39.50 for networked access to cross-platform
information. Major efforts are underway to create subject-based gateways
to Internet resources, sometimes combining manually indexed and robot
harvested metadata. The W3C Recommendation for a 'machine-understandable'
Resource Description Framework supports the thrust of this research [Lassila
1999]. An RDF descriptor might include the Dublin Core element,
Subject, specifying a classification or thesaurus to which keywords
belong. Precise semantic index retrieval tools will be required to provide
a manageable set of results to requests that may span several collections
[Doerr
1997], and may involve networked terminology servers and more than one
thesaurus or classification. One point worth emphasising is the social
dimension to access and the link with existing cataloguing practice.
Controlled vocabularies are often the result of standards efforts in
subject domains, continue to evolve, and are part of a network of practice
and education/training in the information science community. They have the
potential to act as a bridge between information provider and seeker, "a
semantic road map for searchers and indexers" [Soergel
1995], if tools can be devised that visualise their structure and how
they may be used.
A number of key issues for research remain if the potential of
significant gains in precision of information access is to be realised.
- An advantage of building query functionality into hypertext
navigation is a smooth transition between querying and browsing. Can we
identify the appropriate extent of cognitive effort demanded by
interfaces to navigation tools? How far should the internal workings of
matching functions or the detail of the underlying semantic network be
brought to the surface?
- Some applications may lend themselves to the specialisation of the
standard thesaurus relationships into richer sets, particularly the
associative relationship. For example, in some situations it may be
useful to distinguish various kinds of causal relationships from the
generic associative relationship.
- The problem of expressing similarity between pre-coordinated strings
of semantic index terms needs further investigation. How much should be
pre-computed and what can be left to dynamic computation? How best can
we express syntax or structure in such strings? This effort converges
with work on description logic ontologies [Bullock
1998], [Weinstein
1998].
- Various efforts attempt to combine statistical IR and semantic
controlled vocabulary approaches. For example, Agosti et al. [Agosti
1995] propose a three layer architecture for Hypermedia IR systems
combining a statistical index layer and a semantic (thesaurus) layer
(see also [Aslandogan
1997], [Chiaramella
1996]). Studies of online searching behaviour have investigated
conditions influencing choice of free text or controlled vocabulary
terms (e.g. [Fidel
1991]). How should the two approaches be best integrated - should
they be seen as different components of a toolkit, or should a matching
function incorporate both statistical weighting and semantic measures?
In addition, indirect semantic links and explicit authored links will
soon be combined in link/search engines. What principles should guide
this integration?
- The semantic interoperability of overlapping but different thesauri
is an important issue for remote access to distributed sets of resources
employing controlled vocabularies in metadata. A concept may exist in
one vocabulary but not another, or may map (partially) to various
concepts.
[AAT
1999] Art and Architecture Thesaurus Browser, [Online:
http://shiva.pub.getty.edu/aat_browser/], 1999.
[ACM
1998] ACM Computing Classification.
http://www.acm.org/class/1998/
[Agosti
1995] Maristella Agosti, Massimo Melucci, and Fabio Crestani.
"Automatic Authoring and Construction of Hypermedia for Information
Retrieval" in ACM Multimedia Systems, 3(1), 15-24, 1995.
[Arents
1993] Hans C. Arents and Walter F. L. Bogaerts. "Navigation
without Links and Nodes without Contents: Intensional Navigation in a
Third-Order Hypermedia System" in Hypermedia, 5(3), 187-204, 1993.
[Aslandogan
1997] Y. Alp Aslandogan, Chuck Thier, Clement T. Yu, Jon Zou,
and Naphtali Rishe. "Using Semantic Contents and WordNet in Image
Retrieval" in Proceedings of ACM SIGIR '97, 286-295, 1997.
[Berners-Lee
1998a] Tim Berners-Lee. World Wide Web Design Issues: A
Roadmap to the Semantic Web, [Online:
http://www.w3.org/DesignIssues/Semantic.html], 1998.
[Bruza
1990] Peter Bruza. "Hyperindices: A Novel Aid for Searching
in Hypermedia" in Proceedings of the ACM European Conference on Hypertext
'90 (ECHT '90), Versailles, France, 109-122, November 1990.
[Bullock
1998] Joseph Bullock and Carole Goble. "TourisT: The
Application of a Description Logic based Semantic Hypermedia System for
Tourism" in Proceedings of ACM Hypertext '98, Pittsburgh PA, 132-141, June
1998.
[Chiaramella
1996] Yves Chiaramella and Ammar Kheirbek. "An Integrated
Model for Hypermedia and Information Retrieval" in Information Retrieval
and Hypertext, Maristella Agosti and Alan Smeaton (editors), Kluwer,
139-178, 1996.
[Collier
1987] George Collier. "Thoth-II: Hypertext with Explicit
Semantics" in Proceedings of ACM Hypertext '87, Chapel Hill, NC, 269-289,
November 1987.
[Cunliffe
1997] Daniel Cunliffe, Carl Taylor, and Douglas Tudhope.
"Query-based Navigation in Semantically Indexed Hypermedia" in Proceedings
of ACM Hypertext '97, Southampton, UK, 87-95, April 1997.
[DC
1999] Dublin Core. [Online:
http://purl.org/metadata/dublin_core], 1999.
[DeRose
1989] Steven J. DeRose. "Expanding the Notion of Links" in
Proceedings of ACM Hypertext '89, Pittsburgh, PA, 249-257, November 1989.
[Doerr
1997] Martin Doerr, Irene Fundulaki and Vassilis
Christophidis. "The Specialist Seeks Expert Views: Managing Digital
Folders in the AQUARELLE Project" in Proceedings of Museums and the Web,
David Bearman and Jennifer Trant (editors), 261-270, 1997.
[Fidel
1991] Raya Fidel. "Searchers' Selection of Search Keys
(I-III)" in Journal of the American Society for Information Science, 42(7),
490-527, 1991.
[Frisse
1989] Mark E. Frisse and Steven B. Cousins. "Information
retrieval from hypertext: Update on the Dynamic Medical Handbook" in
Proceedings of ACM Hypertext '89, Pittsburgh, PA, 199-211, November 1989.
[Lassila
1999] Ora Lassila and Ralph Swick (editors). "Resource
Description Framework (RDF) Model and Syntax Specification", World Wide Web
Consortium Recommendation, [Online: http://www.w3.org/TR/REC-rdf-syntax/],
February 22 1999.
[Marchionini
1995] Gary Marchionini. Information Seeking in Electronic
Environments. Cambridge University Press, 1995.
[MeSH
1999] MeSH 1999. Medical Subject Headings homepage.
http://www.nlm.nih.gov/mesh/meshhome.html
[Nanard
1991] Jocelyne Nanard and Mark Nanard. "Using structured
types to incorporate knowledge in hypertext" in Proceedings of ACM
Hypertext '91, San Antonio, TX, 329-344, December 1991.
[Pollard
1993] Richard Pollard. "A hypertext-based thesaurus as a
subject browsing aid for bibliographic databases" in Information
Processing and Management, 29(3), 345-357, 1993.
[Pollitt
1997] Steven Pollitt, Martin P Smith and Patrick A J
Braekevelt. "View-based Searching Systems" in Proceedings of Joint
Workshop of BCS IR and HCI Specialist Groups, (Johnson and Dunlop eds.)
73-77.
[Rada
1993] Roy Rada, Weigang Wang, and Alex Birchall. "Retrieval
hierarchies in hypertext" in Information Processing and Management, 29(3),
359-371, 1993.
[Schnase
1993] John L. Schnase, John J. Leggett, David L. Hicks, and
Ron L. Szabo. "Semantic Data Modeling of Hypermedia Associations" in ACM
Transactions on Information Systems (TOIS), 11(1), 27-49, January 1993.
[Soergel
1995] Dagobert Soergel. "The Art and Architecture Thesaurus
(AAT): a critical appraisal" in Visual Resources, 10(4), 369-400, 1995.
[Trigg
1986] Randall H. Trigg and Mark Weiser. "Textnet: A
Network-based Approach to Text Handling" in ACM Transactions on Office
Information Systems (TOIS), 4(1), 1-23, January 1986.
[Tudhope
1994] Douglas Tudhope, Paul Beynon-Davies, Carl Taylor, and
Chris B. Jones. "Virtual Architecture Based on a Binary Relational Model:
A Museum Hypermedia Application" in Hypermedia, 6(3), 174-192, 1994.
[Tudhope
1997] Douglas Tudhope and Carl Taylor. "Navigation via
Similarity: Automatic Linking Based on Semantic Closeness" in Information
Processing and Management, 33(2), 233-242, 1997.
[van
Rijsbergen 1979] C. J. "Keith" van Rijsbergen. Information
Retrieval. Butterworth, 1979.
[Weinstein
1998] Peter C. Weinstein. "Ontology-based metadata:
transforming the MARC legacy" in Proceedings of ACM Digital Libraries '98,
254-263, 1998.
Permission to make digital or hard
copies of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for
profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work
owned by others than ACM must be honored. Abstracting with credit is
permitted. To copy otherwise, to republish, to post on servers, or to
redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from Publications Dept, ACM Inc., fax +1 (212)
869-0481, or permissions@acm.org.