Ordering Author and Work Records:
An Evaluation of Collocation in
Online Catalog Displays
[Published in Journal of the
American Society for Information Science, 47, 7 (July 1996): 538-554.]
Allyson Carlyle
E-mail: acarlyle@kentvm.kent.edu
216/678-3528 (Home)
216/672-2782 (Work)
216/672-7965 (Fax)
ABSTRACT
To investigate the extent to which online catalogs arrange
together, or collocate, records representing particular authors and works, a
survey compared the displays resulting from five author and five work queries
in eighteen online catalogs. Dependent
variables to measure collocation included the number of times irrelevant
records were interfiled among relevant records.
Searches for worst-case authors and works associated with large
retrieval sets, including "Homer" and "Paradise Lost,"
revealed the effects of Boolean versus string matching, query type, and catalog
size on the collocation of relevant records.
Results of the survey showed that string matching collocated relevant
records more successfully than Boolean matching, that author records were
collocated more successfully than work records, and, surprisingly, that catalog
size had only a small effect on collocation.
INTRODUCTION
Author
and work searching are frequently considered to be non-problematic aspects of
online catalog use. However, this is not
necessarily true. Consider the following
scenario. A user of UCLA's online
catalog ORION is interested in checking out a textual edition of the Bible.
This user selects ORION's Boolean keyword
"find title" search using the term "bible" as a query
term. A title search on
"bible" retrieves approximately 18,000 bibliographic records in
ORION, including a record for the book Animals
of the Bible by Isaac Asimov, a sound recording of Genesis read by Judith Anderson, and a work called Woman's Bible for Survival in a Violent
Society by Thomas P. McGurn. These records are scattered among those for
various textual editions of the Bible. The first record for a textual edition of the
Bible appears only after
approximately 150 other records have been displayed; in fact, a user would have
to look through well over 1,000 records to find the first significant grouping
of such editions. The ability of even an
experienced catalog user to find items relevant to a query is compromised by
large retrieval sets where many records are retrieved and irrelevant records
are scattered intermittently among relevant ones. Clearly, author and work searching are, at
times, quite as problematic as subject searching.
Displays
that alleviate the large retrieval set problem by facilitating the
identification and use of relevant records are particularly crucial in
information retrieval (IR) systems such as online catalogs, which serve users
whose understanding of the search process is limited or whose information needs
are not well-defined. Research suggests
that the large retrieval set problem is prevalent in online catalogs and that
long displays confuse and discourage users (for example, Wiberley,
Daugherty, & Danowski, 1990). As illustrated in the Bible example, one of the reasons that large retrieval sets may be
confusing is that irrelevant records may be scattered intermittently among
relevant ones. An international
cataloging standard addressing this problem requires that catalogs arrange and
display together (collocate)
records representing the same author, work, or subject. Displays meeting this requirement have been
mandated because catalogers believe they help make the organization and content
of retrieved record sets clear and thus provide users with a means of surveying
the range of records retrieved in a large set quickly and efficiently.
This
paper examines variables affecting the collocation of records for particularly
problematic authors and works retrieved in large record sets in online
catalogs. The records studied represent
"worst-case" authors and works, that is, authors and works associated
with a large number of relevant records.
Worst
cases were used because those cases retrieving large numbers of records, both
relevant and irrelevant, are more likely to be susceptible to arrangement
problems in display and, as a consequence, illustrate them more clearly than
searches retrieving few records. Because
the purpose of research such as this is to improve online systems, it makes
sense to direct our attention toward the most problematic aspects of those systems. Worst-case queries are also more likely to be
performed in online catalogs than other types of queries. Authors and works that appear in many
editions do so precisely because publishing companies perceive a demand for
them. Research by Nelson (1988)
discovered a correlation between the number of times a term was indexed and the
number of times it was used. Solomon
(1993) found that children use a small number of terms frequently; 100 terms
accounted for over 50 percent of the 1,210 terms children used in his study of
online catalog use.
Authors
and works were selected as query types because they have a distinct advantage
over subjects in that the relevance of a particular record to a query may be
posited with greater certainty. This
control over the relevance problem in retrieval frees the research somewhat
from the problems engendered by user relevance judgments.
BACKGROUND
Online
catalogs have frequently been criticized as being confusing and hard to
use. One of the reasons for this may be
that users are confronted with features that vary from catalog to catalog. Borgman identified
interface complexity as one of the reasons that online catalogs are hard to use
(1986, p. 393). Research on online
catalog use supports this assertion. Dalrymple compared user experiences in online and card
catalogs and found that card catalog users were more satisfied with their
search results and reformulated their searches less frequently than online
catalog users (1990). The 1983 Council
on Library Resources (CLR) online catalog study found that user problems
centered around "search formulation control and
output (display) control" (Matthews, Lawrence, & Ferguson, 1983, p. 123). In focus group research by Johnson and Connaway, one participant complained, "I hate the many
different ways there are to search for an author" (1992, p. 11).
Long displays resulting
from large retrieval sets may be one of the major contributors to user
confusion and frustration. Users
surveyed in the CLR study ranked "scanning through a long display"
fifth out of 27 reported problems and "understanding a display of multiple
items" 24th (Matthews, Lawrence, & Ferguson, 1983, p. 124). A transaction log analysis by Wiberley, Daugherty, and Danowski
showed that user persistence dropped off sharply when the number of retrieved
records exceeded 200; many users did not look beyond the first 30-35 records
(1995, pp. 262-263). Dwyer, Gossen, and Martin studied interlibrary loan requests for
items actually held by the requesting library.
They attributed six percent of failed searches to what they called the
"big-hit list" problem; that is, the title sought was
displayed in incomplete form within a long list of
similar-looking titles (1991, p. 232).
"Too many matches" was cited as a reason for not using online
catalogs in a catalog use study by Pease and Gouke
(1982, p. 288).
The
understandability of large retrieval sets may be compromised because records
are sometimes arranged in random (internal record number) order. Random arrangements are common in response to
Boolean searches. Because Boolean
searches often allow matches across multiple fields, no single field, for
example, an author field or a title field, stands out as a logical element by
which to arrange records. Research
suggests that record arrangements do have an effect on use. Abels (1993) found
that users scanned random record arrangements more slowly than records arranged
by author name, date, or subject heading.
Purgailis Parker and Johnson (1990)
investigated the effect of ordering and size of retrieved record sets on users'
relevance judgments. They found that
record order had no effect when record sets were composed of fewer than fifteen
records, but their results were inconclusive when sets exceeded fifteen
records.
Attig (1989), Duke (1989), and Delsey
(1989) identified ways in which computer technology has obstructed collocation
in online catalogs. Although collocation
of relevant records is a standard underlying the construction of both card and
online catalogs, no research has ever been undertaken to investigate to what
extent catalogs, card or online, actually achieve it. One of the reasons for the scarcity of this
type of research may be the difficulty in defining terms. Carpenter (1981) discussed the history of the
term "author" in cataloging codes and illustrated many of the
difficulties inherent in defining it.
"Work" has never been defined in a code of American cataloging rules,
although individuals have attempted to define it conceptually outside the codes
(Lubetzky, 1969; Svenonius,
1988; Wilson, 1989; O'Neill & Vizine-Goetz, 1989;
Smiraglia, 1992, pp. 8-9; Yee, 1995). Another reason for the lack of research on
collocation may be the difficulty in developing a means to measure it. Collocation has various aspects, making the
use of a single measure less than satisfactory.
SOLUTIONS TO DISPLAY PROBLEMS
Two major design
principles have been used in library catalogs to increase the comprehensibility
of retrieval sets. Record ranking based
on relevance to a query has been used in IR systems with the intention of
showing users first those records most likely to be relevant to their
queries. One online catalog using this
approach is OKAPI, developed at the Polytechnic of Central London (Walker &
de Vere, 1990).
Record arrangements based on relevance ranking may be particularly
useful for specific, well-defined subject queries because they attempt to
predict relevance based on the relationship between subject words in a query
and subject words in a record and thus produce meaningful record
orderings. The purpose of ranked
displays is to save the time of users by displaying first the records that best
match their searches. Queries for
authors and works, however, do not lend themselves so easily to record
ranking. On what basis is a retrieval
system to judge which of several identical author names or titles is the
"most" relevant to a particular query? Although it makes sense to base retrieval on the probability of a record's relevance to
a query for an author or a work, it is questionable that ranked orderings would be particularly helpful; it is more
questionable still that such orderings would be comprehensible to users.
The
collocation standard, mentioned above, was developed to increase the
comprehensibility of retrieval sets. It
stipulates that relevant author, work, and subject records be arranged and
displayed together, one after another and without interruption by irrelevant
records (Cutter, 1904, p. 12; Lubetzky, 1960, p. ix;
International Federation of Library Associations, 1971, p. xiii). A rationale underlying this standard, in
which groupings of relevant records are arranged with other groupings in
alphabetical order, is that it is not possible to
predict which of any particular group may be the most useful. Displays that collocate
groups of relevant records may be useful to users whose information needs
represent broad topics or whose needs are not well-defined, or to users whose
queries are poorly or incompletely formulated.
As Mann so aptly puts it in his discussion of the advantages of The Library of Congress Subject Headings,
"...we can sharpen questions that are unfocused
or fuzzy to begin with--a frequent
starting point for readers--by examining the array of precoordinated
subdivisions under a heading that spell out and distinguish its various aspects
in ways that we couldn't think of in advance; that is, the system will clarify
our range of options for us, thereby enabling users to ask better questions in the first place [italics in original]"
(1993, p. 123). The purpose of
collocation in display is thus somewhat different from that of displays based
on record ranking in that the aim of collocated displays is to provide users with
an overview, or picture, of the entire content of a retrieved record set. This concept has also appeared, although in
somewhat different form, in IR system research.
Korfhage (1991) articulated the importance of
presenting users with an overview in IR systems and proposed a model shifting
the emphasis in IR system design from retrieval to display.
RESEARCH DESIGN
This
research investigated the effect of catalog variables on collocation by
performing author and title queries on five worst-case authors and five worst-case works in
eighteen different online catalogs. The
following questions were asked: What is
the effect of the type of match, Boolean versus character-string match, on
collocation of particular worst-case author and work records in current online
catalogs? What is the effect of type of
query, author versus work, on collocation of relevant worst-case author and
work records? What is the effect of
catalog size on collocation of particular worst-case author and work records? To address these questions, "work"
and "author" needed to be defined and measures of collocation
developed.
Operational Definitions
The MARC record communications
format (MARC Formats for Bibliographic
Data, 1980) determines the structure of bibliographic records in online
catalogs and was used to operationalize the concepts
"author" and "work."
"Author" was operationally defined as the set of records
containing the same author name in a MARC personal author field (personal
author fields include 100, 400, 600, 700, or 800 MARC fields). For example, a record set representing the
author William James was comprised of records having in an author field the
field contents: [subfield a] James,
William, [subfield d] 1842-1910. Other
author subfields, if present, were disregarded.
"Work" was somewhat more
difficult to operationalize because of the ambiguous
nature of the term. What records should
be included in the work record set? For
instance, should the Pop-Up Christmas
Carol, a children's pop-up book version of Dickens' A Christmas Carol, be considered to be an
edition of the Dickens work? The extent
to which the items in a work set must resemble each other is debatable and as a
result "work" has been defined by cataloging theorists in different ways. Furthermore, a work is usually identified by
both an author name and a title and while the Anglo-American Cataloguing Rules (AACR2) requires author names to
be normalized, normalization of titles is optional (AACR2, p. 484). For example,
one edition of John Milton's Paradise Lost may be entitled "
In this research, "work"
was defined in two different ways to accommodate two competing
perspectives. First, "work"
was defined narrowly to represent a set of records that share the same primary
author and title MARC field contents (work record set). For example, a record representing a textual
edition of Paradise Lost by John
Milton published by Norton Publishers and one representing a textual edition
published by Rineholt Publishers, which share the
same primary author and title MARC field contents, were considered to be
members of the same work record set. A
primary author field was defined as a 100 MARC field and a primary title field
was defined as a 240 or a 245 MARC field.
Second,
"work" was defined broadly to represent a set of records that may not
share both primary author and title fields, but may still be relevant to a
query for a work (superwork record set). A superwork record
set contains the records of a work record set and, in addition, the set of records related
to a work in that they contain the same author and title field contents, but in
secondary author and title fields as opposed to primary author and title
fields. So, for example, the record for
a vocal score for a symphonic poem based on Paradise
Lost by Marco Enrico Bossi
was considered to be a member of the related work record set for Paradise Lost because it contains the
author name "John Milton" and the title "Paradise Lost" in
a related work field. Secondary
author/title fields included 600 and 700 fields; a secondary title field was the
740 MARC field. For further discussion
of the operational definitions, see Carlyle (1994, Chap. 2).
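The set definitions above can be read as membership tests on MARC fields. The following is a minimal sketch only, not the procedure used in the study: the `Field` class, the exact-string comparison, and the handling of the 740 field are assumptions made for illustration, and the score title is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional

# MARC tags named in the operational definitions above.
PERSONAL_AUTHOR_TAGS = {"100", "400", "600", "700", "800"}
PRIMARY_AUTHOR_TAG = "100"
PRIMARY_TITLE_TAGS = {"240", "245"}
SECONDARY_AUTHOR_TITLE_TAGS = {"600", "700"}
SECONDARY_TITLE_TAG = "740"


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified MARC field: a tag plus name and/or title content."""
    tag: str
    name: Optional[str] = None
    title: Optional[str] = None


Record = List[Field]


def in_author_set(record: Record, name: str) -> bool:
    """Author set: the name occurs in any personal author field."""
    return any(f.tag in PERSONAL_AUTHOR_TAGS and f.name == name for f in record)


def in_work_set(record: Record, name: str, title: str) -> bool:
    """Work set: name in the primary author field and title in a primary title field."""
    has_author = any(f.tag == PRIMARY_AUTHOR_TAG and f.name == name for f in record)
    has_title = any(f.tag in PRIMARY_TITLE_TAGS and f.title == title for f in record)
    return has_author and has_title


def in_superwork_set(record: Record, name: str, title: str) -> bool:
    """Superwork set: the work set plus records matching in secondary fields."""
    if in_work_set(record, name, title):
        return True
    # Assumption: a secondary match needs name and title in the same 600/700 field.
    if any(f.tag in SECONDARY_AUTHOR_TITLE_TAGS and f.name == name and f.title == title
           for f in record):
        return True
    # Assumption: a 740 title counts only if the name also occurs in an author field.
    return in_author_set(record, name) and any(
        f.tag == SECONDARY_TITLE_TAG and f.title == title for f in record)


# Toy records; field contents are simplified and the score title is a placeholder.
edition = [Field("100", name="Milton, John"), Field("245", title="Paradise lost")]
score = [Field("100", name="Bossi, Marco Enrico"),
         Field("245", title="(title of the score)"),
         Field("700", name="Milton, John", title="Paradise lost")]
```

Under these assumptions the textual edition belongs to both the work and superwork sets, while the Bossi score belongs only to the superwork set.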
Dependent Variables
Collocation,
which has to do with the position of the members of a set of relevant records
in a display of a retrieved record set, can be measured in several different
ways. Consider that a set of fifty Paradise Lost records may be preceded in
a display by ten unrelated or irrelevant records, followed by five irrelevant
records, or interrupted six times by the display of seventeen irrelevant
records. Four measures were used in this
research:
• interruption number: the number of times the display of relevant
records was interrupted by irrelevant records, including an interruption
preceding the display of the first relevant record and one following the
display of the last relevant record,
• intervening records: the number of irrelevant records that were
interspersed among the relevant record set; that is, the number of
irrelevant records following the first relevant record and preceding the
last relevant record,
• average interruption size: the average size of the interruptions; that
is, the ratio of the total number of irrelevant records retrieved to the
total number of interruptions, and
• precision: the ratio of relevant records to records retrieved.
An example is given in Figure 1.
[Fig. 1 about here]
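The four measures can also be stated computationally. The sketch below is illustrative and is not the instrument used in the survey; it assumes a display is represented as a list of relevance flags in display order (True for a relevant record), and the function and class names are hypothetical. The sample display is invented to exercise the definitions.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Sequence


@dataclass(frozen=True)
class CollocationMeasures:
    interruption_number: int
    intervening_records: int
    average_interruption_size: float
    precision: float


def collocation_measures(display: Sequence[bool]) -> CollocationMeasures:
    """Compute the four measures for one display (True = relevant record)."""
    flags = [bool(x) for x in display]
    if not flags:
        raise ValueError("display must contain at least one record")

    # Each maximal run of irrelevant records is one interruption, including
    # a run before the first relevant record and one after the last.
    irrelevant_runs = [len(list(run)) for is_relevant, run in groupby(flags)
                       if not is_relevant]
    interruptions = len(irrelevant_runs)
    irrelevant_total = sum(irrelevant_runs)
    relevant_total = len(flags) - irrelevant_total

    # Intervening records: irrelevant records between the first and last
    # relevant record. Assumption: zero when nothing relevant was retrieved.
    if relevant_total:
        first = flags.index(True)
        last = len(flags) - 1 - flags[::-1].index(True)
        intervening = flags[first:last + 1].count(False)
    else:
        intervening = 0

    # Assumption: average interruption size is 0.0 for an uninterrupted display.
    average_size = irrelevant_total / interruptions if interruptions else 0.0
    return CollocationMeasures(interruptions, intervening, average_size,
                               relevant_total / len(flags))


# Invented display: 2 irrelevant, 2 relevant, 1 irrelevant, 1 relevant,
# 3 irrelevant, 1 relevant, 1 irrelevant.
sample = [False, False, True, True, False, True, False, False, False, True, False]
result = collocation_measures(sample)
```

For this invented display the sketch counts four interruptions, four intervening records, an average interruption size of 1.75 (seven irrelevant records over four interruptions), and a precision of 4/11.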
Interruption number and
intervening records indicate the
extent to which relevant records are scattered in display. McGarry and Svenonius (1991) first used interruption number as a measure in their study of displays of
multiple subject headings. A user
confronted with a display of records that is often interrupted by irrelevant
records may be puzzled and leave a search before finding a record of
interest. The research by Wiberley, Daugherty, and Danowski
indicates that a large interruption preceding the first relevant record
displayed might mean search failure for users even when relevant records were
retrieved because they wouldn't look beyond the first screen when many records
were retrieved. Intervening records indicates how many irrelevant records a user
must pass over in order to see the entire set of relevant records. Again, a user may balk at the irrelevant
records and leave a search before finding records of interest or make the
assumption that the retrieved record set does not contain a particular item of
interest.
On
occasion, displays of irrelevant records continue for one or more screens,
depending on the size of the retrieved record set. As mentioned above, user persistence
decreases as retrieval size increases, and long displays of irrelevant records
may lower persistence even more. Average interruption size, which
measures the average size of irrelevant record interruptions, is thus a vital
measure because it gives an indication of when and how often a particular
retrieval set might cause a user to give up.
Precision is normally regarded as a
measure of retrieval effectiveness and not display effectiveness. However, it was used in this research
because, like a snapshot, it gives a quick indication or picture of the
character of the retrieved record set.
One sees from it how much noise a user must go through to see all the
relevant records. The other collocation
measures by themselves do not give the same kind of overall picture of the
retrieval set--one of the reasons precision has been consistently used in the
evaluation of IR systems. Precision is
included also because it may be used to provide a perspective from which to
view the other, more "true" measures of collocation. For example, when precision is very high, one
would assume that the number of interruptions would be low. However, if one found many interruptions with
high precision, one would be able to state more assuredly that the
interruptions were indeed related to the variable of interest and not simply a
by-product of a retrieval set in which a high percentage of the records
retrieved were relevant.
Independent Variables
Three
independent variables were tested for their effect on collocation: a) match type, that is, Boolean
"and" matching versus string matching; b) query type, that is, author
versus work versus superwork; and c) catalog
size. Boolean and
string matching often result in displays that order records in different ways
(Fig. 2). Boolean "and"
matches separate query terms, look for them independently, and often, so long
as they appear in the same record, retrieve records that have one query term in
one field and one in another. String
matches keep query terms together and retrieve records only if query terms
occur at the beginning of a field.
Boolean searching in particular has been cited as obstructing
collocation in online catalog displays because search results are often
arranged in online catalogs randomly.
Another reason for studying Boolean versus string matching is that, at
present, these two types of matching are the predominant types of matching
available in online catalogs, particularly those available in the
[Fig. 2 about here]
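The difference between the two match types can be sketched in a few lines. This is a simplified illustration, not the matching logic of any surveyed catalog: records are reduced to a dictionary of field labels and contents, normalization is limited to lowercasing and dropping punctuation, and the second and third toy records are invented.

```python
import re
from typing import Dict, List

Record = Dict[str, str]  # field label -> field content (toy stand-in for MARC fields)


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def boolean_and_match(record: Record, query: str) -> bool:
    """Every query term occurs somewhere in the record, in any field."""
    record_terms = {t for value in record.values() for t in tokens(value)}
    return all(term in record_terms for term in tokens(query))


def string_match(record: Record, query: str) -> bool:
    """The query terms, kept together and in order, begin some field."""
    q = tokens(query)
    return any(tokens(value)[:len(q)] == q for value in record.values())


records = [
    {"author": "Milton, John", "title": "Paradise lost"},
    {"author": "Lost, Anna", "title": "Birds of paradise"},          # invented
    {"author": "Smith, Jane", "title": "A study of Paradise lost"},  # invented
]

boolean_hits = [i for i, r in enumerate(records) if boolean_and_match(r, "paradise lost")]
string_hits = [i for i, r in enumerate(records) if string_match(r, "paradise lost")]
```

In this toy set the Boolean "and" match retrieves all three records, including one in which "lost" comes from the author field and "paradise" from the title field, while the string match retrieves only the record whose title begins with the query.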
The
second variable investigated was query type:
author, work, or superwork. As discussed above, cataloging rules treat
these entities differently.
Normalization of author names is required while normalization of work
titles is not. Author record sets are
defined by the contents of a single type of field, author fields, while work
record sets are defined by the contents of two types of fields, author fields
and title fields. The presence of a
second type of field may add a level of complexity to record arrangement that
makes it difficult for catalogs to collocate these records. These issues have to do with record structure, that is, the MARC
fields in which relevant record content (author names and titles) is contained
and the record content itself. In
effect, the investigation of query type is also an investigation of record
structure.
The
last variable studied was catalog size.
The effect of catalog size was evaluated by selecting equal numbers of
large, medium, and small catalogs. A
large catalog was defined as having over 1,000,000 records, a medium catalog
between 300,000 and 1,000,000 records, and a small catalog fewer than 300,000
records. One of the reasons for studying
catalog size was the frequent observation that catalog size is, and should be,
a determiner of catalog design and cataloging rules (see, for example, Kilgour, 1979, pp. 34-35).
One might easily assume that the larger the catalog, the more difficult
it is to achieve good collocation; however, no research has been conducted to
test this assumption, nor has anyone looked at the impact of catalog size on
worst cases.
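The size categories can be written as a simple classification rule. The sketch below is illustrative; the function name is hypothetical, and treating the boundary values of exactly 300,000 and exactly 1,000,000 records as medium is an assumption based on the wording "between 300,000 and 1,000,000".

```python
def catalog_size_class(record_count: int) -> str:
    """Classify a catalog as large, medium, or small by its record count."""
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    if record_count > 1_000_000:
        return "large"
    # Assumption: both boundary values fall in the medium category.
    if record_count >= 300_000:
        return "medium"
    return "small"
```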
Worst-Case Method
A
random sample method is commonly used to assess overall performance of a
system. However, the very nature of an
overall assessment is that it gives all aspects of a system equal
attention. Thus, those areas responsible
for system breakdown would be given as much, or as little, attention as all
other areas. A worst-case method, on the
other hand, focuses attention on system weaknesses, which in turn elicits
information about those aspects of a system that most need improvement. Since collocation has been identified as an
aspect of online catalogs needing improvement, a worst-case method was
selected. Worst-case methods have also
been used to evaluate the performance of various types of systems in
engineering and other applied sciences (see, for example, Nassif,
Strojwas, & Director, 1986).
One
of the difficulties with using a worst-case method in this research was that
very little was known about what constitutes a "worst case" for
collocation in the online catalog environment.
Some types of queries have been identified as problematic for retrieval,
such as queries that retrieve zero hits or queries that retrieve large
retrieval sets, but virtually nothing has been said or is known about queries
that are problematic for display.
Because large retrieval sets may also be assumed to pose problems for
collocation, attributes of queries known to contribute to large retrieval sets,
i.e., queries consisting of a few words that occur frequently in a database or
a few words that are homonyms, were used to identify worst cases for this
research. However, what is it about a
query that makes it a worst case in terms of collocation? Relevant record sets retrieved in large retrieval
sets may or may not be collocated. It
was hypothesized that, in addition to having the attributes that contribute to
large retrieval sets, worst-case record sets were composed of records that
exhibited a wide variety of record structures.
A preliminary taxonomy of record structures was developed to help
identify candidate worst-case queries (see Carlyle, 1994, Appendix 2).
Actual
selection of worst-case queries proceeded as follows. Candidates for worst-case author and work
queries were identified based on a review of examples in AACR2, my own acquaintance with problematic cases, and suggestions
from practicing librarians. Candidate
worst-case queries were then searched in UCLA's ORION. A worst-case query was first defined as being
composed of individual words or names that retrieved 1,000 or more hits in an
ORION search. Queries that retrieved extremely large sets (for example,
10,000 records) were not selected.
Queries that met the 1,000-record criterion
were then pre-searched in several online catalogs. The final list of queries used was determined
by selecting those queries for the research that exhibited a wide variety of
record structures and exhibited poor collocation. Worst-case queries selected included:
Authors                   Works
Homer                     Charles Dickens. A Christmas Carol.
William James             James Joyce. Ulysses.
H.D. (Hilda Doolittle)    John Milton.
Alice Walker              Sir Thomas More. Utopia.
Peter Gray                William Shakespeare. Sonnets.
Ideally,
one would use a profile of a worst-case query to identify worst cases and then
select a random sample of those cases for research such as this. However, so little was known about the
characteristics of worst-case queries that random sampling of this type would
have been difficult and costly. One of
the purposes of this research was to collect information about worst cases so
that a taxonomy of record structures could be fully
developed.
Constants
In studies of operational systems,
unrelated variables may influence the results.
Efforts were made to control for these variables in this research as
much as possible. Database variables
unrelated to size were controlled by selecting, when possible, only those
databases that had the following characteristics:
1) over 75 percent of the library collection was contained in the online catalog
2) the library collection was general in nature
3) the library collection contained primarily English language materials
4) the library collection was located in the
System variables unrelated to
match type were partially controlled by selecting, when possible, a large,
medium, and small sized online catalog designed by the same vendor. Vendor selection was based on two characteristics: (1) high numbers of installations in
libraries (statistics on vendor installation were available in Bridge (1992)),
and (2) availability of catalogs using those vendors via Internet
connections. Record structure variables
that might influence record arrangement or a record's membership in an author,
work, or superwork set were held constant. These variables were the conformance of
records to specifications mandated in AACR2
and the Library of Congress Subject
Cataloging Manual: Subject Headings
(1991). Conformance to cataloging standards
was controlled by dropping records whose non-conformance to standards had an
effect either on record ordering or on a record's membership in an author,
work, or superwork record set.
Data Collection
In
each catalog of the eighteen online catalogs surveyed, the five author names
were searched using all the types of author search permitted,
and the five works/superworks were searched using all
the types of title search permitted.
Data collection parameters are given in Table 1.
[Table 1 about here]
Author searches were not used to
retrieve works/superworks. Where a catalog offered both a
string author search and a Boolean author search, both searches were used. An attempt was made to make author and title
searches across catalogs as comparable as possible, so when Boolean searches
were available that allowed limiting to author or title fields, author searches
were limited to author fields and title searches were limited to title
fields. If the Boolean "and"
was not the default operator, "and" was used in the construction of
the queries to standardize the queries across the catalogs surveyed. Sets retrieved in title searches were
analyzed twice, first to discover the extent of collocation of work record
sets, and then to discover the extent of collocation of superwork
record sets.
An
assumption behind the collocating standard in cataloging history has been that
it applies to authors and works only when searching under an author's name;
that is, it mandates the collocation of the works of an author and the editions
of a work in a display under the name of an author, but not in a display under
title. Thus, in the card catalog,
additional entries were not made for uniform titles, which were used for
arrangement purposes under author name only.
For example, if one looked in a card catalog under the title Paradise Lost, one would not find a
French edition of
RESULTS
Data
were analyzed using descriptive statistics since neither the selection of
worst-case queries nor the sample of online catalogs was random. As sample sizes were relatively small and
standard deviations were large, medians were reported instead of means.
Match Type Results
The
first independent variable tested was type of match performed, Boolean versus
string match. Match type proved to be a
strong determiner of collocation in display, particularly for authors and superworks. String
matching was clearly superior to Boolean matching in collocating author records
(Tables 2, 3). Only ten percent of
string matches were interrupted more than twice by irrelevant records. In addition, when interruptions occurred,
they were small. Boolean matches, on the
other hand, were interrupted frequently.
Thirty-nine percent of Boolean author searches were interrupted twelve
or more times and the number of intervening records and average interruption
size were correspondingly large. As may
have been expected, precision was much higher for string matches (median .77)
than for Boolean matches (median .26).
[Tables 2 and 3 about here]
The
impact of match type on work record sets was much less pronounced than the
impact of match type on author record sets, although string matches again
outperformed Boolean (Tables 2, 4).
Results for interruption number and precision were surprisingly similar
for string and Boolean matches, with string matches providing only slightly
better collocation than Boolean. The
results for intervening records and average interruption size, however,
indicated string match superiority more clearly.
[Table 4 about here]
As
was true of authors, superworks were collocated
notably better with string matches than Boolean (Tables 2, 5). Superwork record
sets were interrupted much less often in string matches than in Boolean
matches. String matches had a median of
four interruptions, in contrast to Boolean matches, which had a median of
fourteen interruptions. The
distributions for intervening records echoed those for interruption
number. Fifty-eight percent of the
string matches were interrupted by fewer than ten
intervening records, while 53 percent of the Boolean matches were interrupted
by 30 or more intervening records.
However, in contrast to the statistics for authors, superwork
statistics for average interruption size and precision were inconclusive
regarding the difference between string and Boolean matching.
[Table 5 about here.]
Query Type Results
The second aspect of collocation
examined was the effect of query type, author, work, or superwork,
on collocation. Query type had a strong
effect on collocation. Author record
sets were interrupted far less frequently and by far fewer
intervening records than work or superwork record
sets (Tables 6, 7). Seventy percent of
author record sets, as opposed to 23 and 24 percent of work and superwork record sets, respectively, were interrupted two
or fewer times. Median interruption
number shows this difference more dramatically.
Half of the author record sets were not interrupted at all, while work
and superwork record sets had a median of 15.5 and 18
intervening records respectively.
[Tables 6 and 7 about here.]
One of the most interesting results
of this study was the poor collocation of work record sets, which was
illustrated most clearly in average interruption size and precision
statistics. Median average interruption
size was largest for works at 9.7 records, while authors had a median of 5.1
records and superworks 3.2. Precision for works was distressingly
low--only three percent of work record sets had over .59 precision and nearly
90 percent were below .4. Median
precision for works was .15, in contrast to .61 and .47 for authors and superworks, respectively.
Catalog Size Results
The
last variable studied was catalog size.
Quite unexpectedly, catalog size had an inconsistent relationship to
collocation of relevant record sets.
Catalog size had a negligible effect on arrangement of author records
(Tables 8, 9). The distributions for
large, medium, and small catalogs for interruption number were almost
identical. Even the results for the two
measures based on simple record counts, intervening records and average
interruption size, which one would expect to be sensitive to catalog size,
showed a relatively small effect. In
fact, the distributions for intervening records were almost indistinguishable
from the distributions for interruption number.
Catalog size also had no discernible influence on precision. Only the statistics for average interruption
size revealed an effect for catalog size, with a median of four records in
small, five in medium, and seven in large catalogs, and even this effect was
smaller than may have been expected.
[Tables 8 and 9 about here]
Catalog
size had a more perceptible impact on collocation of work records than on
collocation of author records, although it was not consistent across the
measures (Tables 8, 10). In contrast to
the results for authors, the results for works showed the effect of catalog
size most when measured by interruption number and intervening records. The distributions for interruption number
demonstrate the effect of catalog size dramatically; 55 percent of work record
sets in small, thirteen percent in medium, and two percent in large catalogs
were interrupted two or fewer times by irrelevant records. The results for
intervening records demonstrated a similarly strong effect. In contrast, the results for average
interruption size demonstrate a smallish and uneven impact on work
records. Median average interruption
size was 7.3, 11.9, and 10.9 respectively in small, medium, and large
catalogs. The effect of catalog size on precision
was negligible, with median precision almost identical for all three catalog
sizes: .14 for small, .12 for medium,
and .17 for large catalogs, and what small differences there were, were
contrary to expectations.
[Table 10 about here.]
The results for superworks
were similar to those for works (Tables 8, 11).
As expected, small catalogs performed notably better than medium or
large catalogs when measured by interruption number and intervening
records. However, average interruption
size revealed only a small difference.
In catalogs of all sizes, average interruption sizes were small. Only a query for More's
Utopia in a large catalog had an
average interruption size over 20 records. The impact of catalog size on
precision for superwork record sets was even less
noticeable than it was on average interruption size. Medians for all catalog sizes varied less
than .04 precision (.47, .44, and .45 median precision for small, medium, and
large catalogs).
[Table 11 about here.]
MATCH TYPE DISCUSSION
The
claim made by various catalogers that Boolean matching is inferior to string
matching in collocating related records is supported by the findings of this
research. In particular, string matching
provided better collocation for author queries than did Boolean matching. The reason for this most likely has to do
with the number of fields matched.
String matches are by nature limited to a single field and they arrange
records by the field matched, ensuring author collocation. Boolean matching occurs frequently in
multiple fields, so no single field or type of field stands out as an element of
arrangement. As a result, records are
often arranged randomly by internal record number.
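The contrast described above can be illustrated with a minimal Python sketch. It does not reproduce the matching logic of any catalog surveyed. The Record fields, the toy records, and both function names are assumptions made for illustration only. The string match is limited to the author field and arranges hits by that field; the Boolean match looks for the query word in any field and arranges hits by internal record number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: int  # internal record number (hypothetical)
    author: str
    title: str


def normalize(text):
    """Lowercase and replace punctuation with spaces so headings compare consistently."""
    kept = (ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join("".join(kept).split())


def string_match_author(records, query):
    """Left-anchored match on the author field only, arranged by that field."""
    q = normalize(query)
    hits = [r for r in records if normalize(r.author).startswith(q)]
    return sorted(hits, key=lambda r: (normalize(r.author), normalize(r.title)))


def boolean_match_any_field(records, query):
    """AND of the query words found in any field, arranged by record number."""
    words = set(normalize(query).split())
    hits = [r for r in records
            if words <= set(normalize(r.author + " " + r.title).split())]
    return sorted(hits, key=lambda r: r.record_id)


# Toy records, invented for illustration.
records = [
    Record(1, "Homer", "Iliad"),
    Record(2, "McCloskey, Robert", "Homer Price"),
    Record(3, "Homer", "Odyssey"),
    Record(4, "Homer, Winslow", "Watercolors"),
    Record(5, "Doe, Jane", "Homer and the epic tradition"),
]

print([r.record_id for r in string_match_author(records, "Homer")])
print([r.record_id for r in boolean_match_any_field(records, "Homer")])
```

With these toy records the string match returns records 1, 3, and 4 arranged by author heading, so the two records by Homer stay together. The Boolean match returns all five records in record-number order, so record 2, which merely has "Homer" in its title, falls between them.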
That string matching may be superior
to Boolean matching in collocating related records has important ramifications
with respect to the types of searches online catalogs provide and how they
present them to users. Because
preliminary studies show that users do not use all the searches that are
available to them, some researchers are beginning to recommend selection of
default searches for authors, works, and subjects (Yee & Layne, in
press). Other types of searches could be
used either as backup for zero-hits queries or as advanced options.
Online catalogs exist that provide only multiple-field Boolean matching and
do not support string matching at all.
The findings of this research suggest that the selection of Boolean
multiple-field matching as a default for author and title queries may be
ill-advised, particularly for author queries.
Although all of the online catalogs surveyed provided author string
matching, not all provided author Boolean matching in addition. This may be because online catalog designers
have assumed the superiority of string matching for author queries. Boolean multiple-field match displays
frequently mask the presence of author terms because they display brief title
information instead of author name headings (Fig. 3). For online catalogs that provide
Boolean matching, limiting matches to a single field offers the advantages of
string matching in that the field matched may be used as the element
determining record arrangement, author name headings may be displayed, and
cross references may be easily provided.
Boolean single-field matches may, in fact, be preferable to string
matches in that they offer the flexibility to enter an author's name in any
order as well as the advantages of string matching. Such matching was provided for author queries
in only four out of thirteen online catalogs surveyed offering Boolean author
searches. Had single-field Boolean
matching been available in all the catalogs surveyed, the results of this
research for the effect of match type on author queries might have been quite
different.
[Fig. 3 about here]
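Single-field Boolean matching of the kind discussed above can be sketched as follows. This is an illustrative sketch only: the function name and the sample headings are assumptions, not the behavior of any particular catalog. Query words may be entered in any order, but matching is confined to the author heading, and the matched heading is then used to arrange the results.

```python
def normalize(text):
    """Lowercase and replace punctuation with spaces so headings compare consistently."""
    kept = (ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join("".join(kept).split())


def boolean_author_match(headings, query):
    """Return author headings containing every query word, in any order,
    arranged alphabetically by the matched heading."""
    wanted = set(normalize(query).split())
    hits = [h for h in headings if wanted <= set(normalize(h).split())]
    return sorted(hits, key=normalize)


# Sample author headings, chosen for illustration.
headings = [
    "Ives, Charles, 1874-1954",
    "Dickens, Monica, 1915-1992",
    "Dickens, Charles, 1812-1870",
]

# Either word order retrieves the same single heading.
print(boolean_author_match(headings, "charles dickens"))
print(boolean_author_match(headings, "Dickens Charles"))
```

Because the match is limited to one field, that field can serve as the element of arrangement, which a multiple-field Boolean match cannot offer.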
For
work queries, a default approach that is, as yet, unavailable in most catalogs
may be desirable. Yee and Layne (in
press) propose a default work search which allows catalog users to enter both
author name and title as a means of identifying a particular work of interest
more quickly and efficiently than title searches alone as long as authority
records are included in the search.
Recent research by Kilgour (1995) supports
this proposal in a study showing that queries composed of both author surname
and title words produced significantly smaller retrieval sets than queries
composed of title words alone.
The
results of this study also call into question the emphasis on Boolean searching
techniques, sometimes to the exclusion of all other searching techniques, in
bibliographic instruction for online retrieval systems (for example, see Reed,
1993). One reason for this emphasis may
be that Boolean searching techniques are more complex than string searching
techniques. However, instruction
stressing Boolean strategies to the exclusion of other strategies may lead
users to believe that such techniques work well for all types of queries, when,
as demonstrated by this research, this may very well not be the case.
More
research must be completed to further our understanding of the impact of match
type on catalog displays. Research
similar to the study reported here could determine the effect of match type on
collocation of all types of queries, not just worst cases. Also, future research could investigate the
effect of match type on catalog functions other than display, for instance, on
the recall of author and work
records. It would be useful as well to study match type in an experimental
setting in which system and database variables could be more completely
controlled. Finally, it is clear that
work must be done to develop methods to improve collocation in both Boolean and
string matching environments.
QUERY TYPE DISCUSSION
The notable effect of query type
indicates that record structure differences had an important impact on
collocation. Author records collocated
well, undoubtedly because record structure is simple; few MARC fields are
involved and AACR2 requires uniform
headings for authors. Moreover, string
matching collocated author records well, and more catalogs offered string
author matches than Boolean author matches.
Works and superworks
may be more difficult to collocate because record structures are more
complex. Works and superworks
are identified by two different types of information, author name and title,
and only a uniform author name, not a
uniform title, is required in AACR2. In addition, work records may contain varying
subtitles that cause records to scatter when arranged alphabetically by
title. Superworks
are sometimes identified by two fields, author and title, and sometimes by a
single field consisting of an author subfield and a title subfield (a
name-title added entry or a work-as-subject added entry).
Various strategies are needed to
enhance collocation of these records.
Work and superwork collocation could be
improved by both changing current record structures and by improving online
catalog systems. For instance, uniform
titles could be required by AACR2 in
all cases when a new edition of a work is published under a different
title. Online catalog systems could
improve collocation if work and superwork records
were arranged or grouped using both author name and title fields, instead of
relying solely on title fields or arranging records in random order.
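The suggestion that systems group records using both author name and title fields can be sketched in Python. This is a rough illustration under stated assumptions, not a tested collocation procedure. The dictionary keys, the work_key function, and the rule for trimming subtitles and leading articles are all hypothetical, and the records are toy examples.

```python
from collections import defaultdict


def work_key(author, title, uniform_title=None):
    """Build a hypothetical grouping key from author heading and title.

    Prefers the uniform title when one is present; otherwise uses the text
    before any ' / ' statement of responsibility or ' : ' subtitle, with a
    leading English article dropped.
    """
    base = uniform_title or title.split(" / ")[0].split(" : ")[0]
    words = base.lower().split()
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    return (author.lower(), " ".join(words))


def group_by_work(records):
    """Group record titles under their (author, title) key, keys in sorted order."""
    groups = defaultdict(list)
    for rec in records:
        key = work_key(rec["author"], rec["title"], rec.get("uniform_title"))
        groups[key].append(rec["title"])
    return dict(sorted(groups.items()))


# Toy records, invented for illustration.
records = [
    {"author": "Dickens, Charles", "title": "A Christmas carol"},
    {"author": "Teasdale, Sara", "title": "Christmas carol : a poem"},
    {"author": "Dickens, Charles", "title": "Christmas carol : in prose"},
    {"author": "Dickens, Charles", "title": "Cuento de Navidad",
     "uniform_title": "Christmas carol"},
]

groups = group_by_work(records)
for key, titles in groups.items():
    print(key, titles)
```

In this sketch the three Dickens records fall into one group despite their varying subtitles and the translated title, while the Teasdale poem, which shares the title words, is kept apart by its author heading. The translated edition is grouped only because it carries a uniform title, which is the dependency on record content noted above.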
Median precision was low for all
query types. This finding was to be
expected because one of the criteria for selection of worst cases was their
tendency to retrieve large numbers of records.
However, median superwork precision, .47, was
surprisingly higher than median work precision, .15. Only two work record sets had precision of .8
or better. It is quite possible that
libraries collect relatively few editions of a particular work, but, for
worst-case works, they collect many works related to or about that work. Still, it is notable that even though work record
sets were composed of relatively few records, they
were afflicted with very poor collocation.
Interpreting the statistics for works, one sees that although work
record sets were not interrupted often, they were interrupted by very large
interruptions. The effect this could
have on users is profound because the few related records retrieved could be so
scattered that users would have difficulty identifying even one record relevant
to a query.
Superwork
record sets were interrupted far more frequently and by larger numbers of
intervening records than work record sets.
Several factors may explain this.
By definition, superwork record sets contain
more records than work record sets, and, as a result, they may be more
susceptible to interruption. Also, some
related work records were identified as such because of information they have
in secondary-title entries. Because
title commands sometimes arrange records by main-title fields (240 or 245
fields) and not by secondary-title fields, these records may be scattered
throughout a retrieved record set.
Online catalogs could improve superwork
collocation by using author and title fields of all types as the basis for
grouping related records.
Although collocation of superwork records was somewhat better than work collocation
as measured by average interruption size and precision, this finding is not
necessarily a reason for optimism as it indicates that related work records are
scattered intermittently among work records.
That superworks have more interruptions and
smaller average interruption sizes than works means that the large
interruptions in work record sets have been filled in by related work
records. For example, videorecordings of Scrooge
and The Jetson's
Christmas Carol may be interfiled with records for the Dickens text. Although this research was not intended to
test the arrangement of work and related-work records per se, different levels of arrangement may be crucial when a
retrieval set is large. The problems of
arranging records in a large work record set, dubbed "the Humphry Clinker problem," have been
studied by O'Neill and Vizine-Goetz (1987) and Svenonius (1988).
Solutions to these problems are being studied in an experimental system
at the
Future research should investigate
the effect of record ordering on catalog use.
Users could be presented with displays incorporating various record
orderings, for example, orderings based on format, date of publication, or
language, and user understanding and preference for each type of display could
be compared. Different types of
orderings may be preferable for different types of queries, and the suggestion
has been made that online catalogs offer more than one ordering option
(Buckland, Norgard, & Plaunt,
1993). Future investigation could also
compare the effect of the current list-type displays to the effect of graphic
displays.
CATALOG SIZE DISCUSSION
Catalog size produced the most
surprising results of this study.
Because two of the measures for collocation used in this study,
interruption number and intervening records, were simple counts, one would
expect them to show the effect of catalog size and be larger in large catalogs
than in medium or small. It would also
be reasonable to expect that average interruption size and precision would show
the influence of catalog size. That
these expectations proved unfounded has serious implications for information
retrieval in online catalogs. The
intelligibility of displays even in small
databases may be compromised by large retrieval sets, and, thus, even small
online catalogs may have to be designed to deal with the problems presented by
them.
The fact that author collocation was
little affected by catalog size is perhaps the least surprising given the
strong effects of match type. Because
all of the catalogs surveyed provided both string and Boolean matches, it is
perhaps to be expected that catalog size had a less powerful influence. Even in large catalogs, string matches often
resulted in perfectly collocated author record sets. The reason for the unsystematic effect of
catalog size on collocation of work and superwork
records is less clear. As mentioned in
the discussion above, it is possible that record structure variations
associated with query type have a significant impact on the grouping of work
and superwork records. Catalog size had the strongest effect on work
collocation. The implication is that
when the influence of other variables is not strong, catalog size will have an
impact on collocation.
One reason that catalog size had a
relatively small impact overall may have to do with the nature of library
collections. Certain authors and works
are collected heavily and, thus, worst cases may exist in all general library
environments. The implications of this
are twofold. First, the large retrieval
set problem may be more universal than may have been supposed. When one considers the advent of retrieval in
a virtual library environment, in which full-texts and enhanced bibliographic
records are included in the retrieval pool, the prospect becomes even more
daunting. On a more optimistic note, if
a successful method of securing collocation is designed for small catalogs,
then the same method could be successful for larger catalogs. This certainly seems to be the case for
author string matching. A method of
collocating work and superwork records may also be
effective regardless of catalog size.
Assumptions regarding catalog size
have influenced the design of cataloging rules.
Cataloging folklore holds that smaller catalogs collect fewer editions
of works and therefore have different needs with respect to the length of a
bibliographic record and the use of uniform title. AACR2
rule 1.0.D provides catalogers, presumably from smaller libraries, the option
of creating brief records. In AACR2 rule 25.1A, one of the reasons not
to use uniform titles is based on "the extent to which the catalogue is
used for research purposes," an implication being that smaller libraries
do not need to use uniform titles. The
results of this research demonstrate that these assumptions may be incorrect
with respect to worst cases, and that smaller libraries may be well-advised to
use all the techniques available to enhance collocation.
Further research using a larger
worst-case sample is necessary to confirm the observations of this study
regarding catalog size. It would also be
interesting to look at the incidence of worst cases in catalogs of various
sizes. The results of this study imply
that worst-case displays are comparable across catalogs. Does this also imply that users would have
the same success finding particular worst cases regardless of catalog size? The 1958 American Library Association card
catalog use study (Jackson, p. 15) found that users were more likely to find
the known-items they sought in small catalogs than in large catalogs. This research has not been replicated in the
online environment, nor has any research looked at the impact of catalog size
on users' ability to find worst-case authors or works.
GENERAL DISCUSSION
First, a word
about the dependent variables.
The measures that seemed most revealing of the extent to which relevant
record sets were collocated were interruption number and intervening
records. Average interruption size and
precision, on the other hand, were useful to contrast to interruption number
and intervening records to obtain a more complete picture of collocation. Although four measures of collocation may
seem somewhat unwieldy, one could imagine using even more to portray the
various aspects of collocation. In this
research, collocation was measured somewhat imprecisely in that the placement
of different types of relevant records with respect to each other was not
regarded, that is, all relevant records were treated equally. An author record set in which works by a
particular author and works about that author were scattered would have been
treated in this research as a perfectly collocated record set, when, in fact,
such scatter could be confusing or irritating to users. For future research on collocation, other,
more finely tuned, dependent variables could be defined to measure collocation.
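To make the four measures concrete, the following Python sketch computes them for a display coded as a sequence of relevant (R) and not relevant (N) records. The definitions used are assumptions made for this illustration and may differ in detail from those applied in the survey: an interruption is taken to be a run of N records lying between the first and last R record, intervening records is the total number of records in such runs, average interruption size is their ratio, and precision is the share of R records in the whole display. The function name and the sample display are hypothetical.

```python
from itertools import groupby


def collocation_measures(display):
    """Compute collocation measures for a string of 'R' and 'N' in display order."""
    first, last = display.find("R"), display.rfind("R")
    if first == -1:  # no relevant records retrieved at all
        return {"interruptions": 0, "intervening": 0,
                "avg_size": 0.0, "precision": 0.0}
    # Only N records between the first and last R can interrupt the set.
    span = display[first:last + 1]
    runs = [len(list(group)) for flag, group in groupby(span) if flag == "N"]
    interruptions = len(runs)
    intervening = sum(runs)
    avg_size = intervening / interruptions if interruptions else 0.0
    precision = display.count("R") / len(display)
    return {"interruptions": interruptions, "intervening": intervening,
            "avg_size": avg_size, "precision": precision}


# Hypothetical nine-record display; the trailing N does not count as an
# interruption under the assumption above, but it does lower precision.
measures = collocation_measures("RRNRNNRRN")
print(measures)
```

For this sample display the sketch finds two interruptions containing three intervening records, an average interruption size of 1.5, and precision of five relevant records out of nine. A perfectly collocated set such as "RRRNN" would show no interruptions even though its precision is below 1, which is why the measures are best read together.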
The
results of this study have important implications for the revision of
cataloging rules. One of the reasons
that catalogers have supposed string matching to be superior to Boolean is that
string matching mimics retrieval in the manual environment. The techniques outlined in AACR2 to ensure collocation were created
for the manual environment. Collocation
in the manual environment is wholly dependent on record content, that is, on
the construction of uniform author and work headings. In a computer environment offering Boolean
and other types of matching and retrieval, uniform headings alone are not
sufficient to collocate author and work records. The constructors of AACR2 and earlier Anglo-American cataloging codes limited
themselves to the realm of record content perhaps, in part, because collocation
in the manual environment was largely guaranteed based on content alone. In so doing, however, they have abdicated
responsibility for meeting user needs with respect to display. If AACR2
is to truly support collocation of related author and work records in online
catalogs, then it must come to grips with the fact that it can no longer
restrict the province of the rules to record content alone and must include
rules for arrangement and display as well.
In other words, AACR2 must
become a code governing the construction of catalogs,
not just catalog records.
The
reluctance of code writers to incorporate rules for arrangement and display may
be due to a reluctance to specify exactly how
online catalogs must provide collocation.
As online catalog software varies from catalog to catalog, provisions
specifying how collocation should occur would be complex. However, precedents exist for AACR2 to specify an outcome without
specifying how that outcome must be accomplished. For example, many rules mandate the use of
cross references, but they do not specify how a particular catalog must create
them. In a similar manner, collocation
of author and work records could be mandated without specifying exactly how a
particular system must accomplish it.
One of the most critical areas for
future research is an investigation of the effect of collocation on catalog
use. To what extent do poorly collocated
record displays prevent users from finding the items they seek? For at least two centuries catalogers have
assumed that collocation of relevant records enhances use. However, no research has ever been done to
test this assumption. While it seems
likely that this assumption is valid, it would still be useful to test it
because any improvements in our catalogs should be justified by a clear gain
for users. As collocation is poor in
operational catalogs, it would be necessary to develop an experimental system
which provides perfectly collocated record displays. Because of the varieties of record structures
associated with work and superwork records, this task
is formidable. Two avenues present
themselves for accomplishing it. First
is "cleaning up" or enhancing existing worst-case records, for
example, adding uniform titles to all appropriate work records. This would be costly, so it would be desirable to
investigate the development of automatic collocation or linking procedures
using existing records.
It seems that little thought, much
less creative design work, has gone into the display of multiple records in
online catalogs. Few catalogs have used
graphic display software or sophisticated linking techniques to improve
multiple-record displays, particularly multiple-record displays of authors and
works. Although much work must be
completed before displays that collocate relevant records could actually be
implemented in online catalogs, it is not difficult to imagine how such
displays might look (Fig. 4). In a
graphic display environment using hierarchical tree structures to represent relevant
works or authors, users could simply click on the part of a tree that they are
interested in to retrieve other tree structure displays leading to specific
records of interest. Displays
incorporating work tree structures have also been suggested by Svenonius (1988, p. 7).
[Fig. 4 about here.]
Further work must be done to define
and identify worst cases. The definition
of worst case used in this research was based predominantly on characteristics
of retrieval. It was also somewhat loose
by design, as it was a first attempt at such a definition, and the individual
cases produced different results (Carlyle, 1994, pp. 120-141). A more thorough analysis of record structure
and arrangement could be used to provide a definition of worst case based on display
characteristics as opposed to retrieval characteristics. Use of a display-based definition might also
reveal the existence of various types of worst case, which would, in turn, be
useful for the development of automatic collocating procedures. It would also be useful to identify specific
cases that are problematic for users.
Identification of the actual cases that cause problems for users would
be invaluable if record enhancement were necessary to improve collocation as it
would allow researchers to limit the number of records enhanced.
CONCLUSION
The results of this research show
that online catalog displays sometimes scatter records relevant to a query
among irrelevant records and that multiple-field Boolean matching in particular
contributes to this scatter. In
addition, the findings of this study indicate that poor collocation may be a
problem for online catalogs regardless of their size, and that works are
collocated less successfully than authors or superworks. Although collocation is one of the standards
governing catalog design, this standard is obviously far from being operative
in current online catalogs. The results of this study may help to explain the
difficulties reported by catalog users who retrieve long displays. Displays that do not demonstrate
relationships among relevant items retrieved may leave users, at best,
disgruntled over the amount of time necessary to find what they are looking for
and, at worst, oblivious to the fact that the library actually holds the very
item or items they seek. As always, the
question remains--how do we design systems that help users complete their
searches easily, efficiently, and successfully?
Catalog displays that order records
based on their probable relevance to a query offer users whose information
needs are well-defined an essential strategy for identifying records relevant
to their needs. However, users whose
information needs are less well-defined or whose queries consist of specific
authors or works may require a different approach. Collocation of relevant records lays the
foundation for a strategy that may help users identify relevant records by
allowing them to see an overview of the records in a large retrieval set. Efforts by Larson (1991) and McGarry and Svenonius (1991)
demonstrate that large subject displays may be reduced by record grouping,
compression, and clustering. Reducing
large record sets by grouping collocated record sets may help users identify
relevant records by giving them an overview of the records that have been
retrieved. The more rapidly online
systems increase in size and complexity, the more urgently we need to solve the
problems these systems engender. The
principle of collocation may serve well in the creation of displays that guide
users successfully to the information they require.
ACKNOWLEDGEMENTS
I wish to express my thanks and deep
appreciation to Elaine Svenonius, chair of my
dissertation committee, for her encouragement, guidance, and support. I would also like to acknowledge the
thoughtful comments and advice of Ann Bein, Michèle Cloonan, Milos Ercegovac, Raya Fidel, Julie Gedeon, Sara Shatford Layne, Dorothy McGarry,
Dee Andy Michel, Richard E. Rubin, Diana M. Thomas, and Martha M. Yee. I am also grateful to the many librarians who
kindly answered my questions regarding the survey.
References
Abels, D. M. (1993). Sequencing Items in Multiple-item Displays on Online
Public Access Catalogs. Ph.D. diss.
Anglo-American Cataloguing
Rules. (1988). 2nd. ed.
revised.
Attig, J. C. (1989). Descriptive
Cataloging Rules and Machine-Readable Record Structures: Some Directions for Parallel
Development. In Svenonius,
E. (Ed.) The Conceptual Foundations of Descriptive Cataloging (pp. 135-148).
Ayres,
F. H., Nielsen, L. P. S., Ridley, M. J., & Torsun,
Borgman, C.L. (Nov.
1986). Why are
Online Catalogs Hard to Use? Lessons
Learned from Information-Retrieval Studies.
Journal of the
American Society for Information Science. 37(6), 387-400.
Bridge, F. R. (
Buckland, M. K., Norgard,
B. A., & Plaunt, C. (Sept.
1993). Filing, Filtering, and the First
Few Found. Information Technology and Libraries. 12, 311-319.
Carlyle, A. (1994). The Second Objective
of the Catalog: An Evaluation of
Collocation in Online Catalog Displays.
Ph.D. diss.
Carpenter, M. (1981). Corporate Authorship: Its Role in Library
Cataloging.
Cutter, C. A. (1904). Rules for a Dictionary
Catalog. 4th
ed., rewritten.
Dalrymple, P. W.
(June 1990). Retrieval by
Reformulation in Two Library Catalogs:
Toward a Cognitive Model of Searching Behavior. Journal of the American Society for Information Science. 41(4), 272-281.
Delsey, T. (1989). Standards for
Descriptive Cataloging: Two Perspectives
on the Past Twenty Years. In Svenonius, E. (Ed.) The Conceptual Foundations of Descriptive
Cataloging (pp. 51-60).
Duke, J.
K. (1989). Access and Automation: The Catalog Record in the Age of
Automation. In Svenonius,
E. (Ed.) The Conceptual Foundations of Descriptive Cataloging (pp. 117-128).
Dwyer,
C.M., Gossen, E.A., & Martin, L.M. (1991). Known-Item Search Failure in an
OPAC. RQ. 31(2), 228-236.
International Federation of Library Associations. (1971). Statement of Principles Adopted at the International
Conference on Cataloguing Principles,
Jackson, S. L.
(1958). Catalog Use Study.
Johnson, D. W. & Connaway, L. S. (1992). Use of Online Catalogs: A Report of the Results of Focus Group
Interviews. Typescript.
Kilgour, F. G. (1979). Design of Online
Catalogs in The Nature and Future of the
Catalog: Proceedings of the
Kilgour, F. G. (Mar.
1995). Effectiveness
of Surname-Title-Words Searches by Scholars. Journal
of the American Society for Information Science 46(2), 146-151.
Korfhage, R. R. (1991). To See or Not to See--Is That the Query? SIGIR '91: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 134-141).
Larson, R. (1991). Classification Clustering, Probabilistic Information Retrieval, and the Online Catalog. Library Quarterly 61(2), 133-173.
Library of Congress. (1991). Subject Cataloging Manual: Subject Headings.
Lubetzky, S. (1960). Code of Cataloging Rules: Author and Title Entries, an Unfinished Draft. American Library Association.
Lubetzky, S. (1969). Principles of Cataloging.
Mann, T. (1993). Library Research Models.
MARC Formats for Bibliographic Data. (1980 & updates).
Matthews, J. R., Lawrence, G. S., &
McGarry, D. & Svenonius, E. (September 1991). More on Improved Browsable Displays for Online Subject Access. Information Technology and Libraries. 10(3), 185-191.
Nassif, S. R., Strojwas, A. J., & Director, S. W. (January 1986). A Methodology for Worst-Case Analysis of Integrated Circuits. IEEE Transactions on Computer-Aided Design. 5(1), 104-113.
Nelson, M. J. (1988). Correlation of Term Usage and Term Indexing Frequencies. Information Processing & Management. 24(5), 541-547.
O'Neill, E. T. & Vizine-Goetz, D. (1989). Bibliographic Relationships: Implications for the Function of the Catalog. In Svenonius, E. (Ed.). The Conceptual Foundations of Descriptive Cataloging (pp. 167-179).
Pease, S. & Gouke, M. N. (July 1982). Patterns of Use in an Online Catalog and a Card Catalog. College & Research Libraries. 43, 279-291.
Purgailis Parker, L. M. & Johnson, R. E. (October 1990). Does Order of Presentation Affect Users' Judgment of Documents? Journal of the American Society for Information Science. 4, 493-494.
Reed, L. L. (Winter 1993). Locally Loaded Databases and Undergraduate Bibliographic Instruction. RQ 33(2), 266-273.
Smiraglia, R. P. (1992). Authority Control and the Extent of Derivative Bibliographic Relationships. Ph.D. diss.
Solomon, P. (June 1993). Children's Information Retrieval Behavior: A Case Analysis of an OPAC. Journal of the American Society for Information Science. 44(5), 245-264.
Svenonius, E. (1988). Clustering Equivalent Bibliographic Records. Annual Review of OCLC Research, July 1987-June 1988 (pp. 6-8).
Walker, S. & de Vere, R. (1990). Improving Subject Retrieval in Online Catalogues: 2. Relevance Feedback and Query Expansion. British Library Research Paper 72.
Wiberley, S. E., Daugherty, R. A., & Danowski, J. A. (1995). Displaying Online Catalog Postings: LUIS. Library Resources & Technical Services 39(3), 247-264.
Wilson, P. (1989). The Second Objective. In Svenonius, E. (Ed.) The Conceptual Foundations of Descriptive Cataloging (pp. 5-16).
Yee, M. M. (1995). What is a Work? Part 4: Cataloging Theorists and a Definition. Cataloging & Classification Quarterly. 20(2): 3-24.
Yee, M. M. & Layne, S. S. (In press). Online Public Access Catalogs. Encyclopedia of Library and Information Science.
Records 1-17 [R/N marking of each record not recoverable]
R = relevant; N = not relevant
Interruption no.: 4
Intervening records: 3
Interruption size: 1.5
Precision: .65
Fig. 1. Dependent Variable Calculation Example: Display from a Query for
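The four dependent variables in this example can be sketched in code. The following is a minimal illustration, not the study's own procedure, and it rests on several assumptions: an interruption is taken to be a maximal run of non-relevant records lying between the first and last relevant record, intervening records are the records inside those runs, average interruption size is their ratio, and precision is the share of relevant records in the whole display. The function name `collocation_measures` and the R/N sequence are hypothetical; the sequence is not the one shown in Fig. 1.

```python
from itertools import groupby


def collocation_measures(marks):
    """Compute collocation measures for a display given as a string of
    'R' (relevant) and 'N' (not relevant) marks in display order.

    Definitions used here are assumptions made for illustration.
    """
    if not marks or set(marks) - {"R", "N"}:
        raise ValueError("marks must be a non-empty string of 'R' and 'N'")

    # Assumed definition: relevant records over all records displayed.
    precision = marks.count("R") / len(marks)

    # Only non-relevant records between the first and last relevant
    # record can interrupt the relevant set, so trim the ends.
    core = marks.strip("N")

    # Each maximal run of 'N' inside the trimmed display is one interruption.
    runs = [len(list(group)) for mark, group in groupby(core) if mark == "N"]

    interruptions = len(runs)
    intervening = sum(runs)
    average_size = intervening / interruptions if interruptions else 0.0

    return {
        "interruptions": interruptions,
        "intervening": intervening,
        "average_size": average_size,
        "precision": precision,
    }


# Hypothetical 12-record display (not the display in Fig. 1).
example = "RRNRRRNNRNRR"
result = collocation_measures(example)
print(result)
```

Trimming leading and trailing non-relevant records is a design choice of this sketch: such records lower precision but do not separate one relevant record from another.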
String Matching Display for "A Christmas Carol" Query
1. A Christmas carol / by Charles Dickens
2. A Christmas carol [a musical score by Charles Ives]
3. Christmas carol [motion picture]
4. Christmas carol : a poem / by Sara Teasdale
5. A Christmas carol and other stories / by Charles Dickens
6. A Christmas carol. In prose. [by] Charles Dickens
7. A Christmas carol : part 1, to begin with ... / Charles Dickens
8. A Christmas carol pop-up book [based on Dickens' classic story]
Records retrieved in response to string matches are often arranged in alphabetical order by title. In addition, in most online catalogs, string matches retrieve query terms only if they occur at the beginning of a field.

Boolean Matching Display for "christmas" and "carol" Query
1. A Virgin unspotted (
2. A Christmas carol / by Charles Dickens
3. The Birds' Christmas carol / by Kate Douglas Wiggin
4. A Christmas carol [a musical score by Charles Ives]
5. Christmas carol [motion picture]
6. A Tale of two cities ; A Christmas carol ; The chimes
7. A Christmas carol. In prose. [by] Charles Dickens
8. Christmas at the Cratchits : being an excerpt from "A Christmas carol"
9. The carol album : seven centuries of Christmas music
10. The Jetson's Christmas carol
11. A Christmas carol and other stories / Charles Dickens
12. Christmas carol : a poem / by Sara Teasdale
13. A Christmas carol pop-up book [based on Dickens' classic story]
14. Scrooge [a musical based on A Christmas carol by Charles Dickens]
15. Dickens's a Christmas carol
16. The first Canadian Christmas carol
Records retrieved in response to Boolean matches are often arranged essentially randomly by internal record number.
Fig. 2. String and Boolean Matching Displays from a Query for A Christmas Carol
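The difference between the two displays in Fig. 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the behavior of any particular catalog surveyed: string matching is modeled as a left-anchored prefix match on a normalized title with a leading article dropped, sorted alphabetically, and Boolean matching is modeled as a keyword AND over title words, sorted by record number. The record numbers, the helper names (`normalize`, `filing_key`, `string_match`, `boolean_match`), and the rule for skipping initial articles are all hypothetical.

```python
import re

# (hypothetical internal record number, title); titles taken from Fig. 2
RECORDS = [
    (63, "The carol album : seven centuries of Christmas music"),
    (42, "A Christmas carol"),
    (12, "Dickens's a Christmas carol"),
    (75, "A Christmas carol and other stories"),
    (17, "The Birds' Christmas carol"),
    (58, "Christmas carol : a poem"),
    (90, "Scrooge"),
]

ARTICLES = {"a", "an", "the"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def filing_key(title):
    """Normalized title with a leading article dropped (assumed filing rule)."""
    words = normalize(title)
    if words and words[0] in ARTICLES:
        words = words[1:]
    return " ".join(words)


def string_match(query, records):
    """Left-anchored match: the query must begin the title field.
    Results are returned in alphabetical order by filing key."""
    key = filing_key(query)
    hits = [rec for rec in records if filing_key(rec[1]).startswith(key)]
    return sorted(hits, key=lambda rec: filing_key(rec[1]))


def boolean_match(terms, records):
    """Keyword AND: every term must occur somewhere in the title.
    Results are returned in record-number order, which is unrelated
    to the alphabetical order of the titles."""
    wanted = {term.lower() for term in terms}
    hits = [rec for rec in records if wanted <= set(normalize(rec[1]))]
    return sorted(hits, key=lambda rec: rec[0])


string_hits = string_match("A Christmas Carol", RECORDS)
boolean_hits = boolean_match(["christmas", "carol"], RECORDS)

for label, hits in (("String", string_hits), ("Boolean", boolean_hits)):
    print(label)
    for number, title in hits:
        print(f"  {number}: {title}")
```

In this sketch the string match returns only titles that begin with the query, while the Boolean match also returns titles such as "The carol album : seven centuries of Christmas music", where the terms occur anywhere in the field.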
Boolean Multiple Field Matching Display for William James Query
1. Literary theory and structure: essays, ......................
2. The grave, a poem. (1743) ...................... Blair, Robert
3. Irish literary portraits; W. B. Yeats, ......... Rodgers, William Robert
4. The flamboyant judge, James D. Hamlin, ......... Hamlin, James D.
   Canyon,
5. The history of Western education, .............. Boyd, William
Records retrieved in response to Boolean multiple-field matches are often arranged essentially randomly by internal record number.

Boolean Single Field Matching Display for William James Query
1. Durant, William James, 1885-
2. [search under the name] Durant, Will, 1885- ........ 12 entries
3. Gibson, James William. ............................. 1 entry
4. James, William ..................................... 2 entries
5. James, William, 1842-1910 .......................... 30 entries
7. James, William E. .................................. 1 entry
   James, William Roderick, 1892-1942
8. [search under the name] James, Will, 1892-1942 ..... 4 entries
9. Reid, William James ................................ 1 entry
Boolean matches limited to single fields may be arranged by author name and make use of cross references.

String Matching Display for William James Query
1. James, William ..................................... 2 entries
2. James, William, 1842-1910 .......................... 30 entries
3. James, William E. .................................. 2 entries
   James, William Roderick, 1892-1942
4. [search under the name] James, Will, 1892-1942 ..... 4 entries
Records retrieved in response to string author matches are often arranged in alphabetical order by author name heading. In addition, cross references are provided to unused forms of an author name.
Fig. 3. String and Boolean Matching Author Displays
                         A CHRISTMAS CAROL / Charles Dickens
                                          |
         _________________________________________________________________
         |                                |                               |
Textual Editions                 Works Related to                 Works About
A Christmas Carol                A Christmas Carol                A Christmas Carol
[Call. no. ...]

For example:                     For example:                     For example:
English language editions        Text adaptations                 Criticism
Editions in other languages      Musical adaptations
Audio editions [Call no. ...]    Audiovisual adaptations

Fig. 4. Superwork Hierarchical Tree Structure Summary Display for A Christmas Carol
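One way to picture the hierarchy in Fig. 4 is as a small tree data structure. The sketch below is illustrative only: the node layout, the helper names (`node`, `outline`, `leaves`), and the exact wording of the category labels are assumptions, and the example categories are the ones listed in the figure.

```python
def node(label, *children):
    """Build a tree node as a plain dictionary (hypothetical layout)."""
    return {"label": label, "children": list(children)}


# Superwork at the root, three groups beneath it, example categories as leaves.
SUPERWORK = node(
    "A Christmas Carol / Charles Dickens",
    node(
        "Textual Editions of A Christmas Carol",
        node("English language editions"),
        node("Editions in other languages"),
        node("Audio editions"),
    ),
    node(
        "Works Related to A Christmas Carol",
        node("Text adaptations"),
        node("Musical adaptations"),
        node("Audiovisual adaptations"),
    ),
    node(
        "Works About A Christmas Carol",
        node("Criticism"),
    ),
)


def outline(tree, depth=0):
    """Return the tree as indented lines, one line per node."""
    lines = ["  " * depth + tree["label"]]
    for child in tree["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


def leaves(tree):
    """Return the labels of the lowest-level categories, left to right."""
    if not tree["children"]:
        return [tree["label"]]
    return [leaf for child in tree["children"] for leaf in leaves(child)]


print("\n".join(outline(SUPERWORK)))
```

A summary display of this kind would show only the upper levels first and let the user open one branch at a time, rather than listing every record in a single sequence.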
Table 1. Summary of Data Collection Parameters
Types of Catalogs Surveyed: DRA, Dynix, GEAC Advance, Innovative Interfaces, NOTIS, VTLS
No. of Catalogs Surveyed: 18
No. of Author Queries Performed:
    String (S): 90
    Boolean (B): 62*
No. of Title Queries Performed:
    String: 90
    Boolean: 90
*Some catalogs surveyed did not provide Boolean author searches.
Table 2. Match Type Descriptive Statistics (Medians)
AUTHORS    | Interruption Number | Intervening Records | Average Interruption Size | Precision
String     | 2.0                 | 0.0                 | 5.0                       | .77
Boolean    | 3.5                 | 6.5                 | 6.1                       | .26

WORKS      | Interruption Number | Intervening Records | Average Interruption Size | Precision
String     | 4                   | 11                  | 6.5                       | .21
Boolean    | 5                   | 29                  | 16.7                      | .07

SUPERWORKS | Interruption Number | Intervening Records | Average Interruption Size | Precision
String     | 4                   | 6                   | 3.2                       | .51
Boolean    | 14                  | 40                  | 2.8                       | .40
Table 3. Match Type Results: Authors
VARIABLE          | STRING | PERCENT | BOOLEAN | PERCENT
Interruption No.  |        |         |         |
  0 to 2          | 81     | 90%     | 26      | 42%
  3 to 5          | 7      | 8%      | 8       | 13%
  6 to 8          | 2      | 2%      | 4       | 6%
  9 to 11         | 0      | 0%      | 0       | 0%
  12 up           | 0      | 0%      | 24      | 39%
Intervening Recs. |        |         |         |
  0 to 9          | 85     | 94%     | 32      | 52%
  10 to 19        | 1      | 1%      | 3       | 5%
  20 to 29        | 0      | 0%      | 0       | 0%
  30 up           | 4      | 4%      | 27      | 44%
Avg. Int. Size    |        |         |         |
  0-9.9           | 64     | 71%     | 37      | 60%
  10-19.9         | 19     | 21%     | 6       | 10%
  20-29.9         | 5      | 6%      | 3       | 5%
  30 up           | 2      | 2%      | 16      | 26%
Precision         |        |         |         |
  .8-1            | 39     | 43%     | 10      | 16%
  .6-.79          | 24     | 27%     | 5       | 8%
  .4-.59          | 15     | 17%     | 2       | 3%
  .2-.39          | 7      | 8%      | 20      | 32%
  0-.19           | 5      | 6%      | 25      | 40%
Percents may not add up to 100 due to rounding error.
Table 4. Match Type Results: Works
VARIABLE          | STRING | PERCENT | BOOLEAN | PERCENT
Interruption No.  |        |         |         |
  0 to 2          | 27     | 30%     | 15      | 17%
  3 to 5          | 36     | 40%     | 37      | 41%
  6 to 8          | 18     | 20%     | 17      | 19%
  9 to 11         | 6      | 7%      | 8       | 9%
  12 up           | 3      | 3%      | 13      | 14%
Intervening Recs. |        |         |         |
  0 to 9          | 44     | 49%     | 34      | 38%
  10 to 19        | 13     | 14%     | 7       | 8%
  20 to 29        | 6      | 7%      | 4       | 4%
  30 up           | 27     | 30%     | 45      | 50%
Avg. Int. Size    |        |         |         |
  0-9.9           | 65     | 72%     | 26      | 29%
  10-19.9         | 20     | 24%     | 26      | 29%
  20-29.9         | 3      | 3%      | 20      | 22%
  30 up           | 2      | 2%      | 18      | 20%
Precision         |        |         |         |
  .8-1            | 1      | 1%      | 1       | 1%
  .6-.79          | 2      | 2%      | 2       | 2%
  .4-.59          | 12     | 13%     | 2       | 2%
  .2-.39          | 35     | 39%     | 11      | 12%
  0-.19           | 40     | 44%     | 74      | 82%
Percents may not add up to 100 due to rounding error.
Table 5. Match Type Results: Superworks
VARIABLE          | STRING | PERCENT | BOOLEAN | PERCENT
Interruption No.  |        |         |         |
  0 to 2          | 37     | 41%     | 7       | 8%
  3 to 5          | 21     | 23%     | 14      | 16%
  6 to 8          | 12     | 13%     | 9       | 10%
  9 to 11         | 16     | 18%     | 7       | 8%
  12 up           | 4      | 4%      | 53      | 59%
Intervening Recs. |        |         |         |
  0 to 9          | 52     | 58%     | 20      | 22%
  10 to 19        | 9      | 10%     | 13      | 14%
  20 to 29        | 9      | 10%     | 9       | 10%
  30 up           | 20     | 22%     | 48      | 53%
Avg. Int. Size    |        |         |         |
  0-9.9           | 84     | 93%     | 86      | 96%
  10-19.9         | 5      | 6%      | 4       | 4%
  20-29.9         | 0      | 0%      | 0       | 0%
  30 up           | 1      | 1%      | 0       | 0%
Precision         |        |         |         |
  .8-1            | 18     | 20%     | 16      | 18%
  .6-.79          | 19     | 21%     | 11      | 12%
  .4-.59          | 13     | 14%     | 18      | 20%
  .2-.39          | 33     | 37%     | 21      | 23%
  0-.19           | 7      | 8%      | 24      | 27%
Percents may not add up to 100 due to rounding error.
Table 6. Query Type Descriptive Statistics (Medians)
QUERY TYPE | INTERRUPTION NUMBER | INTERVENING RECORDS | AVERAGE INTERRUPTION SIZE | PRECISION
Author     | 2                   | 0                   | 5.1                       | .61
Work       | 4                   | 15.5                | 9.7                       | .15
Superwork  | 7                   | 18                  | 3.2                       | .47
Table 7. Query Type Results
VARIABLE          | AUTHORS | PERCENT | WORKS | PERCENT | SUPERWORKS | PERCENT
Interruption No.  |         |         |       |         |            |
  0 to 2          | 107     | 70%     | 42    | 23%     | 44         | 24%
  3 to 5          | 15      | 10%     | 73    | 41%     | 35         | 19%
  6 to 8          | 6       | 4%      | 35    | 19%     | 21         | 12%
  9 to 11         | 0       | 0%      | 14    | 8%      | 23         | 13%
  12 up           | 24      | 16%     | 16    | 9%      | 57         | 32%
Intervening Recs. |         |         |       |         |            |
  0 to 9          | 117     | 77%     | 78    | 43%     | 72         | 40%
  10 to 19        | 4       | 3%      | 20    | 11%     | 22         | 12%
  20 to 29        | 0       | 0%      | 10    | 6%      | 18         | 10%
  30 up           | 31      | 20%     | 72    | 40%     | 68         | 38%
Avg. Interr. Size |         |         |       |         |            |
  0-9.9           | 101     | 66%     | 91    | 51%     | 170        | 95%
  10-19.9         | 25      | 16%     | 46    | 26%     | 9          | 5%
  20-29.9         | 8       | 5%      | 23    | 13%     | 0          | 0%
  30 up           | 18      | 12%     | 20    | 11%     | 1          | 1%
Precision         |         |         |       |         |            |
  .8-1            | 49      | 32%     | 2     | 1%      | 34         | 19%
  .6-.79          | 29      | 19%     | 4     | 2%      | 30         | 17%
  .4-.59          | 17      | 11%     | 14    | 8%      | 31         | 17%
  .2-.39          | 27      | 18%     | 46    | 26%     | 54         | 30%
  0-.19           | 30      | 20%     | 114   | 63%     | 31         | 17%
Percents may not add up to 100 due to rounding error.
Table 8. Catalog Size Statistics (Medians)
AUTHORS         | INTERRUPTION NUMBER | INTERVENING RECORDS | AVERAGE INTERRUPTION SIZE | PRECISION
Small Catalogs  | 2                   | 0                   | 3.7                       | .56
Medium Catalogs | 2                   | 0                   | 5.0                       | .65
Large Catalogs  | 2                   | 0                   | 6.6                       | .59

WORKS           | INTERRUPTION NUMBER | INTERVENING RECORDS | AVERAGE INTERRUPTION SIZE | PRECISION
Small Catalogs  | 2                   | 1.5                 | 7.3                       | .14
Medium Catalogs | 5                   | 24.5                | 11.9                      | .12
Large Catalogs  | 7                   | 38.5                | 10.9                      | .17

SUPERWORKS      | INTERRUPTION NUMBER | INTERVENING RECORDS | AVERAGE INTERRUPTION SIZE | PRECISION
Small Catalogs  | 3                   | 6.0                 | 2.9                       | .47
Medium Catalogs | 9                   | 22.5                | 3.1                       | .44
Large Catalogs  | 11                  | 48.5                | 4.0                       | .45
Table 9. Catalog Size Results: Authors
VARIABLE          | SMALL CATS. | PERCENT | MEDIUM CATS. | PERCENT | LARGE CATS. | PERCENT
Interruption No.  |             |         |              |         |             |
  0 to 2          | 38          | 78%     | 35           | 71%     | 34          | 63%
  3 to 5          | 4           | 8%      | 4            | 8%      | 7           | 13%
  6 to 8          | 1           | 2%      | 1            | 2%      | 4           | 7%
  9 to 11         | 0           | 0%      | 0            | 0%      | 0           | 0%
  12 up           | 6           | 12%     | 9            | 18%     | 9           | 17%
Intervening Recs. |             |         |              |         |             |
  0 to 9          | 41          | 84%     | 38           | 78%     | 38          | 70%
  10 to 19        | 0           | 0%      | 0            | 0%      | 4           | 7%
  20 to 29        | 0           | 0%      | 0            | 0%      | 0           | 0%
  30 up           | 8           | 16%     | 11           | 22%     | 12          | 22%
Avg. Interr. Size |             |         |              |         |             |
  0-9.9           | 34          | 69%     | 35           | 71%     | 32          | 59%
  10-19.9         | 8           | 16%     | 8            | 16%     | 9           | 17%
  20-29.9         | 4           | 8%      | 3            | 6%      | 1           | 2%
  30 up           | 3           | 6%      | 3            | 6%      | 12          | 22%
Precision         |             |         |              |         |             |
  .8-1            | 18          | 37%     | 16           | 33%     | 15          | 28%
  .6-.79          | 6           | 12%     | 11           | 22%     | 12          | 22%
  .4-.59          | 6           | 12%     | 6            | 12%     | 5           | 9%
  .2-.39          | 9           | 18%     | 8            | 16%     | 10          | 19%
  0-.19           | 10          | 20%     | 8            | 16%     | 12          | 22%
Percents may not add up to 100 due to rounding error.
Table 10. Catalog Size Results: Works
VARIABLE          | SMALL CATS. | PERCENT | MEDIUM CATS. | PERCENT | LARGE CATS. | PERCENT
Interruption No.  |             |         |              |         |             |
  0 to 2          | 33          | 55%     | 8            | 13%     | 1           | 2%
  3 to 5          | 23          | 38%     | 30           | 50%     | 20          | 33%
  6 to 8          | 4           | 7%      | 14           | 23%     | 17          | 28%
  9 to 11         | 0           | 0%      | 6            | 10%     | 8           | 13%
  12 up           | 0           | 0%      | 2            | 3%      | 14          | 23%
Intervening Recs. |             |         |              |         |             |
  0 to 9          | 45          | 75%     | 20           | 33%     | 13          | 22%
  10 to 19        | 7           | 12%     | 8            | 13%     | 5           | 8%
  20 to 29        | 2           | 3%      | 6            | 10%     | 2           | 3%
  30 up           | 6           | 10%     | 26           | 43%     | 40          | 67%
Avg. Interr. Size |             |         |              |         |             |
  0-9.9           | 38          | 63%     | 26           | 43%     | 27          | 45%
  10-19.9         | 15          | 25%     | 15           | 25%     | 16          | 27%
  20-29.9         | 3           | 5%      | 9            | 15%     | 11          | 18%
  30 up           | 4           | 7%      | 10           | 17%     | 6           | 10%
Precision         |             |         |              |         |             |
  .8-1            | 2           | 3%      | 0            | 0%      | 0           | 0%
  .6-.79          | 2           | 3%      | 2            | 3%      | 0           | 0%
  .4-.59          | 5           | 8%      | 3            | 5%      | 6           | 10%
  .2-.39          | 13          | 22%     | 13           | 22%     | 20          | 33%
  0-.19           | 38          | 63%     | 42           | 70%     | 34          | 57%
Percents may not add up to 100 due to rounding error.
Table 11. Catalog Size Results: Superworks
VARIABLE          | SMALL CATS. | PERCENT | MEDIUM CATS. | PERCENT | LARGE CATS. | PERCENT
Interruption No.  |             |         |              |         |             |
  0 to 2          | 26          | 43%     | 9            | 15%     | 9           | 15%
  3 to 5          | 14          | 23%     | 14           | 23%     | 7           | 12%
  6 to 8          | 9           | 15%     | 5            | 8%      | 7           | 12%
  9 to 11         | 7           | 12%     | 7            | 12%     | 9           | 15%
  12 up           | 4           | 7%      | 25           | 42%     | 28          | 47%
Intervening Recs. |             |         |              |         |             |
  0 to 9          | 38          | 63%     | 18           | 30%     | 16          | 27%
  10 to 19        | 7           | 12%     | 9            | 15%     | 6           | 10%
  20 to 29        | 8           | 13%     | 5            | 8%      | 5           | 8%
  30 up           | 7           | 12%     | 28           | 47%     | 33          | 55%
Avg. Interr. Size |             |         |              |         |             |
  0-9.9           | 59          | 98%     | 57           | 95%     | 54          | 90%
  10-19.9         | 1           | 2%      | 3            | 5%      | 5           | 8%
  20-29.9         | 0           | 0%      | 0            | 0%      | 0           | 0%
  30 up           | 0           | 0%      | 0            | 0%      | 1           | 2%
Precision         |             |         |              |         |             |
  .8-1            | 13          | 22%     | 12           | 20%     | 9           | 15%
  .6-.79          | 10          | 17%     | 9            | 15%     | 11          | 18%
  .4-.59          | 11          | 18%     | 9            | 15%     | 11          | 18%
  .2-.39          | 19          | 32%     | 19           | 32%     | 16          | 27%
  0-.19           | 7           | 12%     | 11           | 18%     | 13          | 22%
Percents may not add up to 100 due to rounding error.