Modeling Hyperarticulate Speech During Human-Computer Error Resolution
Abstract
Hyperarticulate speech to computers
remains a poorly understood phenomenon, in spite of its association with
elevated recognition errors. The present research analyzes the type and
magnitude of linguistic adaptations that occur when people engage in error
resolution with computers. A semi-automatic simulation method incorporating a
novel error generation capability was used to collect speech data immediately
before and after system recognition errors, and under conditions varying in
error base-rates. Data on original and repeated spoken input, matched on speaker and lexical content, were then examined for the type and magnitude of linguistic adaptations. Results indicated that speech during error resolution
primarily was longer in duration, including both elongation of the speech
segment and substantial relative increases in the number and duration of pauses.
It also contained more clear speech phonological features and fewer spoken
disfluencies. Implications of these findings are discussed for the development
of more user-centered and robust error handling in next-generation systems.
1. Introduction
One aim of recent research on spoken
language interfaces has been to identify hard-to-process sources of linguistic
variability in spontaneous speech, and to develop predictive models accounting
for these phenomena and corresponding interface techniques for reducing their
occurrence (Oviatt, Cohen & Wang, 1994; Oviatt, 1995). Although
hyperarticulate speech to computers has been noted informally and has been
associated with significantly elevated recognition errors (Shriberg, Wade &
Price, 1992), to date it remains an ill-defined concept. In particular, it is unclear how hyperarticulation should be defined in the context of human-computer interaction, in terms of the type and magnitude of linguistic adaptations that actually occur, especially during error resolution.
Furthermore, as a potentially difficult source of variability, its impact on
degradation of speech recognition rates is poorly understood. If people do
typically hyperarticulate while trying to resolve system errors, then
recognition rates would be expected to degrade as hyperarticulated speech
departs further from the original training data upon which a recognizer was
developed. To our knowledge, current speech recognizers invariably are trained
on original input, often collected under constrained or unnatural task
conditions, but omit any training on repetitions during simulated or actual
error conditions. Essentially, the widely advocated concept of "designing
for error" (Lewis & Norman, 1986) has not been applied effectively to
the design of spoken language systems, even though many researchers and
corporate designers regard error resolution to be the most challenging interface
problem facing this technology (Rhyne & Wolf, 1993).
Although literature on hyperarticulate
speech during human-computer error resolution currently is lacking, some
guidance is available from related research on how people adapt their speech
during human-human exchanges when they expect or experience a comprehension
failure from their listener. Systematic modifications have been documented in
parents' speech to infants and children (Fernald et al., 1989) and in speech to
the hearing impaired (Picheny et al., 1986), although the specific
accommodations differ depending on the target population. For example, spoken
adaptations to infants include elevated pitch, expanded pitch range, and stress
on new vocabulary content, phenomena that subserve attentional and teaching
functions. Speech to hearing-impaired individuals involves increased volume,
duration, and a shift from conversational to "clear speech"
articulatory patterns. At present, the profile of hyperarticulatory adaptations
that may occur during human-computer interaction simply is not known.
The goal of the present research was to
identify the type and magnitude of linguistic adaptations that occur during
human-computer interactions involving error resolution. It was hypothesized that
users' repetitions during error resolution would contrast with their original spoken input along selected linguistic dimensions, and also would be adapted toward clear speech acoustic/phonetic features.
More specifically, in the present study within-subject data were examined for
possible systematic adaptation toward longer duration, greater acoustic
intensity, and increased maximum frequency and frequency range during error
resolution. When speakers repeated the same lexical content after a simulated
recognition error, it also was predicted that clear speech phonological features
would increase and disfluencies correspondingly decrease, with these effects
magnified during a high error base-rate versus a low one. The long-term goal of
this research is the development of a user-centered predictive model of
linguistic adaptation during human-computer error resolution, as well as the
development of improved error handling capabilities for advanced
recognition-based interfaces.
2. Method
2.1. Subjects, Tasks, and Procedure
Twenty native English speakers, half
male and half female, participated as paid volunteers. Participants represented
a broad range of occupational backgrounds, excluding computer science.
A "Service Transaction
System" was simulated that could assist users with conference registration
and car rental transactions. After a general orientation, people were shown how
to enter information with a stylus, either clicking to speak or writing directly on
active areas of a form displayed on a Wacom LCD tablet. As input was received,
the system interactively confirmed the propositional content of requests by
displaying typed feedback in the appropriate input slot.
For example, if the system prompted
with Car pickup location:___________ and a person spoke "San
Francisco airport," then "SFO" was displayed immediately
after the utterance was completed. In the case of simulated errors, the system
instead responded with "????" feedback to indicate its failure
to recognize input. In this case, subjects were instructed to re-enter their
information into the same slot until system feedback was correct. Each simulated
error required 1-6 repeats before error resolution was successful, thereby simulating the spiraling errors characteristic of recognition-based systems. A form-based interface
was used during data collection so that the locus of system errors would be
clear to users. They were told that the system was a well-developed one with
extensive processing capabilities, so they could speak normally, express things
as they liked, work at their own pace, and just concentrate on completing their
job.
2.2. Semi-automatic Simulation Method
A semi-automatic simulation technique
was used for collecting data on spoken input during system error handling. Using
this technique, people's input was received by an informed assistant, who performed the role of a fully functional system. The simulation software supported rapid subject-paced interactions, with an average 0.4-second delay between a subject's input and the system's response. Technical
details of the simulation method have been provided elsewhere (Oviatt et al.,
1992), although its random error generation capability was adapted for this
study to simulate the appropriate base-rates and properties of recognition
errors.
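As a concrete illustration, the Python sketch below shows one way the random error generation just described could behave, assuming each input slot independently triggers an error at the condition's base-rate and that the 1-6 required repeats are drawn uniformly (the distribution actually used by the simulation software is not specified here).

```python
# Illustrative sketch only -- not the original simulation software.
import random

def simulate_slot(base_rate: float, rng: random.Random) -> int:
    """Return the number of repeats required for one input slot.

    base_rate -- probability that the slot triggers a simulated error
                 (0.065 in the low condition, 0.20 in the high one).
    """
    if rng.random() < base_rate:
        # Error "spiral": the ???? feedback recurs for 1-6 attempts;
        # the uniform draw is an assumption made for this sketch.
        return rng.randint(1, 6)
    return 0  # input confirmed immediately; no repeats needed

rng = random.Random(42)
errored = sum(1 for _ in range(100) if simulate_slot(0.20, rng) > 0)
print(f"{errored} errored slots out of 100 at a 20% base-rate")
```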
2.3. Research Design
The research design was a
within-subject factorial that included the following independent variables: (1) Error status of speech (Original input; Repeat input after error), and (2) Base-rate of system errors (Low: 6.5% of input slots; High: 20% of slots). All 20 subjects
completed 12 subtasks, half involving a low base-rate of errors and half a high
one, with the order counterbalanced across subjects. In total, data were
collected on 480 simulated errors, of which over 250 involved the same speaker
repeating identical lexical content during the first repetition of a repair
attempt. For these matched utterance pairs, original input provided a baseline
for assessing and quantifying the degree of change along linguistic dimensions
of interest.
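For clarity, a hypothetical rendering of this counterbalancing follows; the study's actual assignment of subtasks to condition orders is not detailed above, so the alternating scheme here is an assumption of the sketch.

```python
# Hypothetical counterbalancing sketch: 12 subtasks per subject, half at
# each error base-rate, with condition order alternated across subjects.
def condition_order(subject_idx: int) -> list[str]:
    low_block, high_block = ["low"] * 6, ["high"] * 6
    # Alternating block order across subjects is an assumption of this sketch.
    return low_block + high_block if subject_idx % 2 == 0 else high_block + low_block

for s in range(4):
    print(f"subject {s}: {condition_order(s)[0]} error base-rate first")
```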
2.4. Data Coding and Analysis
Speech input was collected using a
Crown microphone, and all human-computer interaction was videotaped and
transcribed. The speech segments of original and first repeat utterance pairs
were digitized, and software was used to align word boundaries automatically and
label each utterance. Most automatic alignments then were hand-adjusted further
by an expert phonetic transcriber. The ESPS Waves+ signal analysis package was
used to analyze amplitude and frequency, and the OGI Speech Tools were used for
duration. For the present acoustic/prosodic and phonetic analyses, only the
first repair was compared with original input, although disfluency rates were
based on all spoken repetitions that occurred during error resolution.
Duration. The
following were summarized: (1) total utterance duration, (2) total speech
segment duration (i.e., total duration minus pause duration), (3) total pause
duration for multi-word utterances, and (4) total number of pauses for
multi-word utterances. No attempt was made to code pauses less than 10 msec in
duration. Due to difficulty locating their onset, utterance-initial voiceless
stops and affricates were arbitrarily assigned a 20-msec closure, and no pauses
were coded as occurring immediately before utterance-medial voiceless stops and
affricates.
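The four duration measures can be made concrete with a short sketch. The interval-list representation and the labels below are hypothetical, not the format of the labeling tools named above.

```python
# Hypothetical sketch of the four duration measures, assuming each utterance
# is a list of labeled (label, onset_ms, offset_ms) intervals in which pauses
# appear as explicit "pause" intervals.
def duration_measures(intervals):
    total_ms = intervals[-1][2] - intervals[0][1]        # (1) total utterance duration
    pauses = [off - on for lab, on, off in intervals
              if lab == "pause" and (off - on) >= 10]    # pauses < 10 msec not coded
    pause_ms = sum(pauses)                               # (3) total pause duration
    n_pauses = len(pauses)                               # (4) number of pauses
    speech_ms = total_ms - pause_ms                      # (2) speech segment duration
    return total_ms, speech_ms, pause_ms, n_pauses

utt = [("nine", 0, 310), ("pause", 310, 420), ("two", 420, 700)]
print(duration_measures(utt))  # -> (700, 590, 110, 1)
```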
Intensity/Amplitude. Maximum
intensity was computed at the loudest point of each utterance using ESPS Waves+,
and then was converted to decibels. Values judged to be extraneous non-speech
sounds were excluded.
Fundamental Frequency. Spoken
input was coded for maximum F0, minimum F0, F0 range, and F0 average. The
fundamental frequency tracking software in ESPS Waves+ was used to calculate
values for voiced regions of the digitized speech signal. Pitch minima and
maxima were calculated automatically by program software, and then adjusted to
correct for pitch tracker errors such as spurious doubling and halving,
interjected non-speech sounds, and extreme glottalization affecting ≥ 5 tracking points. To avoid skewing due to line noise, only voiced speech within the coder-corrected F0 range was used to calculate F0 mean.
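A rough Python sketch of these F0 summary measures follows. The per-frame representation, the 60-400 Hz plausibility bounds, and the automatic octave folding are all assumptions standing in for the hand-correction of spurious doubling and halving described above.

```python
# Illustrative F0 summary sketch; assumes f0 is a list of per-frame pitch
# values in Hz, with 0 marking unvoiced frames. Bounds are hypothetical.
def f0_summary(f0, lo=60.0, hi=400.0):
    voiced = [v for v in f0 if v > 0]
    repaired = []
    for v in voiced:
        while v > hi:
            v /= 2.0   # fold back a spuriously doubled frame
        while v < lo:
            v *= 2.0   # fold back a spuriously halved frame
        repaired.append(v)
    f0_min, f0_max = min(repaired), max(repaired)
    # Mean computed only over the corrected voiced range, per the coding rules.
    f0_mean = sum(repaired) / len(repaired)
    return f0_min, f0_max, f0_max - f0_min, f0_mean

print(f0_summary([0.0, 112.0, 118.0, 236.0, 55.0, 0.0, 121.0]))
```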
Phonological Alternations. Phonological
changes within original-repeat utterance pairs that could be coded reliably by
ear without a spectrogram were categorized as either representing a shift from
conversational-to-clear speech style, or vice-versa. The following contrasting
categories were coded: (1) released and unreleased plosives, (2) unlenited
coronal plosives and alveolar flaps, and (3) presence versus absence of
segments. Alveolar flaps, deleted segments, and unreleased stops were considered
characteristic of conversational speech, whereas unlenited coronal plosives,
undeleted segments, and audibly released stops were indices of clear speech.
Phenomena considered difficult to code reliably were not included, such as
glottalization and glottal stop insertion.
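The coding scheme amounts to classifying each variant by the speech style it indexes and then determining the direction of shift within a pair; a minimal sketch, with shorthand variant labels of our own choosing rather than the coders' actual notation, is given below.

```python
# Illustrative classification of coded phonological variants by the speech
# style they index, following the contrasting categories listed above.
CLEAR = {"released stop", "unlenited coronal plosive", "segment present"}
CONVERSATIONAL = {"unreleased stop", "alveolar flap", "segment deleted"}

def shift_direction(original_variant: str, repeat_variant: str) -> str:
    if original_variant in CONVERSATIONAL and repeat_variant in CLEAR:
        return "conversational-to-clear"
    if original_variant in CLEAR and repeat_variant in CONVERSATIONAL:
        return "clear-to-conversational"
    return "no shift"

print(shift_direction("alveolar flap", "unlenited coronal plosive"))
```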
Disfluencies. Spoken
disfluencies were totaled for each subject and condition during original spoken
input as well as during error resolution (i.e., including all 1-6 repeats), and then were
converted to a rate per 100 words. The following types of disfluencies were
coded: (1) content self-corrections, (2) false starts, (3) repetitions, and (4)
filled pauses. For coding details, see Oviatt (1995).
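The rate conversion itself is simple; for concreteness (with invented counts):

```python
# Converting disfluency counts to the per-100-words rate used below.
def disfluency_rate(n_disfluencies: int, n_words: int) -> float:
    return 100.0 * n_disfluencies / n_words

print(round(disfluency_rate(3, 385), 2))  # 0.78 disfluencies per 100 words
```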
Reliability. For
all measures reported except amplitude, 10% to 100% of the data were
second-scored, with attention to sampling equally across conditions.
Acoustic/prosodic and phonological alternation measures were scored by linguists
familiar with the dependent measures and relevant software analysis tools. For
discrete classifications, such as number of pauses, disfluencies, and
phonological alternations, all inter-rater reliabilities exceeded 87%. For
phonological alternations, only cases agreed upon by both scorers were analyzed.
For fundamental frequency, the inter-rater reliability for minimum F0 was 90% with a 0 Hz departure, and for maximum F0 80% with a 3 Hz departure. For duration, pause length was an 80% match with less than a 50 msec departure, and total utterance duration an 80% match with less than a 40 msec departure.
3. Results
3.1. Duration
Spoken utterances in this corpus tended
to be brief fragments averaging two words, and ranging from 1-13 words in
length. When the error rate was low, total utterance duration averaged 1544 msec
and 1802 msec during original and repeat input, a gain of 16.5%, a significant
increase by paired t test on log transformed data, t = 7.05 (df = 49), p <
.001, one-tailed. When the base-rate of errors was high, the total utterance
duration averaged 1624 msec during original input, increasing to 1866 msec
during repeat input, a gain of 15%, which again was significant by paired t test
on log transformed data, t = 10.71 (df = 208), p < .001, one-tailed.
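For illustration, the following sketch reproduces the form of this analysis: a one-tailed paired t test on log-transformed durations of matched original-repeat pairs, together with the percent gain. The duration values are invented, and scipy/numpy are assumed available.

```python
# Sketch of the reported analysis on invented matched-pair durations (msec).
import numpy as np
from scipy import stats

original = np.array([1210.0, 1650.0, 1480.0, 1900.0, 1320.0])
repeat   = np.array([1400.0, 1880.0, 1700.0, 2150.0, 1500.0])

t, p_two = stats.ttest_rel(np.log(repeat), np.log(original))
p_one = p_two / 2 if t > 0 else 1 - p_two / 2  # one-tailed: repeats longer
gain = 100 * (repeat.mean() - original.mean()) / original.mean()
print(f"gain = {gain:.1f}%, t = {t:.2f}, one-tailed p = {p_one:.4f}")
```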
Speech Segment Duration. Analyses
revealed an increase in the total speech segment from an average of 1463 msec
during original input to 1653 msec during repeat input when the error rate was
low, a 13% gain, significant by paired t test on log transformed data, t = 7.44
(df = 50), p < .001, one-tailed. During a high error-rate, it also increased
from 1515 msec during original input to 1686 msec during repeat input, an 11.5%
gain, significant by paired t test on log transformed data, t = 10.20 (df =
215), p < .001, one-tailed.
Pause Duration. The total pause duration of multi-word utterances increased significantly from an average of 112 msec during original input to 209 msec during repeat input when the error rate was low, an 86.5% gain, significant by paired t test on log transformed data, t = 1.89 (df = 11), p < .05, one-tailed. It increased again from an average of 159 msec during original input to 261 msec during repeat input when the error rate was high, a 64% gain, significant by paired t test on log transformed data, t = 3.17 (df = 36), p < .002, one-tailed.
Number of Pauses. The
average number of pauses per multi-word utterance also increased significantly
from an average of 0.49 during original input to 1.06 during repeated speech
when the error rate was low, a 116% gain, significant by Wilcoxon Signed Ranks
test, z = 2.52 (N = 12), p < .006, one-tailed. During high error rates, it
increased from 0.57 to 0.95 during repetitions, or 67%, significant by Wilcoxon,
z = 3.03 (N = 16), p < .001, one-tailed. Figure 1 illustrates the average
relative gains in pause duration (73%) and pause interjection (91%) in relation
to average segment elongation (12%) in two typical utterances from the corpus.
To test for elongation of individual pauses (i.e., independent of interjecting
new ones), original and repeat utterance pairs matched on total number of pauses
were compared for total pause length. This analysis confirmed that pauses were
elongated significantly more in repeat utterances, paired t = 1.71 (df = 27), p < .05, one-tailed.
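A minimal sketch of the corresponding nonparametric comparison, using invented per-subject mean pause counts, is given below.

```python
# One-tailed Wilcoxon Signed Ranks test over invented per-subject means.
from scipy import stats

orig_pauses   = [0.40, 0.55, 0.30, 0.62, 0.48, 0.51, 0.45, 0.60, 0.38, 0.57, 0.49, 0.53]
repeat_pauses = [0.95, 1.20, 0.80, 1.10, 1.00, 1.15, 0.90, 1.30, 0.85, 1.05, 1.12, 0.98]

res = stats.wilcoxon(repeat_pauses, orig_pauses, alternative="greater")
print(f"W = {res.statistic}, one-tailed p = {res.pvalue:.4f}")
```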
Original: "San Diego" / Repeat: "San Diego" (elongated pause)
Original: "9 2 106" / Repeat: "9 2 1 0 6" (interjected pauses)
Figure 1: Pause elongation (top) and pause interjection (bottom) in matched original-repeat utterance pairs
3.2. Intensity/Amplitude
The maximum intensity averaged 70.9 dB
and 71.2 dB during original and repeat input when the error base-rate was low,
and 71.2 dB and 71.0 dB when it was high. Paired t tests on original versus
repeat speech revealed no significant change in intensity in either the low or
high error conditions, paired t = 1.14 (NS) and t = 1.59 (NS), respectively.
3.3. Fundamental Frequency
Pitch Maximum.
The maximum F0 did not differ significantly between original and repeat input in
either the low or high error conditions, by paired t test on log transformed
data, t = 1.58 (NS, one-tailed) and t = 1.35 (NS), respectively. Reanalysis
subdivided by gender confirmed this lack of a significant change in maximum F0
in both females and males.
Pitch Minimum. The minimum F0 did not differ in the low error-rate condition, averaging 111.3 Hz and 110.4 Hz, paired t < 1. However, minimum F0 dropped between original and repeat input in the high error-rate condition, averaging 122.2 Hz and 119.5 Hz, by paired t test on log transformed data, t = 1.96 (df = 221), p < .05, two-tailed. Although no significant differences were revealed in minimum F0 for female speech, or for male speech when errors were low, minimum F0 dropped marginally in male speech from 94.4 Hz to 91.3 Hz when the error rate was high, t = 1.86 (df = 123), p < .065, two-tailed.
Pitch Range.
The F0 range did not differ significantly between original and repeat speech for
either the low or high error conditions, t < 1. Reanalysis subdivided by
gender revealed only that, when repeating their input during error resolution,
female speech was significantly more expanded in pitch range when the error rate
was high rather than low, paired t = 3.89 (df = 10), p < .0015, one-tailed.
Male speech showed no expansion of F0 range under any condition.
Pitch Average. Average F0 dropped significantly between original and repeat input in the high error-rate condition, averaging 164.7 Hz and 162.4 Hz, paired t = 2.83 (df = 206), p < .005, two-tailed, although no difference was found in the low error-rate condition. No differences were found in females' speech, or in males' mean F0 when the error rate was low, although mean F0 dropped significantly in male speech from 128.9 Hz to 126.6 Hz during repeat input when the error rate was high, t = 2.47 (df = 108), p < .015, two-tailed.
3.4. Phonological Alternations
Approximately 9% of first repetitions
in this corpus contained phonological alternations, and 93% of subjects altered
their speech phonologically during error resolution at some point. Table 1
summarizes the number and type of alternations observed for each subject who had
a minimum of 12 scorable utterance pairs, as well as their classification by
direction of shift with respect to clear speech.
Table 1: Number and type of phonological alternations involving a shift toward conversational versus clear speech, listed by subject (a: unreleased t > released t; b: alveolar flap > coronal plosive; c: nasal stop or flap > nt sequence; d: nt sequence > nasal stop or flap)
The majority of subjects who adapted their speech, or 86%, shifted from a conversational to a clear speech style rather than the reverse, a significant difference by Wilcoxon Signed Ranks test, T+ = 87.5 (N = 13), p < .001, one-tailed. After correcting for the difference in total errors sampled, analysis of clear-speech adaptations revealed that they were significantly more prevalent in the high error-rate condition than the low one, as indicated by Wilcoxon Signed Ranks test, T+ = 67 (N = 12), p < .015, one-tailed. As shown in Figure 2, a comparison of the average rate of clear-speech adaptations per 100 words revealed a 163% increase, from 0.95 in the low error-rate condition to 2.50 in the high one.
3.5. Disfluencies
The disfluency rate during original
spoken input averaged 0.78 disfluencies per 100 words, the same as previously
reported for structured interfaces (Oviatt, 1995). However, it dropped
significantly to 0.37 when repeating during error resolution, paired t = 2.03 (df
= 19), p < .03, one-tailed. In the low error-rate condition, disfluencies
averaged 0.85 per 100 words, similar to 0.78 previously reported with low errors
(Oviatt, 1995). However, the rate dropped significantly to 0.53 when the error-rate
was high, paired t = 1.90 (df = 19), p < .04, one-tailed. Figure 2
illustrates the inverse relation between clear-speech phonological alternations
and spoken disfluencies as a function of error base-rate. Since the disfluency
rate is influenced by both original versus repeat input and by the overall
base-rate of errors, and since the high error-rate condition contained 3-fold
more errors than the low one, a comparison also was evaluated between the high
and low conditions with all errors removed (i.e., comparing only baseline
original input). In this more controlled comparison, the disfluency rate was
confirmed to decrease significantly when the error-rate became high, paired t =
2.38 (df = 19), p < .03, one-tailed.
Figure 2:
Rate of spoken disfluencies and phonological alternations per 100 words as a
function of error base-rate
4. Discussion
During error resolution with computers,
human speech shifts to become lengthier and more clearly articulated. An
increase in total utterance duration was documented during repetitions,
including an average 12% elongation in the speech segment, a 73% elongation in
pause duration, and interjection of 91% more utterance-internal pauses. Clearly,
pause characteristics constituted the most salient relative change during
repetitions. In addition, the phonological features of repeat speech adapted
toward an audibly clearer articulation pattern. In the present corpus, the most
frequently observed changes included fortition of alveolar flaps to coronal plosives (e.g., [eɪɾeɪt] "eight eight" changing to [eɪt eɪt]), and shifts to unreduced nt sequences (e.g., [twɛɾ̃i] "twenty" changing to [twɛnti]).
This shift to clear speech during error resolution also corresponded with a drop
in spoken disfluencies. When the error rate was high, which required correcting
1 in 5 slots rather than 1 in 15, both the increase in clear-speech adaptations
and the decrease in disfluencies were accentuated further. Female pitch also was
found to expand in range when resolving errors during a chronically high
error-rate, an adaptation also found in female speech to children. However,
unlike some error resolution between humans, speakers did not alter their volume
when resolving errors with the computer.
The hyperarticulate speech documented
in this research presents a potentially difficult source of variability that may
degrade the performance of current speech recognizers, in particular
complicating recognizers' ability to resolve errors gracefully. This research
has implications for the development of more user-centered recognition
algorithms, the collection of more realistic speech data with interactive
systems varying in error base-rate, and the design of recognizers specialized
for error handling that may become part of a coordinated system of multiple
recognizers. Further work is needed on quantitative modeling of the durational
and articulatory phenomena identified in this study, which could contribute to
establishing more user-centered and robust next-generation spoken language
systems. Additional research also is needed on spoken adaptations following
other types of recognition errors, such as substitutions, to assess the
generality of findings identified in this work. Finally, the fuller spectrum of
error handling options and potential benefits needs to be explored when speech
is incorporated as one of several input modes in a multimodal interface (see
Oviatt & VanGent, 1996, this volume).
References
1. Fernald, A., Taeschner, T., Dunn, J., Papousek, M., De Boysson-Bardies, B. & Fukui, I. A cross-language study of prosodic modifications in mothers' and fathers' speech to preverbal infants. Journal of Child Language, 1989, 16, 477-501.
2. Lewis, C. & Norman, D. A. Designing for error. In User-Centered System Design (ed. by D. A. Norman & S. W. Draper), Lawrence Erlbaum: Hillsdale, N.J., 1986, 411-432.
3. Oviatt, S. L. Predicting spoken disfluencies during human-computer interaction. Computer Speech and Language, 1995, 9(1), 19-35.
4. Oviatt, S. L., Cohen, P. R., Fong, M. W. & Frank, M. P. A rapid semi-automatic simulation technique for investigating interactive speech and handwriting. In Proceedings of the International Conference on Spoken Language Processing (ed. by J. Ohala et al.), University of Alberta, 1992, vol. 2, 1351-1354.
5. Oviatt, S. L., Cohen, P. R. & Wang, M. Q. Toward interface design for human language technology: Modality and structure as determinants of linguistic complexity. Speech Communication, 1994, 15(3-4), 283-300.
6. Oviatt, S. L. & VanGent, R. Error resolution during multimodal human-computer interaction. In Proceedings of the International Conference on Spoken Language Processing, 1996, in press.
7. Picheny, M. A., Durlach, N. I. & Braida, L. D. Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research, 1986, 29, 434-446.
8. Rhyne, J. R. & Wolf, C. G. Recognition-based user interfaces. In Advances in Human-Computer Interaction, Vol. 4 (ed. by H. R. Hartson & D. Hix), Ablex Publishing Corp.: Norwood, N.J., 1993, 191-250.
9. Shriberg, E., Wade, E. & Price, P. Human-machine problem solving using spoken language systems (SLS): Factors affecting performance and user satisfaction. In Proceedings of the DARPA Speech and Natural Language Workshop, Morgan Kaufmann: San Mateo, CA, 1992, 49-54.