Body-Based Input
In the last chapter, I introduced Weiser's vision of ubiquitous computing and argued that part of fulfilling it was fully exploiting the versatility of our hands. In this chapter, we explore the sources of human action other than hands (speech, eye gaze, limbs, and entire bodies) and the range of ways this action has been channeled as input to computers. As we shall see, many of the same gulfs of execution and evaluation arise with body-based input as with hands. Part of the reason is that all body-based input relies on probabilistically recognizing the actions of muscles. When we speak or make other sounds, we use the muscles in our throat, mouth, and face. When we look with our eyes, we use muscles to shift our gaze, to blink, to squint, and to keep our eyes closed. We use our muscles to move and arrange our limbs. Perhaps the only action we perform that isn't driven by muscles is thought. Our brain drives our muscles through electrical signals; muscle activity can be detected on the skin through techniques such as electromyography (EMG), and brain activity itself through electroencephalography (EEG). All of these forms of action are central to our existence in a physical world. Researchers have explored how to leverage these muscle-based signals to act in virtual worlds.
Speech
Aside from our hands, one of the most versatile ways we use our muscles to engage with our environment is with our voice. We use it to speak, to hum, to sing, and to make other non-verbal sounds to communicate with humans and other living things. Why not computers?
Voice-based interaction has been a dream for decades and long imagined in science fiction. Only after decades of progress on speech recognition, from early work at Bell Labs in the 1950s to continued research in academia and industry today, did voice interfaces become reliable enough for interaction. Before the now-ubiquitous digital voice assistants on smartphones and smart speakers, there were speech recognition programs that dictated text and phone-based interfaces that listened for basic commands and numbers.
The general process for speech recognition begins with an audio sample. The computer records some speech, encoded as raw sound waves, just like a recording in a voice memo application. Before speech recognition algorithms even try to recognize speech, they apply several techniques to "clean" the recording, distinguishing foreground speech from background noise and removing the background. They segment the sound into utterances separated by silence, which can be classified more easily. They rely on large databases of phonetic patterns, which define the kinds of sounds used in a particular natural language, and use machine learning techniques to classify these phonetic utterances. Finally, once these phonemes are classified, they try to recognize sequences of them as particular words. More advanced techniques used in modern speech recognition may also analyze an entire sequence of utterances to infer the most likely nouns, noun phrases, and other parts of a sentence, based on the rest of the sentence's content.
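To make the early stages of this pipeline concrete, here is a minimal Python sketch of the "segment by silence" step, assuming 16 kHz mono audio already loaded into a NumPy array. The frame size, hop, and energy threshold are illustrative assumptions, not values from any particular recognizer, which would use far more robust voice-activity detection.

```python
import numpy as np

def segment_utterances(samples, rate=16000, frame_ms=25, hop_ms=10,
                       energy_threshold=0.01, min_silence_frames=30):
    """Split audio into (start_sample, end_sample) spans separated by silence."""
    frame = int(rate * frame_ms / 1000)   # samples per analysis frame
    hop = int(rate * hop_ms / 1000)       # step between frames
    # Short-time energy per frame: a crude stand-in for separating speech from background.
    energies = np.array([np.mean(samples[i:i + frame] ** 2)
                         for i in range(0, len(samples) - frame, hop)])
    voiced = energies > energy_threshold

    utterances, start, silence_run = [], None, 0
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i * hop           # utterance begins at the first voiced frame
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # Enough consecutive silence: close the utterance where the voice stopped.
                utterances.append((start, (i - silence_run) * hop + frame))
                start, silence_run = None, 0
    if start is not None:
        utterances.append((start, len(samples)))
    return utterances

# Example: quiet noise, a louder one-second "utterance", then quiet noise again.
audio = np.concatenate([np.random.randn(16000) * 0.01,
                        np.random.randn(16000) * 0.5,
                        np.random.randn(16000) * 0.01])
print(segment_utterances(audio))   # one span covering roughly samples 16000-32000
```

Each returned span would then be handed to the phonetic classification stages described above.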
Whether speech recognition works well depends on what data is used to train the recognizers. For example, if the data only includes people with particular accents, recognition will work well for those accents and not others. The same applies to pronunciation variation: for a speech recognizer to handle both the American "aluminum" and the British "aluminium", both pronunciations need to be in the sample data. This lack of diversity in training data is a significant source of recognition failure. It is also a source of gulfs of execution, as it is not always clear, when speaking to a recognition engine, what it was trained on and therefore what enunciation might be necessary to get it to properly recognize a word or phrase.
Of course, even accounting for diversity, technologies for recognizing speech are insufficient on their own to be useful. Early HCI research explored how to translate graphical user interfaces into speech-based interfaces to make GUIs accessible to people who were blind or low vision 16.
Voice can be useful far beyond giving simple commands. Researchers have explored its use in multi-modal contexts in which a user both points and gives speech commands to control complex interfaces 2,23. Others have built natural language interfaces for analyzing and visualizing data 23,5, hands-free voice-driven drawing applications for people with motor impairments 8, and have even studied the experiential qualities of whispering to voice assistants 18.
Of course, in all of these voice interactions, there are fundamental gulfs of execution and evaluation. Early work on speech interfaces 25 identified many of these issues, from discovering what a system can understand to recovering from recognition errors.
Gaze
Other useful muscles are those in our eyes that control our gaze. Gaze is inherently social information: it refers to inferences about where someone is looking based on the position of their pupils relative to targets in a space. Gaze tracking technology usually involves machine vision techniques, using infrared illumination of the pupils, then time series analysis and individual calibration to track pupil movements over time. Gaze has many useful applications in settings like virtual reality. For example, in a game, gaze detection might allow non-player characters to notice when you are looking at them and respond to you. In the same way, gaze information might be propagated to a player's avatar, moving its eyes as the player's eyes move, helping other human players in social and collaborative games know when a teammate is looking their way.
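As a simple illustration of that last example, here is a hypothetical Python sketch of how a game might decide that a player is looking at a non-player character, given an estimated eye position and gaze direction from an eye tracker. The 5-degree cone is an illustrative assumption, and a real engine would also check for occlusion.

```python
import numpy as np

def is_looking_at(eye_pos, gaze_dir, target_pos, max_angle_deg=5.0):
    """Return True if the gaze ray points within max_angle_deg of the target."""
    to_target = np.asarray(target_pos, dtype=float) - np.asarray(eye_pos, dtype=float)
    to_target /= np.linalg.norm(to_target)
    gaze = np.asarray(gaze_dir, dtype=float)
    gaze /= np.linalg.norm(gaze)
    # Compare the angle between the gaze direction and the direction to the target.
    return float(np.dot(gaze, to_target)) >= np.cos(np.radians(max_angle_deg))

# The player stands at the origin looking down the z axis; one NPC is nearly on-axis,
# the other well off to the side.
print(is_looking_at(eye_pos=[0, 1.7, 0], gaze_dir=[0, 0, 1], target_pos=[0.1, 1.7, 5]))  # True
print(is_looking_at(eye_pos=[0, 1.7, 0], gaze_dir=[0, 0, 1], target_pos=[3.0, 1.7, 5]))  # False
```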
Gaze recognition techniques are similar in concept to speech recognition in that they rely on machine learning and large data sets. However, rather than training on speech utterances, these techniques use data sets of eye images to detect and track the movement of pupils over time. The quality of this tracking depends heavily on the quality of cameras, as pupils are small and their movements are even smaller. Pupil movement is also quite fast, so cameras need to record at a high frame rate to monitor it. Eyes also have very particular movements, such as saccades, which are ballistic motions that abruptly shift from one point of fixation to another. Most techniques overcome the challenges imposed by these dynamic properties of eye movement by aggregating and averaging movement over time and using that as input.
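One common form of that aggregation is grouping raw gaze samples into fixations and discarding the saccades in between. The sketch below shows a simplified, dispersion-based approach in Python; the dispersion and duration thresholds are illustrative assumptions, and production trackers use more careful variants of this idea.

```python
import numpy as np

def detect_fixations(gaze_points, timestamps, max_dispersion=0.02, min_duration=0.1):
    """Group raw gaze samples into fixations, skipping the saccades in between.

    gaze_points: (N, 2) array of gaze positions in normalized screen coordinates.
    timestamps:  (N,) array of sample times in seconds.
    Returns a list of (start_time, end_time, centroid) tuples.
    """
    fixations, start = [], 0
    while start < len(gaze_points):
        i = start
        while i < len(gaze_points):
            window = gaze_points[start:i + 1]
            # Dispersion: total spread of the samples currently in the window.
            dispersion = ((window[:, 0].max() - window[:, 0].min()) +
                          (window[:, 1].max() - window[:, 1].min()))
            if dispersion > max_dispersion:
                break
            i += 1
        # The window [start, i) stayed within the dispersion limit.
        duration = timestamps[i - 1] - timestamps[start]
        if i - start > 1 and duration >= min_duration:
            fixations.append((timestamps[start], timestamps[i - 1],
                              gaze_points[start:i].mean(axis=0)))
        start = max(i, start + 1)
    return fixations

# Example: 200 Hz samples that dwell near one point, jump, then dwell near another.
t = np.arange(0, 1.0, 0.005)
points = np.where(t[:, None] < 0.5, [0.5, 0.5], [0.8, 0.2]) + np.random.randn(len(t), 2) * 0.001
print(detect_fixations(points, t))   # roughly two fixations, one per dwell
```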
Researchers have exploited gaze detection as a form of hands-free interaction, usually discriminating between looking and acting with dwell times: if someone looks at something long enough, it is interpreted as an interaction (sketched in the code below). For example, some work has used dwell times at the top and bottom of a window to support gaze-controlled scrolling through content 11. Other work has explored gaze-only alternatives to mouse clicking 13, accurate gaze estimation and seamless gaze interaction across multiple displays 12, and combinations of gaze and touch on tablets 20.
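Here is a minimal Python sketch of a dwell-time selector. The 0.8-second dwell threshold and the idea of firing once per continuous dwell are illustrative assumptions, not parameters from the cited systems.

```python
class DwellSelector:
    """Fire a target once gaze has rested on it continuously for dwell_time seconds."""

    def __init__(self, dwell_time=0.8):
        self.dwell_time = dwell_time
        self.current_target = None
        self.enter_time = None
        self.fired = False

    def update(self, target, now):
        """Call with every gaze sample; returns a target id when a dwell completes."""
        if target != self.current_target:
            # Gaze moved to a different target (or to empty space): restart the timer.
            self.current_target, self.enter_time, self.fired = target, now, False
            return None
        if (target is not None and not self.fired
                and now - self.enter_time >= self.dwell_time):
            self.fired = True   # fire once per continuous dwell
            return target
        return None

# Example: simulated gaze samples every 10 ms resting on a "scroll down" region.
selector = DwellSelector(dwell_time=0.8)
for step in range(120):
    selected = selector.update("scroll_down", now=step * 0.01)
    if selected:
        print(f"dwell selection fired on {selected!r} at t={step * 0.01:.2f}s")
```

A gaze-scrolling interface like the one cited above might register the regions at the top and bottom of a window as targets and scroll while a dwell is in progress.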
Unfortunately, designing for gaze can require sophisticated knowledge of the constraints on human perception. For example, one effort to design a hands-free, gaze-based drawing tool for people with motor disabilities ended up requiring the careful design of commands to avoid frustrating error handling and disambiguation controls 9. Other work has looked beyond the point of fixation altogether, exploring peripheral awareness as a neurological state for human-computer integration 1.
Limbs
Whereas our hands, voices, and eyes are all very specialized for social interaction, the other muscles in our limbs are more for movement. That said, they also offer rich opportunities for interaction. Because muscles are activated using electrical signals, techniques like electromyography (EMG) can be used to sense muscle activity through electrodes placed on the skin. For example, this technique was used to sense motor actions in the forearm, discriminating between different grips to support applications like games 22.
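To illustrate the general shape of such muscle-computer interfaces, here is a toy Python sketch that classifies windows of multi-channel EMG into two grips. The synthetic eight-channel signals, the RMS feature, and the support vector classifier are all illustrative assumptions, not the pipeline from the cited work.

```python
import numpy as np
from sklearn.svm import SVC

def emg_features(window):
    """Per-channel root-mean-square amplitude, a simple and common EMG feature."""
    return np.sqrt(np.mean(window ** 2, axis=0))

def make_windows(signal, window_size=200, hop=100):
    """Slice a (samples, channels) recording into overlapping analysis windows."""
    return [signal[i:i + window_size] for i in range(0, len(signal) - window_size, hop)]

# Synthetic stand-in for two grips recorded from 8 forearm electrodes: each grip
# activates the channels with a different amplitude profile.
rng = np.random.default_rng(0)
profile_a = np.array([1.0, 0.8, 0.2, 0.1, 0.1, 0.2, 0.6, 0.9])
profile_b = np.array([0.2, 0.3, 0.9, 1.0, 0.8, 0.3, 0.1, 0.1])
grip_a = rng.normal(0.0, 1.0, (4000, 8)) * profile_a
grip_b = rng.normal(0.0, 1.0, (4000, 8)) * profile_b

X = np.array([emg_features(w) for grip in (grip_a, grip_b) for w in make_windows(grip)])
y = np.array([label for label, grip in enumerate((grip_a, grip_b)) for _ in make_windows(grip)])

classifier = SVC().fit(X, y)
print(classifier.predict([emg_features(make_windows(grip_b)[0])]))  # most likely [1]
```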
Other ideas have focused on the tongue, sensing its movements with optical sensors and a mouth retainer 21, or building tongue input devices for creating conversations 24. Related work has explored low-cost, localized blowable user interfaces 19 and used wireless signals to enable non-intrusive, flexible detection of facial gestures 6.
While these explorations of limb-based interaction have all been demonstrated to be feasible to a degree, as with many new interactions, they impose many gulfs of execution and evaluation that have not yet been explored:
- How can users learn what a system is trying to recognize about limb movement?
- How much will users be willing to train the machine-learned classifiers used to recognize movement?
- When errors inevitably occur, how can we support people in recovering from them, while also improving classification in the future?
Body
Other applications have moved beyond specific muscles to the entire body and its skeletal structure. The most widely known example of this is the Microsoft Kinect, shown in the image at the beginning of this chapter. The Kinect used a range of cameras and infrared projectors to create a depth map of a room, including the structure and posture of the people in it. Using this depth map, it built basic skeletal models of players, including the estimated position and orientation of arms, legs, and heads. This information was then available in real time for games to use as a source of input (e.g., mapping the skeletal model onto an avatar, or using skeletal gestures to invoke commands).
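For example, a game might turn a skeletal model into a command with a check as simple as the hypothetical Python sketch below, which assumes joints arrive as a dictionary of 3D positions in meters with y pointing up. The joint names and margin are illustrative, not the Kinect SDK's actual API.

```python
def hands_above_head(skeleton, margin=0.05):
    """Return True if either hand joint is above the head joint (y is up, in meters)."""
    head_y = skeleton["head"][1]
    return any(skeleton[joint][1] > head_y + margin
               for joint in ("hand_left", "hand_right"))

# One frame of a (made-up) skeletal model: joint name -> (x, y, z) position in meters.
frame = {
    "head":       (0.00, 1.70, 2.5),
    "hand_left":  (-0.30, 1.10, 2.4),
    "hand_right": (0.25, 1.85, 2.5),
}
print(hands_above_head(frame))   # True: the right hand is raised above the head
```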
But the Kinect was just one technology to emerge from a whole range of explorations of body-based sensing. For example, researchers have explored whole-body gestures, allowing for hands-free interaction. One technique uses the human body as an antenna for sensing, requiring no instrumentation of the environment and only a little instrumentation of the user 4.
Similar techniques sense a person tapping patterns on their body 3 or adapt the skin itself as a soft interface 17. Others sense interaction through off-body static electric fields 15, combine gaze with foot input to navigate zoomable information spaces hands-free 10, and even support spatial interaction with empty hands and without visual feedback 7.
Whole-body interactions pose some unique gulfs of execution and evaluation. For example, a false positive for a hand or limb-based gesture might mean unintentionally invoking a command. We might not expect these to be too common, as we often move our hands and limbs deliberately when we want to communicate. But we use our bodies all the time, moving, walking, and changing our posture. If recognition algorithms aren't tuned to filter out the broad range of things we do with our bodies in everyday life, the contexts in which body-based input can be used may be severely limited.
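One crude way a system might filter out everyday movement is to accept a recognized gesture only when the recognizer is confident and the rest of the body is roughly still. The Python sketch below illustrates the idea; the thresholds and the notion of gating on average joint speed are illustrative assumptions, not a technique from the cited work.

```python
import numpy as np

def is_deliberate(gesture_confidence, joint_velocities,
                  min_confidence=0.9, max_body_speed=0.3):
    """Accept a recognized gesture only if the recognizer is confident and the
    body as a whole is roughly still (i.e., the user is not walking or shifting)."""
    mean_speed = float(np.mean(np.linalg.norm(joint_velocities, axis=1)))
    return gesture_confidence >= min_confidence and mean_speed <= max_body_speed

# Slow joint velocities (standing still) versus fast ones (mid-stride), in m/s.
still = np.full((20, 3), 0.05)
walking = np.full((20, 3), 0.60)
print(is_deliberate(0.95, still))    # True: confident recognition while still
print(is_deliberate(0.95, walking))  # False: the same recognition during everyday motion
```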
This survey of body-based input techniques shows a wide range of possible ways of providing input. However, like any new interface ideas, there are many unresolved questions about how to train people to use them and how to support error recovery. There are also many questions about the contexts in which such human-computer interaction might be socially acceptable. For example, how close do we want to be with computers? Researchers have begun to frame such questions as matters of human-computer integration 14, imagining at least two possible relationships between people and machines:
- Symbiosis might entail humans and digital technology working together, in which software works on our behalf and, in return, we maintain and improve it. For example, the cases above in which computers exhibit agency of their own (e.g., noticing things that we miss and telling us) are a kind of symbiosis.
- Fusion might entail using computers to extend our bodies and bodily experiences. Many of the techniques described above might be considered fusion, in that they expand our physical abilities.
Do you want this future? If so, the maturity of these ideas is not quite sufficient to bring them to market. If not, what kind of alternative visions might you imagine to prevent these futures?
References
- Josh Andres, m.c. Schraefel, Nathan Semertzidis, Brahmi Dwivedi, Yutika C Kulwe, Juerg von Kaenel, Florian 'Floyd' Mueller (2020). Introducing Peripheral Awareness as a Neurological State for Human-Computer Integration. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Richard A. Bolt (1980). "Put-that-there": Voice and gesture at the graphics interface. ACM Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
- Xiang 'Anthony' Chen and Yang Li (2016). Bootstrapping user-defined body tapping recognition with offline-learned probabilistic representation. ACM Symposium on User Interface Software and Technology (UIST).
- Gabe Cohn, Daniel Morris, Shwetak Patel, Desney Tan (2012). Humantenna: using the body as an antenna for real-time whole-body interaction. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Tong Gao, Mira Dontcheva, Eytan Adar, Zhicheng Liu, Karrie G. Karahalios (2015). DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization. ACM Symposium on User Interface Software and Technology (UIST).
- Mayank Goel, Chen Zhao, Ruth Vinisha, Shwetak N. Patel (2015). Tongue-in-Cheek: Using Wireless Signals to Enable Non-Intrusive and Flexible Facial Gestures Detection. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Sean Gustafson, Daniel Bierwirth, Patrick Baudisch (2010). Imaginary interfaces: spatial interaction with empty hands and without visual feedback. ACM Symposium on User Interface Software and Technology (UIST).
- Susumu Harada, Jacob O. Wobbrock, James A. Landay (2007). VoiceDraw: a hands-free voice-driven drawing application for people with motor impairments. ACM SIGACCESS Conference on Computers and Accessibility.
- Anthony J. Hornof and Anna Cavender (2005). EyeDraw: enabling children with severe motor impairments to draw with their eyes. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Konstantin Klamka, Andreas Siegel, Stefan Vogt, Fabian Göbel, Sophie Stellmach, Raimund Dachselt (2015). Look & Pedal: Hands-free Navigation in Zoomable Information Spaces through Gaze-supported Foot Input. ACM International Conference on Multimodal Interaction (ICMI).
- Manu Kumar and Terry Winograd (2007). Gaze-enhanced scrolling techniques. ACM Symposium on User Interface Software and Technology (UIST).
- Christian Lander, Sven Gehring, Antonio Krüger, Sebastian Boring, Andreas Bulling (2015). GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays. ACM Symposium on User Interface Software and Technology (UIST).
- Christof Lutteroth, Moiz Penkar, Gerald Weber (2015). Gaze vs. mouse: A fast and accurate gaze-only click alternative. ACM Symposium on User Interface Software and Technology (UIST).
- Florian Floyd Mueller, Pedro Lopes, Paul Strohmeier, Wendy Ju, Caitlyn Seim, Martin Weigel, Suranga Nanayakkara, Marianna Obrist, Zhuying Li, Joseph Delfa, Jun Nishida, Elizabeth M. Gerber, Dag Svanaes, Jonathan Grudin, Stefan Greuter, Kai Kunze, Thomas Erickson, Steven Greenspan, Masahiko Inami, Joe Marshall, Harald Reiterer, Katrin Wolf, Jochen Meyer, Thecla Schiphorst, Dakuo Wang, and Pattie Maes (2020). Next Steps for Human-Computer Integration. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Adiyan Mujibiya and Jun Rekimoto (2013). Mirage: exploring interaction modalities using off-body static electric field sensing. ACM Symposium on User Interface Software and Technology (UIST).
- Elizabeth D. Mynatt and W. Keith Edwards (1992). Mapping GUIs to auditory interfaces. ACM Symposium on User Interface Software and Technology (UIST).
- Masa Ogata, Yuta Sugiura, Yasutoshi Makino, Masahiko Inami, Michita Imai (2013). SenSkin: adapting skin as a soft interface. ACM Symposium on User Interface Software and Technology (UIST).
- Emmi Parviainen (2020). Experiential Qualities of Whispering with Voice Assistants. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Shwetak N. Patel and Gregory D. Abowd (2007). Blui: low-cost localized blowable user interfaces. ACM Symposium on User Interface Software and Technology (UIST).
- Ken Pfeuffer and Hans Gellersen (2016). Gaze and Touch Interaction on Tablets. ACM Symposium on User Interface Software and Technology (UIST).
- T. Scott Saponas, Daniel Kelly, Babak A. Parviz, and Desney S. Tan (2009). Optically sensing tongue gestures for computer input. ACM Symposium on User Interface Software and Technology (UIST).
- T. Scott Saponas, Desney S. Tan, Dan Morris, Ravin Balakrishnan, Jim Turner, and James A. Landay (2009). Enabling always-available input with muscle-computer interfaces. ACM Symposium on User Interface Software and Technology (UIST).
- Vidya Setlur, Sarah E. Battersby, Melanie Tory, Rich Gossweiler, Angel X. Chang (2016). Eviza: A Natural Language Interface for Visual Analysis. ACM Symposium on User Interface Software and Technology (UIST).
- Ronit Slyper, Jill Lehman, Jodi Forlizzi, Jessica Hodgins (2011). A tongue input device for creating conversations. ACM Symposium on User Interface Software and Technology (UIST).
- Nicole Yankelovich, Gina-Anne Levow, Matt Marx (1995). Designing SpeechActs: issues in speech user interfaces. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).