Body-Based Input
In the last chapter, I introduced Weiser's vision of ubiquitous computing and argued that part of fulfilling it was fully exploiting the versatility of our hands. In this chapter, we explore the sources of human action other than hands (speech, eye gaze, limbs, and entire bodies) and the range of ways this action has been channeled as input to computers. As we shall see, many of the same gulfs of execution and evaluation arise with body-based input as with hands. Part of the reason is that all body-based input relies on probabilistically recognizing the actions of muscles. When we speak or make other sounds, we use the muscles in our throat, mouth, and face. When we look with our eyes, we use muscles to shift our gaze, to blink, to squint, and to keep our eyes closed. We use our muscles to move and arrange our limbs. Perhaps the only action we perform that isn't driven by muscles is thought. Our brain drives our muscles through electrical signals; muscle activity can be detected on the skin through techniques such as electromyography (EMG), and brain activity itself through electroencephalography (EEG). All of these forms of action are central to our existence in a physical world. Researchers have explored how to leverage these muscle-based signals to act in virtual worlds.
Speech
Aside from our hands, one of the most versatile ways we use our muscles to engage with our environment is with our voice. We use it to speak, to hum, to sing, and to make other non-verbal sounds to communicate with humans and other living things. Why not computers?
Voice-based interaction has been a dream for decades and long imagined in science fiction. Only after decades of progress on speech recognition, from early work at Bell Labs in the 1950s to continued research in academia and industry today, did voice interfaces become reliable enough for interaction. Before the now-ubiquitous digital voice assistants on smartphones and smart speakers, there were speech recognition programs that dictated text and phone-based interfaces that listened for basic commands and numbers.
The general process for speech recognition begins with an audio sample. The computer records some speech, encoded as raw sound waves, just like a recording in a voice memo application. Before speech recognition algorithms even try to recognize speech, they apply several techniques to "clean" the recording, distinguishing foreground speech from background noise and removing the background. They segment the sound into utterances separated by silence, which can be classified more easily. They rely on large databases of phonetic patterns, which define the kinds of sounds used in a particular natural language, and use machine learning techniques to classify these phonetic utterances. Finally, once these phonemes are classified, they try to recognize sequences of them as particular words. More advanced techniques used in modern speech recognition may also analyze an entire sequence of utterances to infer the most likely nouns, noun phrases, and other parts of a sentence, based on the rest of the sentence's content.
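To make the early stages of this pipeline concrete, here is a minimal Python sketch of the "segment by silence" step, assuming 16 kHz mono audio already loaded into a NumPy array. The frame size, hop, and energy threshold are illustrative assumptions, not values from any particular recognizer, which would use far more robust voice-activity detection.

```python
import numpy as np

def segment_utterances(samples, rate=16000, frame_ms=25, hop_ms=10,
                       energy_threshold=0.01, min_silence_frames=30):
    """Split audio into (start_sample, end_sample) spans separated by silence."""
    frame = int(rate * frame_ms / 1000)   # samples per analysis frame
    hop = int(rate * hop_ms / 1000)       # step between frames
    # Short-time energy per frame: a crude stand-in for separating speech from background.
    energies = np.array([np.mean(samples[i:i + frame] ** 2)
                         for i in range(0, len(samples) - frame, hop)])
    voiced = energies > energy_threshold

    utterances, start, silence_run = [], None, 0
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i * hop           # utterance begins at the first voiced frame
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # Enough consecutive silence: close the utterance where the voice stopped.
                utterances.append((start, (i - silence_run) * hop + frame))
                start, silence_run = None, 0
    if start is not None:
        utterances.append((start, len(samples)))
    return utterances

# Example: quiet noise, a louder one-second "utterance", then quiet noise again.
audio = np.concatenate([np.random.randn(16000) * 0.01,
                        np.random.randn(16000) * 0.5,
                        np.random.randn(16000) * 0.01])
print(segment_utterances(audio))   # one span covering roughly samples 16000-32000
```

Each returned span would then be handed to the phonetic classification stages described above.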
Whether speech recognition works well depends on what data is used to train the recognizers. For example, if the data only includes people with particular accents, recognition will work well for those accents and not others. The same applies to pronunciation variation: for a speech recognizer to handle both the American "aluminum" and the British "aluminium", both pronunciations need to be in the sample data. This lack of diversity in training data is a significant source of recognition failure. It is also a source of gulfs of execution, as it is not always clear, when speaking to a recognition engine, what it was trained on and therefore what enunciation might be necessary to get it to properly recognize a word or phrase.
Of course, even accounting for diversity, technologies for recognizing speech are insufficient on their own to be useful. Early HCI research explored how to translate graphical user interfaces into speech-based interfaces to make GUIs accessible to people who were blind or low vision 16.
Voice can be useful far beyond giving simple commands. Researchers have explored its use in multi-modal contexts in which a user both points and gives speech commands to control complex interfaces 2,23. Others have built natural language interfaces for analyzing and visualizing data 23,5, hands-free voice-driven drawing applications for people with motor impairments 8, and have even studied the experiential qualities of whispering to voice assistants 18.
Of course, in all of these voice interactions, there are fundamental gulfs of execution and evaluation. Early work on speech interfaces 25 identified many of these issues, from discovering what a system can understand to recovering from recognition errors.
Gaze
Other useful muscles are those in our eyes that control our gaze. Gaze is inherently social information: it refers to inferences about where someone is looking based on the position of their pupils relative to targets in a space. Gaze tracking technology usually involves machine vision techniques, using infrared illumination of the pupils, then time series analysis and individual calibration to track pupil movements over time. Gaze has many useful applications in settings like virtual reality. For example, in a game, gaze detection might allow non-player characters to notice when you are looking at them and respond to you. In the same way, gaze information might be propagated to a player's avatar, moving its eyes as the player's eyes move, helping other human players in social and collaborative games know when a teammate is looking their way.
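As a simple illustration of that last example, here is a hypothetical Python sketch of how a game might decide that a player is looking at a non-player character, given an estimated eye position and gaze direction from an eye tracker. The 5-degree cone is an illustrative assumption, and a real engine would also check for occlusion.

```python
import numpy as np

def is_looking_at(eye_pos, gaze_dir, target_pos, max_angle_deg=5.0):
    """Return True if the gaze ray points within max_angle_deg of the target."""
    to_target = np.asarray(target_pos, dtype=float) - np.asarray(eye_pos, dtype=float)
    to_target /= np.linalg.norm(to_target)
    gaze = np.asarray(gaze_dir, dtype=float)
    gaze /= np.linalg.norm(gaze)
    # Compare the angle between the gaze direction and the direction to the target.
    return float(np.dot(gaze, to_target)) >= np.cos(np.radians(max_angle_deg))

# The player stands at the origin looking down the z axis; one NPC is nearly on-axis,
# the other well off to the side.
print(is_looking_at(eye_pos=[0, 1.7, 0], gaze_dir=[0, 0, 1], target_pos=[0.1, 1.7, 5]))  # True
print(is_looking_at(eye_pos=[0, 1.7, 0], gaze_dir=[0, 0, 1], target_pos=[3.0, 1.7, 5]))  # False
```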
Gaze recognition techniques are similar in concept to speech recognition in that they rely on machine learning and large data sets. However, rather than training on speech utterances, these techniques use data sets of eye images to detect and track the movement of pupils over time. The quality of this tracking depends heavily on the quality of cameras, as pupils are small and their movements are even smaller. Pupil movement is also quite fast, so cameras need to record at a high frame rate to monitor it. Eyes also have very particular movements, such as saccades, which are ballistic motions that abruptly shift from one point of fixation to another. Most techniques overcome the challenges imposed by these dynamic properties of eye movement by aggregating and averaging movement over time and using that as input.
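One common form of that aggregation is grouping raw gaze samples into fixations and discarding the saccades in between. The sketch below shows a simplified, dispersion-based approach in Python; the dispersion and duration thresholds are illustrative assumptions, and production trackers use more careful variants of this idea.

```python
import numpy as np

def detect_fixations(gaze_points, timestamps, max_dispersion=0.02, min_duration=0.1):
    """Group raw gaze samples into fixations, skipping the saccades in between.

    gaze_points: (N, 2) array of gaze positions in normalized screen coordinates.
    timestamps:  (N,) array of sample times in seconds.
    Returns a list of (start_time, end_time, centroid) tuples.
    """
    fixations, start = [], 0
    while start < len(gaze_points):
        i = start
        while i < len(gaze_points):
            window = gaze_points[start:i + 1]
            # Dispersion: total spread of the samples currently in the window.
            dispersion = ((window[:, 0].max() - window[:, 0].min()) +
                          (window[:, 1].max() - window[:, 1].min()))
            if dispersion > max_dispersion:
                break
            i += 1
        # The window [start, i) stayed within the dispersion limit.
        duration = timestamps[i - 1] - timestamps[start]
        if i - start > 1 and duration >= min_duration:
            fixations.append((timestamps[start], timestamps[i - 1],
                              gaze_points[start:i].mean(axis=0)))
        start = max(i, start + 1)
    return fixations

# Example: 200 Hz samples that dwell near one point, jump, then dwell near another.
t = np.arange(0, 1.0, 0.005)
points = np.where(t[:, None] < 0.5, [0.5, 0.5], [0.8, 0.2]) + np.random.randn(len(t), 2) * 0.001
print(detect_fixations(points, t))   # roughly two fixations, one per dwell
```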
Researchers have exploited gaze detection as a form of hands-free interaction, usually discriminating between looking and acting with dwell times: if someone looks at something long enough, it is interpreted as an interaction (sketched in the code below). For example, some work has used dwell times at the top and bottom of a window to support gaze-controlled scrolling through content 11. Other work has explored gaze-only alternatives to mouse clicking 13, accurate gaze estimation and seamless gaze interaction across multiple displays 12, and combinations of gaze and touch on tablets 20.
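Here is a minimal Python sketch of a dwell-time selector. The 0.8-second dwell threshold and the idea of firing once per continuous dwell are illustrative assumptions, not parameters from the cited systems.

```python
class DwellSelector:
    """Fire a target once gaze has rested on it continuously for dwell_time seconds."""

    def __init__(self, dwell_time=0.8):
        self.dwell_time = dwell_time
        self.current_target = None
        self.enter_time = None
        self.fired = False

    def update(self, target, now):
        """Call with every gaze sample; returns a target id when a dwell completes."""
        if target != self.current_target:
            # Gaze moved to a different target (or to empty space): restart the timer.
            self.current_target, self.enter_time, self.fired = target, now, False
            return None
        if (target is not None and not self.fired
                and now - self.enter_time >= self.dwell_time):
            self.fired = True   # fire once per continuous dwell
            return target
        return None

# Example: simulated gaze samples every 10 ms resting on a "scroll down" region.
selector = DwellSelector(dwell_time=0.8)
for step in range(120):
    selected = selector.update("scroll_down", now=step * 0.01)
    if selected:
        print(f"dwell selection fired on {selected!r} at t={step * 0.01:.2f}s")
```

A gaze-scrolling interface like the one cited above might register the regions at the top and bottom of a window as targets and scroll while a dwell is in progress.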
Unfortunately, designing for gaze can require sophisticated knowledge of the constraints on human perception. For example, one effort to design a hands-free, gaze-based drawing tool for people with motor disabilities ended up requiring the careful design of commands to avoid frustrating error handling and disambiguation controls 9. Other work has looked beyond the point of fixation altogether, exploring peripheral awareness as a neurological state for human-computer integration 1.
Limbs
Whereas our hands, voices, and eyes are all very specialized for social interaction, the other muscles in our limbs are more for movement. That said, they also offer rich opportunities for interaction. Because muscles are activated using electrical signals, techniques like electromyography (EMG) can be used to sense muscle activity through electrodes placed on the skin. For example, this technique was used to sense motor actions in the forearm, discriminating between different grips to support applications like games 22.
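To illustrate the general shape of such muscle-computer interfaces, here is a toy Python sketch that classifies windows of multi-channel EMG into two grips. The synthetic eight-channel signals, the RMS feature, and the support vector classifier are all illustrative assumptions, not the pipeline from the cited work.

```python
import numpy as np
from sklearn.svm import SVC

def emg_features(window):
    """Per-channel root-mean-square amplitude, a simple and common EMG feature."""
    return np.sqrt(np.mean(window ** 2, axis=0))

def make_windows(signal, window_size=200, hop=100):
    """Slice a (samples, channels) recording into overlapping analysis windows."""
    return [signal[i:i + window_size] for i in range(0, len(signal) - window_size, hop)]

# Synthetic stand-in for two grips recorded from 8 forearm electrodes: each grip
# activates the channels with a different amplitude profile.
rng = np.random.default_rng(0)
profile_a = np.array([1.0, 0.8, 0.2, 0.1, 0.1, 0.2, 0.6, 0.9])
profile_b = np.array([0.2, 0.3, 0.9, 1.0, 0.8, 0.3, 0.1, 0.1])
grip_a = rng.normal(0.0, 1.0, (4000, 8)) * profile_a
grip_b = rng.normal(0.0, 1.0, (4000, 8)) * profile_b

X = np.array([emg_features(w) for grip in (grip_a, grip_b) for w in make_windows(grip)])
y = np.array([label for label, grip in enumerate((grip_a, grip_b)) for _ in make_windows(grip)])

classifier = SVC().fit(X, y)
print(classifier.predict([emg_features(make_windows(grip_b)[0])]))  # most likely [1]
```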
Other ideas have focused on the tongue, sensing its movements with optical sensors and a mouth retainer 21, or building tongue input devices for creating conversations 24. Related work has explored low-cost, localized blowable user interfaces 19 and used wireless signals to enable non-intrusive, flexible detection of facial gestures 6.
While these explorations of limb-based interaction have all been demonstrated to be feasible to a degree, as with many new interactions, they impose many gulfs of execution and evaluation that have not yet been explored:
- How can users learn what a system is trying to recognize about limb movement?
- How much will users be willing to train the machine-learned classifiers used to recognize movement?
- When errors inevitably occur, how can we support people in recovering from them, while also improving classification in the future?
Body
Other applications have moved beyond specific muscles to the entire body and its skeletal structure. The most widely known example of this is the Microsoft Kinect, shown in the image at the beginning of this chapter. The Kinect used a range of cameras and infrared projectors to create a depth map of a room, including the structure and posture of the people in it. Using this depth map, it built basic skeletal models of players, including the estimated position and orientation of arms, legs, and heads. This information was then available in real time for games to use as a source of input (e.g., mapping the skeletal model onto an avatar, or using skeletal gestures to invoke commands).
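For example, a game might turn a skeletal model into a command with a check as simple as the hypothetical Python sketch below, which assumes joints arrive as a dictionary of 3D positions in meters with y pointing up. The joint names and margin are illustrative, not the Kinect SDK's actual API.

```python
def hands_above_head(skeleton, margin=0.05):
    """Return True if either hand joint is above the head joint (y is up, in meters)."""
    head_y = skeleton["head"][1]
    return any(skeleton[joint][1] > head_y + margin
               for joint in ("hand_left", "hand_right"))

# One frame of a (made-up) skeletal model: joint name -> (x, y, z) position in meters.
frame = {
    "head":       (0.00, 1.70, 2.5),
    "hand_left":  (-0.30, 1.10, 2.4),
    "hand_right": (0.25, 1.85, 2.5),
}
print(hands_above_head(frame))   # True: the right hand is raised above the head
```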
But the Kinect was just one technology to emerge from a whole range of explorations of body-based sensing. For example, researchers have explored whole-body gestures, allowing for hands-free interaction. One technique uses the human body as an antenna for sensing, requiring no instrumentation of the environment and only a little instrumentation of the user 4.
Similar techniques sense a person tapping patterns on their body 3 or adapt the skin itself as a soft interface 17. Others sense interaction through off-body static electric fields 15, combine gaze with foot input to navigate zoomable information spaces hands-free 10, and even support spatial interaction with empty hands and without visual feedback 7.
Whole-body interactions pose some unique gulfs of execution and evaluation. For example, a false positive for a hand or limb-based gesture might mean unintentionally invoking a command. We might not expect these to be too common, as we often move our hands and limbs deliberately when we want to communicate. But we use our bodies all the time, moving, walking, and changing our posture. If recognition algorithms aren't tuned to filter out the broad range of things we do with our bodies in everyday life, the contexts in which body-based input can be used may be severely limited.
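One crude way a system might filter out everyday movement is to accept a recognized gesture only when the recognizer is confident and the rest of the body is roughly still. The Python sketch below illustrates the idea; the thresholds and the notion of gating on average joint speed are illustrative assumptions, not a technique from the cited work.

```python
import numpy as np

def is_deliberate(gesture_confidence, joint_velocities,
                  min_confidence=0.9, max_body_speed=0.3):
    """Accept a recognized gesture only if the recognizer is confident and the
    body as a whole is roughly still (i.e., the user is not walking or shifting)."""
    mean_speed = float(np.mean(np.linalg.norm(joint_velocities, axis=1)))
    return gesture_confidence >= min_confidence and mean_speed <= max_body_speed

# Slow joint velocities (standing still) versus fast ones (mid-stride), in m/s.
still = np.full((20, 3), 0.05)
walking = np.full((20, 3), 0.60)
print(is_deliberate(0.95, still))    # True: confident recognition while still
print(is_deliberate(0.95, walking))  # False: the same recognition during everyday motion
```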
This survey of body-based input techniques shows a wide range of possible ways of providing input. However, like any new interface ideas, there are many unresolved questions about how to train people to use them and how to support error recovery. There are also many questions about the contexts in which such human-computer interaction might be socially acceptable. For example, how close do we want to be with computers? Researchers have begun to frame such questions as matters of human-computer integration 14, imagining at least two possible relationships between people and machines:
- Symbiosis might entail humans and digital technology working together, in which software works on our behalf and, in return, we maintain and improve it. For example, the cases above in which computers exhibit agency of their own (e.g., noticing things that we miss and telling us) are a kind of symbiosis.
- Fusion might entail using computers to extend our bodies and bodily experiences. Many of the techniques described above might be considered fusion, in that they expand our physical abilities.
Do you want this future? If so, the maturity of these ideas is not quite sufficient to bring them to market. If not, what kind of alternative visions might you imagine to prevent these futures?
References
- Josh Andres, m.c. Schraefel, Nathan Semertzidis, Brahmi Dwivedi, Yutika C Kulwe, Juerg von Kaenel, Florian 'Floyd' Mueller (2020). Introducing Peripheral Awareness as a Neurological State for Human-Computer Integration. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Richard A. Bolt (1980). "Put-that-there": Voice and gesture at the graphics interface. ACM Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
- Xiang 'Anthony' Chen and Yang Li (2016). Bootstrapping user-defined body tapping recognition with offline-learned probabilistic representation. ACM Symposium on User Interface Software and Technology (UIST).
- Gabe Cohn, Daniel Morris, Shwetak Patel, Desney Tan (2012). Humantenna: using the body as an antenna for real-time whole-body interaction. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Tong Gao, Mira Dontcheva, Eytan Adar, Zhicheng Liu, Karrie G. Karahalios (2015). DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization. ACM Symposium on User Interface Software and Technology (UIST).
- Mayank Goel, Chen Zhao, Ruth Vinisha, Shwetak N. Patel (2015). Tongue-in-Cheek: Using Wireless Signals to Enable Non-Intrusive and Flexible Facial Gestures Detection. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Sean Gustafson, Daniel Bierwirth, Patrick Baudisch (2010). Imaginary interfaces: spatial interaction with empty hands and without visual feedback. ACM Symposium on User Interface Software and Technology (UIST).
- Susumu Harada, Jacob O. Wobbrock, James A. Landay (2007). VoiceDraw: a hands-free voice-driven drawing application for people with motor impairments. ACM SIGACCESS Conference on Computers and Accessibility.
- Anthony J. Hornof and Anna Cavender (2005). EyeDraw: enabling children with severe motor impairments to draw with their eyes. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Konstantin Klamka, Andreas Siegel, Stefan Vogt, Fabian Göbel, Sophie Stellmach, Raimund Dachselt (2015). Look & Pedal: Hands-free Navigation in Zoomable Information Spaces through Gaze-supported Foot Input. ACM International Conference on Multimodal Interaction (ICMI).
- Manu Kumar and Terry Winograd (2007). Gaze-enhanced scrolling techniques. ACM Symposium on User Interface Software and Technology (UIST).
- Christian Lander, Sven Gehring, Antonio Krüger, Sebastian Boring, Andreas Bulling (2015). GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays. ACM Symposium on User Interface Software and Technology (UIST).
- Christof Lutteroth, Moiz Penkar, Gerald Weber (2015). Gaze vs. mouse: A fast and accurate gaze-only click alternative. ACM Symposium on User Interface Software and Technology (UIST).
- Florian Floyd Mueller, Pedro Lopes, Paul Strohmeier, Wendy Ju, Caitlyn Seim, Martin Weigel, Suranga Nanayakkara, Marianna Obrist, Zhuying Li, Joseph Delfa, Jun Nishida, Elizabeth M. Gerber, Dag Svanaes, Jonathan Grudin, Stefan Greuter, Kai Kunze, Thomas Erickson, Steven Greenspan, Masahiko Inami, Joe Marshall, Harald Reiterer, Katrin Wolf, Jochen Meyer, Thecla Schiphorst, Dakuo Wang, and Pattie Maes (2020). Next Steps for Human-Computer Integration. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Adiyan Mujibiya and Jun Rekimoto (2013). Mirage: exploring interaction modalities using off-body static electric field sensing. ACM Symposium on User Interface Software and Technology (UIST).
- Elizabeth D. Mynatt and W. Keith Edwards (1992). Mapping GUIs to auditory interfaces. ACM Symposium on User Interface Software and Technology (UIST).
- Masa Ogata, Yuta Sugiura, Yasutoshi Makino, Masahiko Inami, Michita Imai (2013). SenSkin: adapting skin as a soft interface. ACM Symposium on User Interface Software and Technology (UIST).
- Emmi Parviainen (2020). Experiential Qualities of Whispering with Voice Assistants. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).
- Shwetak N. Patel and Gregory D. Abowd (2007). Blui: low-cost localized blowable user interfaces. ACM Symposium on User Interface Software and Technology (UIST).
- Ken Pfeuffer and Hans Gellersen (2016). Gaze and Touch Interaction on Tablets. ACM Symposium on User Interface Software and Technology (UIST).
- T. Scott Saponas, Daniel Kelly, Babak A. Parviz, and Desney S. Tan (2009). Optically sensing tongue gestures for computer input. ACM Symposium on User Interface Software and Technology (UIST).
- T. Scott Saponas, Desney S. Tan, Dan Morris, Ravin Balakrishnan, Jim Turner, and James A. Landay (2009). Enabling always-available input with muscle-computer interfaces. ACM Symposium on User Interface Software and Technology (UIST).
- Vidya Setlur, Sarah E. Battersby, Melanie Tory, Rich Gossweiler, Angel X. Chang (2016). Eviza: A Natural Language Interface for Visual Analysis. ACM Symposium on User Interface Software and Technology (UIST).
- Ronit Slyper, Jill Lehman, Jodi Forlizzi, Jessica Hodgins (2011). A tongue input device for creating conversations. ACM Symposium on User Interface Software and Technology (UIST).
- Nicole Yankelovich, Gina-Anne Levow, Matt Marx (1995). Designing SpeechActs: issues in speech user interfaces. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI).