
A man and a woman playing an XBox Kinect game with their bodies. Credit: Sergey Galyonkin, Flickr

Bodies

Andrew J. Ko

In the last chapter, I introduced Weiser's vision of ubiquitous computing, and argued that part of fulfilling it was fully exploiting the versatility of our hands. In this chapter, we explore the sources of human action other than our hands and the range of ways this action has been channeled as input to computers. As we shall see, many of the same gulfs of execution and evaluation that arise with hand-based input also arise with body-based input.

Part of the reason is that all body-based input is based on probabilistically recognizing the actions of muscles. When we speak or make other sounds, we use the muscles in our throat, mouth, and face. When we look with our eyes, we use muscles to shift our gaze, to blink, to squint, and to keep our eyes closed. We use our muscles to move and arrange our limbs. Perhaps the only action we perform that isn't driven by muscles is thought. Our brain drives our muscles through electrical signals, which can be detected at the skin through techniques such as electromyography (EMG); the brain's own electrical activity can similarly be detected at the scalp through electroencephalography (EEG). All of these forms of action are central to our existence in a physical world. Researchers have explored how to leverage these muscle-based signals to act in virtual worlds.

Speech

Aside from our hands, one of the most versatile ways we use our muscles to engage with our environment is our voice. We use it to speak, to hum, to sing, and to make other non-verbal sounds to communicate with humans and other living things. Why not computers? Voice-based interactions have been a dream for decades, long imagined in science fiction. Only after decades of progress on speech recognition, spanning early work at Bell Labs in the 1950s to continued research in academia today, did voice interfaces become reliable enough for interaction. Before the ubiquitous digital voice assistants on smartphones and smart speakers, there were speech recognition programs that dictated text and phone-based interfaces that listened to basic commands and numbers.

Early HCI research explored how to translate graphical user interfaces into speech-based interfaces to make GUIs accessible to people who were blind or low vision (Mynatt and Edwards 1992). The project that finally pushed voice interactions into the mainstream was the DARPA-funded CALO research project, which, as described by its acronym, "Cognitive Assistant that Learns and Organizes," sought to create intelligent digital assistants using the latest advances in artificial intelligence. One branch of this project at SRI International spun out some of the speech recognition technology, building upon decades of advances, including deep learning research from the late 1990s, to support real-time speech command recognition. The eventual result was assistants like Siri, Alexa, and OK Google, which offer simple command libraries, much like a menu in a graphical user interface, but accessed with voice instead of a pointing device.
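To make the idea of a "command library" concrete, below is a minimal sketch of how a recognized utterance might be matched against a small set of command patterns, much as a menu constrains choices in a GUI. The patterns, command names, and the interpret function are illustrative assumptions, not the grammar of any real assistant; real assistants use statistical language understanding rather than regular expressions.

import re
from typing import Dict, Optional, Tuple

# A hypothetical command library: each pattern maps a spoken form to a command name.
COMMANDS: Dict[str, str] = {
    r"set (a |an )?alarm for (?P<time>.+)": "set_alarm",
    r"what('s| is) the weather( like)?( in (?P<city>.+))?": "get_weather",
    r"play (?P<song>.+)": "play_music",
}

def interpret(transcript: str) -> Optional[Tuple[str, dict]]:
    """Map a recognized utterance to a command and its arguments, or None."""
    text = transcript.lower().strip()
    for pattern, command in COMMANDS.items():
        match = re.fullmatch(pattern, text)
        if match:
            # Keep only the named arguments that actually matched.
            args = {name: value for name, value in match.groupdict().items() if value}
            return command, args
    return None  # nothing matched; the assistant must help the user recover

print(interpret("set an alarm for 7 am"))   # ('set_alarm', {'time': '7 am'})
print(interpret("play purple rain"))        # ('play_music', {'song': 'purple rain'})

Like a menu, the library makes the space of possible commands small for the system, but unlike a menu, that space is invisible to the user, which is one source of the gulfs of execution discussed below.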

Voice is useful far beyond giving simple commands. Researchers have explored its use in multi-modal contexts in which a user both points and gives speech commands to control complex interfaces (Bolt 1980, Setlur et al. 2016). Some have applied voice interactions to support more rapid data exploration by using natural language queries and "ambiguity widgets" to show ranges of possible queries (Gao et al. 2015). Others have used non-verbal voice humming to offer hands-free interaction with creative tools, like drawing applications (Harada et al. 2007).

Of course, in all of these voice interactions, there are fundamental gulfs of execution and evaluation. Early work on speech interfaces (Yankelovich et al. 1995) found that users expected conversational fluency from machines, that they struggled to overcome recognition errors, and that the design of commands for GUIs often did not translate well to speech interfaces. They also found that speech itself poses cognitive demands on working memory, as users must form correct commands, retain the state of the machine, and anticipate the behavior of the conversational agent, all of which further aggravates the difficulty of recovering from recognition errors. All of these gulfs persist in modern speech interfaces, though to a lesser extent as speech recognition has improved.

Gaze

Another useful set of muscles in our bodies are those that control our gaze. Gaze is inherently social information: it refers to inferences about where someone is looking based on the position of their pupils relative to targets in a space. Gaze tracking technology usually involves machine vision techniques, using infrared illumination of the pupils, then time series analysis and individual calibration to track pupil movements over time. These techniques have useful applications in settings like virtual reality.
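As a concrete illustration of the calibration step, here is a minimal sketch that fits a per-user mapping from pupil coordinates to screen coordinates using a handful of known calibration points. The affine model and function names are simplifying assumptions for illustration; commercial trackers use richer models of eye geometry and corneal reflections.

import numpy as np

def fit_calibration(pupil_xy: np.ndarray, screen_xy: np.ndarray) -> np.ndarray:
    """Fit a mapping from pupil coordinates to screen coordinates.
    Both arguments are (n_points, 2) arrays gathered while the user
    fixates on known on-screen targets during calibration."""
    # Augment each pupil sample with a constant term: [x, y, 1].
    A = np.hstack([pupil_xy, np.ones((len(pupil_xy), 1))])
    # Solve A @ M approximately equal to screen_xy in the least-squares sense; M is (3, 2).
    M, *_ = np.linalg.lstsq(A, screen_xy, rcond=None)
    return M

def to_screen(M: np.ndarray, pupil_xy: np.ndarray) -> np.ndarray:
    """Map new pupil samples to estimated on-screen gaze points."""
    A = np.hstack([pupil_xy, np.ones((len(pupil_xy), 1))])
    return A @ M

Even with calibration, estimates degrade as the head moves, which is one reason gaze interfaces like those described below tend to target large regions rather than precise points.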

Researchers have exploited gaze detection as a form of hands-free interaction, usually discriminating between looking and acting with dwell times: if someone looks at something long enough, it is interpreted as an interaction. For example, some work has used dwell times at the top and bottom of a window to support gaze-controlled scrolling through content (Kumar and Winograd 2007). Other techniques analyze the targets in an interface (e.g., links in a web browser) and make them controllable through gaze and dwell (Lutteroth et al. 2015). Some researchers have even extended this to multiple displays, making it easier to quickly interact with different devices at a distance (Lander et al. 2015). One way to avoid waiting for dwell times is to combine gaze with other input. For example, some systems have combined gaze and touch, as in the video below, using gaze for pointing and touch for action (Pfeuffer and Gellersen 2016):

Combining gaze and touch for multimodal interactions.
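To make the dwell-time idea above concrete, here is a minimal sketch of a dwell detector, assuming a gaze tracker that streams timestamped samples already hit-tested against interface targets. The class name, threshold, and streaming interface are illustrative assumptions, not the design of any of the systems cited above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DwellDetector:
    """Treat sustained gaze on a single target as a selection."""
    dwell_threshold: float = 0.8          # seconds of sustained gaze required
    current_target: Optional[str] = None  # target the user is currently looking at
    dwell_start: Optional[float] = None   # when gaze arrived on that target
    fired: bool = False                   # whether we already selected it

    def update(self, target: Optional[str], timestamp: float) -> Optional[str]:
        """Feed one gaze sample; returns a target name when a selection occurs."""
        # Gaze moved to a different target (or off all targets): restart the timer.
        if target != self.current_target:
            self.current_target = target
            self.dwell_start = timestamp
            self.fired = False
            return None
        # Gaze has stayed on the same target long enough, and we haven't fired yet.
        if (target is not None and not self.fired
                and timestamp - self.dwell_start >= self.dwell_threshold):
            self.fired = True  # fire at most once per continuous fixation
            return target
        return None

The threshold embodies the core trade-off of dwell-based interaction: shorter dwells feel faster but misinterpret ordinary looking as action (the "Midas touch" problem), while longer dwells force the user to wait.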

Unfortunately, designing for gaze can require sophisticated knowledge of constraints on human perception. For example, one effort to design a hands-free, gaze-based drawing tool for people with motor disabilities ended up requiring careful design of commands to avoid frustrating error handling and disambiguation controls (Hornof and Cavender 2005). As with other forms of recognition-based input, without significant attention to supporting recovery from recognition errors, people will face challenges trying to perform the actions they want efficiently and correctly.

Limbs

Whereas our hands, voices, and eyes are all highly specialized for social interaction, the other muscles in our limbs are used mostly for movement. That said, they also offer rich opportunities for interaction. Because muscles are activated using electrical signals, techniques like electromyography (EMG) can be used to sense muscle activity through electrodes placed on the skin. For example, this technique was used to sense motor actions in the forearm, discriminating between different grips to support applications like "air guitar hero" (Saponas et al. 2009):

Sensing forearm grip.
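The general pattern behind such muscle-sensing interfaces is to window the multi-channel EMG signal, summarize each window with simple amplitude features, and train a classifier per user. The sketch below illustrates that pattern only; the features, classifier, and rejection threshold are assumptions for illustration, not the pipeline Saponas et al. actually used.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def emg_features(window: np.ndarray) -> np.ndarray:
    """Summarize one window of multi-channel EMG (samples x channels)."""
    rms = np.sqrt(np.mean(window ** 2, axis=0))  # root-mean-square amplitude per channel
    mav = np.mean(np.abs(window), axis=0)        # mean absolute value per channel
    return np.concatenate([rms, mav])

def train_grip_classifier(windows, labels):
    """Train on labeled windows from a short per-user calibration session,
    e.g. the user repeatedly performing each grip while EMG is recorded."""
    X = np.array([emg_features(w) for w in windows])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(X, labels)
    return clf

def classify_grip(clf, window: np.ndarray, reject_below: float = 0.6):
    """Classify a new window, rejecting low-confidence predictions so that
    incidental arm movement is less likely to trigger a spurious command."""
    probs = clf.predict_proba(emg_features(window).reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return clf.classes_[best] if probs[best] >= reject_below else None

The need for per-user calibration is itself an interaction design problem: recognition quality depends on how much labeled data users are willing to provide, a question this section returns to below.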

Other ideas focused on the movement of the tongue muscles using optical sensors and a mouth retainer (Saponas et al. 2009) and tongue joysticks for menu navigation (Slyper et al. 2011). Some focused on the breath, localizing where a person was blowing at a microphone installed on a laptop or computer screen (Patel and Abowd 2007). Others have defined entire gesture sets for tongue movement that can be detected non-invasively through wireless X-band Doppler movement detection (Goel et al. 2015).

While these explorations of limb-based interactions have all been demonstrated to be feasible to some degree, as with many new interactions, they impose many gulfs of execution and evaluation that have not yet been explored. How can users learn what a system is trying to recognize about limb movement? How much will users be willing to train the machine-learned classifiers used to recognize movement? And when errors inevitably occur, how can we support people in recovering from them, while also improving classification in the future?

Body

Other applications have moved beyond specific muscles to the entire body and its skeletal structure. For example, researchers have explored whole-body gestures, allowing for hands-free interactions. One technique uses the human body as an antenna for sensing, requiring no instrumentation of the environment and only a little instrumentation of the user (Cohn et al. 2012):

Humantenna's whole body gestures.

Similar techniques sense a person tapping patterns on their body (Chen and Li 2016), sense skin deformation through a band that also provides tactile feedback (Ogata et al. 2013), and detect entire human bodies through electric field distortion, requiring no instrumentation of the user (Mujibiya and Rekimoto 2013). Other techniques coordinate multiple parts of the body, such as gaze detection and foot movement (Klamka et al. 2015). Even more extreme ideas include entirely imaginary interfaces, in which users perform spatial gestures with no device and no visual feedback, relying on machine vision techniques to map movement to action (Gustafson et al. 2010).

Whole-body interactions pose some unique gulfs of execution and evaluation. For example, a false positive for a hand or limb-based gesture might mean unintentionally invoking a command. We might not expect these to be too common, as we often make such deliberate limb movements only when we want to communicate. But we use our bodies all the time, moving, walking, and changing our posture. If recognition algorithms aren't tuned to filter out the broad range of things we do with our bodies in everyday life, the contexts in which body-based input can be used might be severely limited.
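One common mitigation, sketched below under assumed names and parameters, is to accept a whole-body gesture only when the recognizer reports the same label across several consecutive frames, so that ordinary movement such as walking or shifting posture is less likely to trigger a command.

from collections import deque
from typing import Optional

class GestureDebouncer:
    """Temporal smoothing over a frame-by-frame gesture recognizer."""

    def __init__(self, required_frames: int = 10, rest_label: str = "none"):
        self.required_frames = required_frames
        self.rest_label = rest_label
        self.recent = deque(maxlen=required_frames)

    def update(self, label: str) -> Optional[str]:
        """Feed the recognizer's per-frame output; returns a gesture only when
        the same non-rest label has been seen for required_frames in a row."""
        self.recent.append(label)
        if (len(self.recent) == self.required_frames
                and len(set(self.recent)) == 1
                and label != self.rest_label):
            self.recent.clear()  # avoid re-firing on the same sustained gesture
            return label
        return None

Smoothing like this reduces false positives at the cost of added latency and missed quick gestures, exactly the kind of trade-off that has to be evaluated in the everyday contexts described above.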


This survey of body-based input techniques shows a wide range of possible ways of providing input. However, like any new interface ideas, there are many unresolved questions about how to train people to use them and how to support error recovery. There are also many questions about the contexts in which such human-computer interaction might be socially acceptable. The maturity of these ideas is therefore very much at the level of feasibility demonstrations, not full visions of application design and operating system support.

Next chapter: 2D visual output

Further reading

Richard A. Bolt. 1980. "Put-that-there": Voice and gesture at the graphics interface. In Proceedings of the 7th annual conference on Computer graphics and interactive techniques (SIGGRAPH '80). ACM, New York, NY, USA, 262-270.

Xiang 'Anthony' Chen and Yang Li. 2016. Bootstrapping User-Defined Body Tapping Recognition with Offline-Learned Probabilistic Representation. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST '16). ACM, New York, NY, USA, 359-364.

Gabe Cohn, Daniel Morris, Shwetak Patel, and Desney Tan. 2012. Humantenna: using the body as an antenna for real-time whole-body interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 1901-1910.

Tong Gao, Mira Dontcheva, Eytan Adar, Zhicheng Liu, and Karrie G. Karahalios. 2015. DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology (UIST '15). ACM, New York, NY, USA, 489-500.

Mayank Goel, Chen Zhao, Ruth Vinisha, and Shwetak N. Patel. 2015. Tongue-in-Cheek: Using Wireless Signals to Enable Non-Intrusive and Flexible Facial Gestures Detection. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI '15). ACM, New York, NY, USA, 255-258.

Sean Gustafson, Daniel Bierwirth, and Patrick Baudisch. 2010. Imaginary interfaces: spatial interaction with empty hands and without visual feedback. In Proceedings of the 23rd annual ACM symposium on User interface software and technology (UIST '10). ACM, New York, NY, USA, 3-12.

Susumu Harada, Jacob O. Wobbrock, and James A. Landay. 2007. Voicedraw: a hands-free voice-driven drawing application for people with motor impairments. In Proceedings of the 9th international ACM SIGACCESS conference on Computers and accessibility (Assets '07). ACM, New York, NY, USA, 27-34.

Anthony J. Hornof and Anna Cavender. 2005. EyeDraw: enabling children with severe motor impairments to draw with their eyes. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '05). ACM, New York, NY, USA, 161-170.

Konstantin Klamka, Andreas Siegel, Stefan Vogt, Fabian Göbel, Sophie Stellmach, and Raimund Dachselt. 2015. Look & Pedal: Hands-free Navigation in Zoomable Information Spaces through Gaze-supported Foot Input. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15). ACM, New York, NY, USA, 123-130.

Manu Kumar and Terry Winograd. 2007. Gaze-enhanced scrolling techniques. In Proceedings of the 20th annual ACM symposium on User interface software and technology (UIST '07). ACM, New York, NY, USA, 213-216.

Christian Lander, Sven Gehring, Antonio Krüger, Sebastian Boring, and Andreas Bulling. 2015. GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology (UIST '15). ACM, New York, NY, USA, 395-404.

Christof Lutteroth, Moiz Penkar, and Gerald Weber. 2015. Gaze vs. Mouse: A Fast and Accurate Gaze-Only Click Alternative. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology (UIST '15). ACM, New York, NY, USA, 385-394.

Elizabeth D. Mynatt and W. Keith Edwards. 1992. Mapping GUIs to auditory interfaces. In Proceedings of the 5th annual ACM symposium on User interface software and technology (UIST '92). ACM, New York, NY, USA, 61-70.

Masa Ogata, Yuta Sugiura, Yasutoshi Makino, Masahiko Inami, and Michita Imai. 2013. SenSkin: adapting skin as a soft interface. In Proceedings of the 26th annual ACM symposium on User interface software and technology (UIST '13). ACM, New York, NY, USA, 539-544.

Shwetak N. Patel and Gregory D. Abowd. 2007. Blui: low-cost localized blowable user interfaces. In Proceedings of the 20th annual ACM symposium on User interface software and technology (UIST '07). ACM, New York, NY, USA, 217-220.

Ken Pfeuffer and Hans Gellersen. 2016. Gaze and Touch Interaction on Tablets. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST '16). ACM, New York, NY, USA, 301-311.

T. Scott Saponas, Daniel Kelly, Babak A. Parviz, and Desney S. Tan. 2009. Optically sensing tongue gestures for computer input. In Proceedings of the 22nd annual ACM symposium on User interface software and technology (UIST '09). ACM, New York, NY, USA, 177-180.

T. Scott Saponas, Desney S. Tan, Dan Morris, Ravin Balakrishnan, Jim Turner, and James A. Landay. 2009. Enabling always-available input with muscle-computer interfaces. In Proceedings of the 22nd annual ACM symposium on User interface software and technology (UIST '09). ACM, New York, NY, USA, 167-176.

Vidya Setlur, Sarah E. Battersby, Melanie Tory, Rich Gossweiler, and Angel X. Chang. 2016. Eviza: A Natural Language Interface for Visual Analysis. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST '16). ACM, New York, NY, USA, 365-377.

Ronit Slyper, Jill Lehman, Jodi Forlizzi, and Jessica Hodgins. 2011. A tongue input device for creating conversations. In Proceedings of the 24th annual ACM symposium on User interface software and technology (UIST '11). ACM, New York, NY, USA, 117-126.

Nicole Yankelovich, Gina-Anne Levow, and Matt Marx. 1995. Designing SpeechActs: issues in speech user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '95), Irvin R. Katz, Robert Mack, Linn Marks, Mary Beth Rosson, and Jakob Nielsen (Eds.). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 369-376.