Monitoring

The first application I ever wrote was a complete and utter failure.

I was an eager eighth grader, full of wonder and excitement about the infinite possibilities in code, with an insatiable desire to build, build, build. I’d made plenty of little games and widgets for myself, but now was my chance to create something for someone else: my friend and I were making a game and he needed a tool to create pixel art for it. We had no money for fancy Adobe licenses, and so I decided to make a tool.

In designing the app, I made every imaginable software engineering mistake. I didn’t talk to him about requirements. I didn’t test on his computer before sending the finished app. I certainly didn’t conduct any usability tests, performance tests, or acceptance tests. The app I ended up shipping was a pure expression of what I wanted to build, not what he needed to be creative or productive. As a result, it was buggy, slow, confusing, and useless, and blinded by my joy of coding, I had no clue.

Now, ideally my “customer” would have reported any of these problems to me right away, and I would have learned some tough lessons about software engineering. But this customer was my best friend, and also a very nice guy. He wasn’t about to trash all of my hard work. Instead, he suffered in silence. He struggled to install, struggled to use, and worst of all struggled to create. He produced some amazing art a few weeks after I gave him the app, but it was only after a few months of progress on our game that I learned he hadn’t used my app for a single asset, preferring instead to suffer through Microsoft Paint. My app was too buggy, too slow, and too confusing to be useful. I was devastated.

Why didn’t I know it was such a complete failure? Because I wasn’t looking . I’d ignored the ultimate test suite: my customer . I’d learned that the only way to really know whether software requirements are right is by watching how it executes in the world through monitoring ¹³

James Turnbull (2016). The art of monitoring with James Turnbull. Software Engineering Daily Podcast.

Of course, this is easier said than done. That’s because the (ideally) massive numbers of people executing your software is not easily observable ¹¹

Tim Menzies, Tom Zimmermann (2013). Software analytics: so what?. IEEE Software.

. Moreover, each software quality you might want to monitor (performance, functional correctness, usability) requires entirely different methods of observation and analysis. Let’s talk about some of the most important qualities to monitor and how to monitor them.

These are some of the easiest failures to detect because they are overt and unambiguous. Microsoft was one of the first organizations to do this comprehensively, building what eventually became known as Windows Error Reporting ⁷

Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt (2009). Debugging in the (very) large: ten years of implementation and experience. ACM SIGOPS Symposium on Operating Systems Principles (SOSP).

. It turns out that actually capturing these errors at scale and mining them for repeating, reproducible failures is quite complex, requiring classification, progressive data collection, and many statistical techniques to extract signal from noise. In fact, Microsoft has a dedicated team of data scientists and engineers whose sole job is to manage the error reporting infrastructure, monitor and triage incoming errors, and use trends in errors to make decisions about improvements to future releases and release processes. This is now standard practice in most companies and organizations, including other big software companies (Google, Apple, IBM, etc.), as well as open source projects (eg, Mozilla). In fact, many application development platforms now include this as a standard operating system feature.

Performance, like crashes, kernel panics, and hangs, is easily observable in software, but a bit trickier to characterize as good or bad. How slow is too slow? How bad is it if something is slow occasionally? You’ll have to define acceptable thresholds for different use cases to be able to identify problems automatically. Some experts in industry ⁸

Andi Grabner (2016). Performance monitoring with Andi Grabner. Software Engineering Daily Podcast.

still view this as an art.

It’s also hard to monitor performance without actually harming performance. Many tools and web services (e.g., New Relic ) are getting better at reducing this overhead and offering real time data about performance problems through sampling.

Monitoring for data breaches, identity theft, and other security and privacy concerns are incredibly important parts of running a service, but also very challenging. This is partly because the tools for doing this monitoring are not yet well integrated, requiring each team to develop its own practices and monitoring infrastructure. But it’s also because protecting data and identity is more than just detecting and blocking malicious payloads. It’s also about recovering from ones that get through, developing reliable data streams about application network activity, monitoring for anomalies and trends in those streams, and developing practices for tracking and responding to warnings that your monitoring system might generate. Researchers are still actively inventing more scalable, usable, and deployable techniques for all of these activities.

The biggest limitation of the monitoring above is that it only reveals what people are doing with your software, not why they are doing it, or why it has failed. Monitoring can help you know that a problem exists, but it can’t tell you why a program failed or why a person failed to use your software successfully.

Usability problems and missing features, unlike some of the preceding problems, are even harder to detect or observe, because the only true indicator that something is hard to use is in a user’s mind. That said, there are a couple of approaches to detecting the possibility of usability problems.

One is by monitoring application usage. Assuming your users will tolerate being watched, there are many techniques: 1) automatically instrumenting applications for user interaction events, 2) mining events for problematic patterns, and 3) browsing and analyzing patterns for more subjective issues ⁹

Melody Y. Ivory, Marti A. Hearst (2001). The state of the art in automating usability evaluation of user interfaces. ACM Computing Surveys.

. Modern tools and services like make it easier to capture, store, and analyze this usage data, although they still require you to have some upfront intuition about what to monitor. More advanced, experimental techniques in research automatically analyze undo events as indicators of usability problems ¹ . This work observes that undo is often an indicator of a mistake in creative software, and mistakes are often indicators of usability problems.

All of the usage data above can tell you what your users are doing, but not why . For this, you’ll need to get explicit feedback from support tickets, support forums, product reviews, and other critiques of user experience. Some of these types of reports go directly to engineering teams, becoming part of bug reporting systems, while others end up in customer service or marketing departments. While all of this data is valuable for monitoring user experience, most companies still do a bad job of using anything but bug reports to improve user experience, overlooking the rich insights in customer service interactions ⁵

Parmit K. Chilana, Amy J. Ko, Jacob O. Wobbrock, Tovi Grossman, and George Fitzmaurice (2011). Post-deployment usability: a survey of current practices. ACM SIGCHI Conference on Human Factors in Computing (CHI).

Although bug reports are widely used, they have significant problems as a way to monitor: for developers to fix a problem, they need detailed steps to reproduce the problem, or stack traces or other state to help them track down the cause of a problem ⁴

Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, and Thomas Zimmermann (2008). What makes a good bug report?. ACM SIGSOFT Foundations of Software Engineering (FSE).

; these are precisely the kinds of information that are hard for users to find and submit, given that most people aren’t trained to produce reliable, precise information for failure reproduction. Additionally, once the information is recorded in a bug report, even interpreting the information requires social, organizational, and technical knowledge, meaning that if a problem is not addressed soon, an organization’s ability to even interpret what the failure was and what caused it can decay over time ² . All of these issues can lead to intractable debugging challenges ¹² .

Larger software organizations now employ data scientists to help mitigate these challenges of analyzing and maintaining monitoring data and bug reports. Most of them try to answer questions such as ³

Andy Begel, Thomas Zimmermann (2014). Analyze this! 145 questions for data scientists in software engineering. ACM/IEEE International Conference on Software Engineering.

“How do users typically use my application?”
“What parts of a software product are most used and/or loved by customers?”
“What are best key performance indicators (KPIs) for monitoring services?”
“What are the common patterns of execution in my application?”
“How well does test coverage correspond to actual code usage by our customers?”

The most mature data science roles in software engineering teams even have multiple distinct roles, including Insight Providers , who gather and analyze data to inform decisions, Modeling Specialists , who use their machine learning expertise to build predictive models, Platform Builders , who create the infrastructure necessary for gathering data ¹⁰

Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel (2016). The emerging role of data scientists on software development teams. ACM/IEEE International Conference on Software Engineering.

. Of course, smaller organizations may have individuals who take on all of these roles. Moreover, not all ways of discovering missing requirements are data science roles. Many companies, for example, have customer experience specialists and community managers, who are less interested in data about experiences and more interested in directly communicating with customers about their experiences. These relational forms of monitoring can be much more effective at revealing software quality issues that aren’t as easily observed, such as issues of racial or sexual bias in software or other forms of structural injustices built into the architecture of software.

All of this effort to capture and maintain user feedback can be messy to analyze because it usually comes in the form of natural language text. Services like AnswerDash (a company I co-founded) structure this data by organizing requests around frequently asked questions. AnswerDash imposes a little widget on every page in a web application, making it easy for users to submit questions and find answers to previously asked questions. This generates data about the features and use cases that are leading to the most confusion, which types of users are having this confusion, and where in an application the confusion is happening most frequently. This product was based on several years of research in my lab ⁶

Parmit K. Chilana, Amy J. Ko, Jacob O. Wobbrock, Tovi Grossman (2013). A multi-site field study of crowdsourced contextual help: usage and perspectives of end users and software teams. ACM SIGCHI Conference on Human Factors in Computing (CHI).

References

David Akers, Matthew Simpson, Robin Jeffries, and Terry Winograd (2009). Undo and erase events as indicators of usability problems. ACM SIGCHI Conference on Human Factors in Computing (CHI).
Jorge Aranda and Gina Venolia (2009). The secret life of bugs: Going past the errors and omissions in software repositories. ACM/IEEE International Conference on Software Engineering.
Andy Begel, Thomas Zimmermann (2014). Analyze this! 145 questions for data scientists in software engineering. ACM/IEEE International Conference on Software Engineering.
Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, and Thomas Zimmermann (2008). What makes a good bug report?. ACM SIGSOFT Foundations of Software Engineering (FSE).
Parmit K. Chilana, Amy J. Ko, Jacob O. Wobbrock, Tovi Grossman, and George Fitzmaurice (2011). Post-deployment usability: a survey of current practices. ACM SIGCHI Conference on Human Factors in Computing (CHI).
Parmit K. Chilana, Amy J. Ko, Jacob O. Wobbrock, Tovi Grossman (2013). A multi-site field study of crowdsourced contextual help: usage and perspectives of end users and software teams. ACM SIGCHI Conference on Human Factors in Computing (CHI).
Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt (2009). Debugging in the (very) large: ten years of implementation and experience. ACM SIGOPS Symposium on Operating Systems Principles (SOSP).
Andi Grabner (2016). Performance monitoring with Andi Grabner. Software Engineering Daily Podcast.
Melody Y. Ivory, Marti A. Hearst (2001). The state of the art in automating usability evaluation of user interfaces. ACM Computing Surveys.
Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel (2016). The emerging role of data scientists on software development teams. ACM/IEEE International Conference on Software Engineering.
Tim Menzies, Tom Zimmermann (2013). Software analytics: so what?. IEEE Software.
Haseeb Qureshi (2016). Debugging stories with Haseeb Qureshi. Software Engineering Daily Podcast.
James Turnbull (2016). The art of monitoring with James Turnbull. Software Engineering Daily Podcast.

Monitoring

Discovering Failures

Discovering Missing Requirements

References