Chapter 9 Answering questions: how to write reports

9.1 Answering the questions: writing

After you have found answers to your question, you will in most cases want to make them available to others. There are various media to choose from, but for many applications and tasks, good old-fashioned text is the preferred one. Text has two major advantages over most other media: it is easy to produce, and it lets readers go through it at their own pace, moving back and forth as they need. Just think how easy it is to correct a sentence you wrote in a text, versus a sentence you said in a video. So text is a major product of data science work.

Obviously, there are many other forms of output, e.g. dashboards, graphs and tables, videos, or products, and text does not support many interactive tools well. But in this section we focus on text, in particular on text that gives the reader sufficient information to use your results for decision-making. Many of the considerations here also hold for other media.

Here we discuss the main parts of your report; below we give a few examples.

To make your contribution valuable, you need to give all the information that is needed to evaluate the applicability (external validity) of your results. A major part of this is explaining the dataset you are using from a data integrity perspective.

TBD: introduction

First we have to understand the origin of the data. Who collected the data? How was the dataset collected? For what purpose? Is the sampling scheme known? Answering these questions is necessary to evaluate how good the information is, and whether the results can be generalized outside of the dataset (external validity). Ideally, we would like to use data that were collected to answer exactly the question we are asking, by a reliable and neutral institution, through uniform random sampling. This happens, but unfortunately not that often. Examples include the Gaia survey of a billion stars by ESA, and labor force surveys conducted by statistical offices. Uniform random sampling is hard to achieve, but at least these institutions do their best to come as close as they can, and they document the known deviations and problems.

Unfortunately, we often have to work with data that were collected for a very different purpose, and where the sampling scheme is unknown. For instance, cellphone operators regularly log all the calls and texts in their networks, but how each particular operator's traffic relates to total cellphone communication is unknown. Moreover, cellphones are just one aspect of communication, and we do not know how your contact network in phone communication relates to your overall social network. But even if you do not know the answers, you should at least tell your readers what you do know.

Next, you should provide a description of the dataset, focusing on the variables you use in the analysis. The most important parameters are the number of observations and the variables themselves. How many missing values are there? A large percentage of missing values suggests the data are of low quality and hence your results may be unreliable. What kind of information do the variables contain? For instance, if we are analyzing the relationship between behavior and health, then we want to be sure that we actually see different behaviors and different health statuses in the data. If everyone in the dataset has a healthy lifestyle and is never sick, then we probably cannot tell much about lifestyle and health.
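The kind of description discussed above can be sketched in a few lines of code. The example below is only an illustration: the records, the variable names (`age`, `smoker`, `sick_days`) and the helper `describe` are all hypothetical, and in practice you would probably use a data frame library rather than plain Python. For each variable it reports the number of observations, the share of missing values, and the number of distinct observed values (a crude indicator of whether there is any variation to analyze).

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "smoker": "no", "sick_days": 2},
    {"age": 51, "smoker": "yes", "sick_days": 9},
    {"age": None, "smoker": "no", "sick_days": 0},
    {"age": 27, "smoker": None, "sick_days": 1},
    {"age": 45, "smoker": "no", "sick_days": None},
]


def describe(rows, variables):
    """Summarize each variable: observations, missing share, distinct values.

    Assumes `rows` is a non-empty list of dictionaries.
    """
    n = len(rows)
    summary = {}
    for var in variables:
        # Keep only the values that are actually observed.
        observed = [row.get(var) for row in rows if row.get(var) is not None]
        summary[var] = {
            "n_obs": n,
            "missing_share": (n - len(observed)) / n,
            "n_distinct": len(set(observed)),
        }
    return summary


summary = describe(records, ["age", "smoker", "sick_days"])
for var, stats in summary.items():
    print(var, stats)
```

A variable with `n_distinct` equal to 1 shows no variation at all, which is the "everyone is healthy" situation described above.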

We should also perform some consistency checks: does the data we analyze broadly reflect the population under study? For instance, does the dataset show an age distribution broadly similar to the one known for the population of interest? This is very important if the actual sampling scheme is unknown.
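As a rough sketch of such a consistency check, the code below compares the share of each age group in the dataset with the share in the population of interest. Everything here is made up for illustration: the age groups, the population shares, the sample ages, and the function names `age_group` and `compare_shares`. Real reference shares would come from a source such as a census.

```python
from collections import Counter

# Hypothetical population shares by age group (e.g. taken from a census).
population_share = {"0-17": 0.20, "18-64": 0.60, "65+": 0.20}

# Hypothetical ages observed in the dataset.
sample_ages = [23, 35, 41, 29, 67, 52, 19, 30, 44, 71]


def age_group(age):
    """Map an age in years to one of the (assumed) reference age groups."""
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"


def compare_shares(ages, reference):
    """Return sample share minus population share for each age group."""
    counts = Counter(age_group(a) for a in ages)
    n = len(ages)
    return {group: counts[group] / n - ref for group, ref in reference.items()}


gaps = compare_shares(sample_ages, population_share)
for group, gap in gaps.items():
    print(f"{group}: sample share minus population share = {gap:+.2f}")
```

In this toy sample there are no children at all, so the youngest group is under-represented by 20 percentage points. A gap of that size is worth reporting to the reader, even if you cannot correct for it.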

Your report should state what the analysis focuses on and which questions you are trying to answer. There are typically too many questions to answer, and each one requires a somewhat different analysis. You have to draw the line somewhere and do the most important ones well. Sometimes it is obvious why you picked these questions (e.g. your boss told you so), but other times you need to explain to your readers why what you do is interesting and why they should continue reading.

The next, and quite important, section is methodology. How exactly are you answering the questions based on these data? Typically, this requires some sort of statistical methodology, and a more complex analysis may require a number of statistical tools as well as a number of computational tools. For instance, how did you define outliers, and did you remove them? How exactly did you define the workday? 8am–5pm? 9am–5pm? Did you define it in a similar fashion across time zones? There is a plethora of such decisions to be made, and all of them should be documented, not just for the reader but also for yourself!
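One way to document such decisions is to write them down once, as named constants and small functions, and reuse them throughout the analysis. The sketch below is an illustration under assumptions, not a recommendation: the 9am–5pm, Monday to Friday workday, the 1.5 × IQR outlier fence (one common convention among many), and the names `is_workday_hour` and `outlier_bounds` are all choices made for this example. Time zones are represented as fixed UTC offsets for simplicity, which ignores daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone
from statistics import quantiles

# Assumed analysis decisions, stated explicitly in one place.
WORKDAY_START = time(9, 0)   # local time, inclusive
WORKDAY_END = time(17, 0)    # local time, exclusive
IQR_MULTIPLIER = 1.5         # width of the outlier fence, in IQR units


def is_workday_hour(timestamp_utc, utc_offset_hours):
    """True if the timestamp is Mon-Fri within workday hours, local time."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local = timestamp_utc.astimezone(local_zone)
    return local.weekday() < 5 and WORKDAY_START <= local.time() < WORKDAY_END


def outlier_bounds(values):
    """Return (lower, upper) fences; values outside them count as outliers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - IQR_MULTIPLIER * iqr, q3 + IQR_MULTIPLIER * iqr


# The same instant is inside the workday in one zone and outside in another.
stamp = datetime(2024, 3, 6, 16, 30, tzinfo=timezone.utc)  # a Wednesday
print(is_workday_hour(stamp, 0))   # 16:30 local time
print(is_workday_hour(stamp, 2))   # 18:30 local time

lower, upper = outlier_bounds([1, 2, 3, 4, 5, 6, 7, 100])
print(lower, upper)
```

Because the definitions live in one place, the methodology section of the report can simply state them, and changing a decision later (say, to an 8am start) changes it consistently everywhere.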

Now we finally get to perhaps the most important part: the results. What did you find? Were you able to answer the questions you asked? The results are sometimes combined with the discussion, but the discussion may also be a separate section. We are interested not just in what the answer was, but also in what the answer tells us. Did you find anything interesting, new, unexpected? Do the answers allow us to do something better? This section is also the place to talk about the limitations of your analysis. How may problematic data affect your results? Can you generalize from your results to similar future cases? To similar cases elsewhere?

9.2 Analysis report: Titanic example

Let us write a short and simple analysis of survival on the Titanic. Imagine you are working for a maritime safety board, and they ask you to analyze “who survived the Titanic disaster”.

As the first task, you should understand what the board wants. It is probably not a list of names of survivors: these names are easily available, and not particularly interesting for the board. The board is rather interested in understanding what kind of people were more likely to survive. This might give ideas about how to promote maritime safety in the future. So a more relevant question may be

What types of people, in particular which sex, age and passenger class groups, were more likely to survive?

To begin with, we leave the abstract (a very brief summary) for last. It is usually much easier to summarize something you have already done!

Now we should write some introductory remarks. For the safety board, probably not much introduction is needed, but if your report is also read by a general audience, then a longer introduction may be in order. Here we stay brief:

Titanic was a luxurious ocean liner that sank on its maiden voyage in 1912. It had approximately 1300 passengers and 870 crew on board; of those, 706 people survived and approximately 1500 died. This analysis looks at which passenger groups were more likely to survive. We will analyze sex, age and passenger class.
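Before moving on, it is worth doing a quick sanity check on the figures quoted in such an introduction. The short calculation below uses only the approximate numbers given above; the variable names are ours. It shows that the two ways of counting the people on board do not agree exactly, which is to be expected with rounded figures, and that the overall survival share is roughly one third.

```python
# Figures as quoted in the introduction above; several are approximate.
passengers = 1300
crew = 870
survived = 706
died = 1500

on_board = passengers + crew        # people on board, counted by role
accounted_for = survived + died     # people on board, counted by outcome
survival_share = survived / accounted_for

print(f"passengers + crew: {on_board}")
print(f"survived + died:   {accounted_for}")
print(f"difference:        {accounted_for - on_board}")
print(f"overall survival share: {survival_share:.0%}")
```

A small discrepancy like this one does not invalidate the introduction, but if the report relies on exact counts, it should say which source and which definition the numbers come from.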