Chapter 8 Asking questions and answering with data

One of the main reasons we work with data is that we want data to answer important questions. This is a form of evidence-based policy-making: using real-world evidence to analyze which kinds of decisions give the desired results. Most of the evidence we have is somehow related to data, although some of it may be rather different from what we normally see in datasets.

Below, we discuss some of the issues with questions, with answers, and with the data needed to answer those questions.

8.1 Interesting questions, important questions and answerable questions

Some questions are interesting. For instance, what is the meaning of life is quite an interesting question, but unfortunately one cannot really answer it. It is even unclear what would constitute an answer, and whether it would be useful for making decisions in our daily lives. There are many other questions of a similar type; many “what” and “why” questions fall into this category. For instance, why does an apple fall down if you drop it? is such a question. And maybe a bit surprisingly, we do not have an answer. One may mention gravity, but gravity is not really an answer to a why-question. Gravity is a set of observations and rules that tells us that everything falls down if dropped, and allows us to compute where it falls, and how fast it falls. So it is more like how, not why.

Nowadays we do not need data to compute where and how fast the apple will fall; the formulas of high-school physics will do. But historically this was not so. The theory of gravity was based on experiments (dropping apples and cannon balls), observations (of the Moon and planets), and the related data collection. Later, it allowed scholars to systematize the evidence and show that the data can be predicted using a small set of rather simple mathematical rules–Newton’s theory of gravity.

But most of us are not working on new fundamental natural laws. We may rather try to answer questions related to school, work, weather, whether to buy one or another product, and other questions reflecting our everyday decisions. Some of these questions are quite important while others are much less so. And some are rather easy to answer while others are not.

For instance, if you are concerned about your safety on a dark street, then what you want to know is whether you will reach home safely if you walk that dark street tonight. But if you follow the campus crime reports, those answer the question of whether someone was attacked there yesterday. Sure, an attack yesterday may tell you something about how likely it is that you will be attacked tonight, but it is not the important question in itself. It is just a way toward the answer.

This boils down to a fundamental problem regarding questions and answers: all data, everything we know, is about the past. All our decisions that matter are about the future. We need to find a way for the past to tell us something about the future. In data science this “way” is often called a “model”, although it does not always have to be a formal mathematical model.

Unfortunately, this makes it hard to answer many questions that are important for decision making. In the safety example, the police may have data about past attacks, but no one has data about future ones. In order to predict future attacks we need more than just data (because data is always about the past). We need some sort of plausible model that we can use for predictions: we use past data to tune the model, and then use the model for predictions about the future. But such models are hard to build in complex cases, and human behavior, including crime, is very, very complex. What does a single past attack tell us about future ones? Will the attacker prefer to linger in the same place? Or exactly the opposite–move somewhere else? And how will the police, and other night-time walkers, change their behavior?

Similar examples hold for many other questions. For instance, in the case of weather, we have past weather information in the data, but need a model to predict the weather for tomorrow. Or in the case of a job search: we can know what kinds of jobs others like you found, but you still do not know what you will find.

The questions that are easy to answer tend to be already answered, or sometimes they are not worth answering. In realistic datasets, such questions are often related to data description: questions like how many, how large, and when. It is easy to answer questions about data–the past; it is much harder to answer questions about the world–the future. The previous sections, 6 and 7, were both devoted to such easy answers.

Example 8.1 Let us work through an example with the Titanic data. First we ask the question “How was the survival rate related to the passenger class?” This is an easy question to answer, and we can do it like this:
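A sketch using dplyr’s grouped operations (assuming the data is loaded as a data frame called `titanic`, with `survived` coded 0/1; the name is our choice):

```r
library(dplyr)
## survival rate = share of 'survived == 1' within each travel class
titanic %>%
   group_by(pclass) %>%
   summarize(survival = mean(survived))
```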

## # A tibble: 3 × 2
##   pclass survival
##    <dbl>    <dbl>
## 1      1    0.619
## 2      2    0.430
## 3      3    0.255

(See Section 5.6 for grouped operations, and 7.3.1 for how to compute rates.)

Upper class passengers had access to luxurious facilities. Cafe Verandah on Titanic (from Wikimedia Commons)

We see that the upper classes had a noticeably better survival rate: in the first class it was 62% while in the 3rd class it was only 26%. But is this an interesting and relevant question?

It is definitely interesting if you are excited about Titanic and related topics. But what will this knowledge give us? Can it help us to ensure there will be more survivors in the next shipwreck? Does it mean that you should also prefer an upper-class cabin on your next voyage over the Big Pond? (Or maybe, as these days we tend to fly instead of sail–should I buy a business class ticket for my next flight?)

Unfortunately, the knowledge that first class passengers were more likely to survive will help us little to answer these questions. The knowledge may help, but we need to know more. We want to understand why–what made first class passengers more likely to survive? In the case of Titanic, it boils down to better access to the boat deck and to information, and to the larger percentage of women in the upper classes (women were given priority in boarding the lifeboats). Some of this we can learn from these data (the percentage of women), but for other crucial information (who got to the lifeboats first) we need different data sources.

Only after we have an answer to the why-question can we discuss whether any of it is still relevant today for your next trip.

8.2 General questions and answerable questions

Another reason why many important questions tend to be hard to answer is that they are too general and include concepts that are too unclear and hard to measure.

For instance, one may be tempted to ask “Which vaccine, Moderna or Pfizer, is better?” However, this exact question cannot be answered. The problem is that the word better (i.e. good) has too many meanings. We can do our best and answer a related question instead, but now we may get different results, depending on which related question we actually answer. For instance:

  • which vaccine protects people better against COVID-19 by making them less likely to catch the disease?
  • which vaccine protects people better by lowering the transmission rate of the virus?
  • which vaccine will protect you (or any other particular person) better?
  • which vaccine makes you less ill if you still get the virus?
  • which vaccine is more effective against a particular COVID-19 strain?
  • which vaccine has less harmful side-effects?

One can come up with many more ways to answer the question of “better”.

Here the issue is that the question itself is too vague. “Better” has many meanings, and even if we can measure and collect data about most of these, it is not clear what was the original meaning of the question. People without specific training usually do not think about their questions in enough detail, so we may end up asking many questions that are inherently unanswerable. In data science applications, one may want to team up both with virologists (who have a good idea about how viruses and humans interact) and with someone who can understand what people are asking even if they do not possess a sophisticated professional vocabulary. In terms of most of human knowledge, we are all amateurs.

Another reason why we cannot answer some questions is that the relevant data is either impossible or very hard to collect. For instance, “How much do students socialize?” is answerable through surveys (although “socialize” is not a precise concept). But it may be much easier to get information about social media accounts, so what we may end up answering is “How many friends do you have on Facebook?”.

Exercise 8.1 Consider the following questions:

  • What is the best movie of all time?
  • How old is the universe?
  • Is the world getting more dangerous?

Which ones can you easily answer with data that, realistically, can be collected? How might you change the other questions to make them answerable?


8.3 Example: who survived Titanic wreck?

Now it is time for an example. We take the Titanic data again and ask a question. Thereafter we discuss the question, and answer an answerable form of it.

8.3.1 The question

Consider the question:

Who were the most likely to survive the Titanic disaster?

To begin with–is it an interesting or an important question? It may definitely be an interesting one (depending on your interests). Is it an important one? Not without a few qualifying remarks. The knowledge that Ms. Jones survived but Mr. Smith did not is not very important (unless you are someone close to them). But if we can learn something about why Ms. Jones survived while Mr. Smith did not, then we may be able to tell something about the safety mechanisms, and potentially offer ways to improve those. So learning the names may not be particularly important, but learning the causes may be useful.

Is this question answerable? Again, not without further qualifications. What does “most likely to survive” mean? Every passenger either did or did not survive. And among those who survived, who was the most likely to survive? But the question becomes meaningful if we look at certain groups, certain categories of passengers. For instance: were the old or the young more likely to survive? Men or women? 1st or 3rd class passengers?

So we may re-phrase the question along the lines:

Which passenger group–male or female; 1st, 2nd, or 3rd class; which age category–was most likely to survive the shipwreck?

What information do we need to answer this question? It is fairly obvious: we need the passengers’ sex, travel class, and age. And obviously we also need to know whether they survived or not. All these variables are in the Titanic dataset.

8.3.2 Preliminary analysis

As always, we need to start by loading the data. We can do it using the data importer (see Section 3.7), or just load it as:
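For instance, with readr (the file name below is a placeholder; use the location of your own copy of the data):

```r
library(tidyverse)
## "titanic.csv" is a hypothetical file name; read_delim() guesses
## the delimiter and column types from the file
titanic <- read_delim("titanic.csv")
```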

We may also want to keep it simple and only select the relevant variables:
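A sketch with dplyr’s select() (again assuming the data frame is called `titanic`):

```r
## keep only the variables we need for the analysis
titanic <- titanic %>%
   select(sex, pclass, age, survived)
head(titanic, 3)  # print the first three rows
```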

## # A tibble: 3 × 4
##   sex    pclass    age survived
##   <chr>   <dbl>  <dbl>    <dbl>
## 1 female      1 29            1
## 2 male        1  0.917        1
## 3 female      1  2            0

In this way we can easily print selected rows and not be distracted by a load of irrelevant columns.

Before we actually do any meaningful analysis, we should have an idea of how much good information we have in the variables. In particular: are there missings, are there invalid values, and how many good values do we have?

We can count the number of missings (NA-s) in different ways. Here we use summary() (see Section 6.4):
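For instance, for the sex variable:

```r
## one-variable summary: class and basic statistics
summary(titanic$sex)
```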

##    Length     Class      Mode 
##      1309 character character

The summary tells us that it is a “character” (i.e. text/categorical) variable, and that it does not have any missings. summary() is a good way of doing things if you want to learn more about the variables than just the number of missings. For instance, it also tells you the mean and range of numeric variables.

Alternatively, we may use is.na() and sum() (also described in Section 6.4):
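A sketch: is.na() marks each missing value as TRUE, and sum() counts the TRUE-s:

```r
## number of missing values in 'sex'
sum(is.na(titanic$sex))
```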

## [1] 0

This approach only tells us the number of missings, here “0” for no missings. It is a good choice if the number of missings is the only thing you are interested in. We can also ask for a summary() of the whole dataset and get a short report for each variable:
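For instance (using the four-variable version of the data selected above):

```r
## summary of every column in the data frame
summary(titanic)
```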

##      sex                pclass           age         
##  Length:1309        Min.   :1.000   Min.   : 0.1667  
##  Class :character   1st Qu.:2.000   1st Qu.:21.0000  
##  Mode  :character   Median :3.000   Median :28.0000  
##                     Mean   :2.295   Mean   :29.8811  
##                     3rd Qu.:3.000   3rd Qu.:39.0000  
##                     Max.   :3.000   Max.   :80.0000  
##                                     NA's   :263      
##     survived    
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.382  
##  3rd Qu.:1.000  
##  Max.   :1.000  
## 

This is an easy way to get a quick overview, but it may also give you too much irrelevant information.

All these ways of counting missings tell us the following: sex, pclass and survived do not contain any missing values. age contains 263 missings, so it is a somewhat problematic variable.

As sex is categorical, summary() does not tell much about it. We may want to print a frequency table of its values:
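One way is to pipe the variable into table() (the pipe is also why the printout below carries the “.” label in its header):

```r
## frequency table of the sex values
titanic$sex %>%
   table()
```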

## .
## female   male 
##    466    843

The table tells us that there are only two sexes in this dataset: “female” and “male” (in alphabetical order). It also tells us that there were almost twice as many men (843) as women (466) on the boat. This is good to know: we learn that we do not have to deal with erroneous or missing sexes, such as empty strings, dots, or “N/A”. It also tells us that we have plenty of men and women for further analysis.

8.4 Sea ice cover: an example analysis

Above, we discussed questions and possible answers through data. Now it is time to give an example. Let us move to the field of science and try to answer the question:

When will we see the Arctic Ocean ice-free for the first time in history?

We are using NSIDC sea ice index data, more specifically a simplified subset, available at the repo. For a little background: the Arctic Ocean has been mostly ice-covered for millions of years. However, in recent decades we see a clear downward trend, in particular in the summer cover (the Arctic Ocean has its minimum ice cover in September). Will we see an ice-free Arctic Ocean soon?

Here we will do an example analysis; we discuss how to present the results below (in Section 9.1).


As always, we cannot tell what happens in the future. Instead, what we can do is look at past trends, and then extrapolate those trends into the future. So we may want to translate the original question into a new one:

When will the ice cover trend from the past four decades reach zero (if ever)?

Note that this is not the same question: first, it is about extending past trends, and it completely ignores the physical properties of the polar regions. And second, it talks about trends, but there is also a lot of movement around the trends, and the first ice-free summer will probably occur during an exceptionally warm and sunny year. But looking at past trends is still interesting, and if we see a linear trend, then we can expect that it will last at least some time into the future.

Next, we need to know what is in the data. A good way to check is to print out a few lines:
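A sketch, assuming the data frame name `ice` and a placeholder file name (both our choices):

```r
library(tidyverse)
## "ice.csv" is a hypothetical file name for the sea ice index subset
ice <- read_delim("ice.csv")
head(ice, 3)  # print the first three rows
```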

## # A tibble: 3 × 7
##    year month `data-type` region extent  area  time
##   <dbl> <dbl> <chr>       <chr>   <dbl> <dbl> <dbl>
## 1  1978    11 Goddard     N        11.6  9.04 1979.
## 2  1978    11 Goddard     S        15.9 11.7  1979.
## 3  1978    12 Goddard     N        13.7 10.9  1979.

We can see that the rows represent months. The variables in the dataset are fairly self-explanatory. The most important ones are “extent” and “area”. “Extent” is sea ice extent: the surface area (in millions of km2) where sea ice concentration is at least 15%. “Area” is the sea surface area that is covered by ice. Because satellites have a hard time distinguishing ice covered with water from true open sea, the extent measures tend to be more precise than the area measures. “data-type” is the name of the data source, and we ignore it here. Finally, “region” is either “N” for the northern or “S” for the southern hemisphere.

Let us analyze how ice extent has changed through the last few decades. For this analysis, we need four variables: year, month, region and extent. As the first step, we can just do a basic exploratory analysis of these four variables. But first, let’s select the subset:
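A dplyr sketch (assuming the data frame is called `ice`):

```r
## keep only the four variables of interest
ice <- ice %>%
   select(year, month, region, extent)
dim(ice)  # number of rows and columns
```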

## [1] 527   4

We have 527 observations (months) and 4 columns.

Next, we need to get an idea of the date range and how good the variables are. We can rely on a simple summary here:
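For instance:

```r
## summary of every column in the four-variable subset
summary(ice)
```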

##       year          month           region              extent        
##  Min.   :1978   Min.   : 1.000   Length:527         Min.   :-9999.00  
##  1st Qu.:1989   1st Qu.: 3.500   Class :character   1st Qu.:    8.51  
##  Median :2000   Median : 6.000   Mode  :character   Median :   12.04  
##  Mean   :2000   Mean   : 6.493                      Mean   :  -26.61  
##  3rd Qu.:2011   3rd Qu.: 9.000                      3rd Qu.:   14.30  
##  Max.   :2022   Max.   :12.000                      Max.   :   16.34

All is well with year (1978-2022) and month (1-12). But the minimum extent (-9999) is clearly not feasible; after all, extent is an area and cannot be negative. This is apparently a particular way to code missing values. How many such cases do we have?
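One way is to count the rows where extent equals the missing-value code:

```r
## number of rows where extent is coded as -9999
sum(ice$extent == -9999)
```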

## [1] 2

We have two rows with missing data. These can probably just be safely ignored, and hence we filter them out:
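A filter() sketch:

```r
## keep only the rows with valid (non-missing) extent values
ice <- ice %>%
   filter(extent != -9999)
```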