Chapter 6 Cleaning and preparing data

The section Pipes and dplyr: the easy way of data manipulation introduced a popular data manipulation framework based on pipelines. Here we look at certain applications of data cleaning, including handling missing values and extracting information from character columns.

6.1 Missing values

Unfortunately, almost all datasets contain many missing values (often called missings). A missing value (the R special value NA, “not available”)2 means that we do not know the actual value for whatever reason: it is simply not there in the dataset we are working with. We have to live with missings, and there is little hope that the overall quality of the data we receive will ever improve.

It is important to realize that whatever we do with the missings, we may introduce bias in our results. The missing cases may or may not be similar to other cases, and if they are not, we necessarily base our inference on a subset of data that is not representative. In practice, it is hardly ever the case that we know why/how some cases are missing.3

Below, we are working with the Titanic data where various values are missing:

titanic <- read.csv("../data/titanic.csv.bz2")
dim(titanic)  # remember the consistency check!
## [1] 1309   14

What are the main strategies to handle missing values?

6.1.1 Counting and identifying missings

As a first step we probably want to know the distribution of missing values across the variables. There are several ways to do this; for example, summary of a dataset reports the number of missings for each variable:

summary(titanic)
##      pclass         survived         name               sex           
##  Min.   :1.000   Min.   :0.000   Length:1309        Length:1309       
##  1st Qu.:2.000   1st Qu.:0.000   Class :character   Class :character  
##  Median :3.000   Median :0.000   Mode  :character   Mode  :character  
##  Mean   :2.295   Mean   :0.382                                        
##  3rd Qu.:3.000   3rd Qu.:1.000                                        
##  Max.   :3.000   Max.   :1.000                                        
##                                                                       
##       age              sibsp            parch          ticket         
##  Min.   : 0.1667   Min.   :0.0000   Min.   :0.000   Length:1309       
##  1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000   Class :character  
##  Median :28.0000   Median :0.0000   Median :0.000   Mode  :character  
##  Mean   :29.8811   Mean   :0.4989   Mean   :0.385                     
##  3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000                     
##  Max.   :80.0000   Max.   :8.0000   Max.   :9.000                     
##  NA's   :263                                                          
##       fare            cabin             embarked             boat          
##  Min.   :  0.000   Length:1309        Length:1309        Length:1309       
##  1st Qu.:  7.896   Class :character   Class :character   Class :character  
##  Median : 14.454   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 33.295                                                           
##  3rd Qu.: 31.275                                                           
##  Max.   :512.329                                                           
##  NA's   :1                                                                 
##       body        home.dest        
##  Min.   :  1.0   Length:1309       
##  1st Qu.: 72.0   Class :character  
##  Median :155.0   Mode  :character  
##  Mean   :160.8                     
##  3rd Qu.:256.0                     
##  Max.   :328.0                     
##  NA's   :1188

The table indicates that we do not have any missings for pclass and survived, but, for instance, age has 263 missing values and body is missing for 1188 cases in total.

For individual variables we can use e.g. sum(is.na(.)):

sum(is.na(titanic$fare))
## [1] 1

to reveal that there is just a single missing fare value. This approach can also be applied to the whole dataset using sapply:

sapply(titanic, function(x) sum(is.na(x)))
##    pclass  survived      name       sex       age     sibsp     parch    ticket 
##         0         0         0         0       263         0         0         0 
##      fare     cabin  embarked      boat      body home.dest 
##         1         0         0         0      1188         0

We get a named vector of counts of NA-s for each variable.
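
If you prefer base R matrix tools, the same counts can be computed in one step: is.na applied to a whole data frame returns a logical matrix, and colSums adds up the TRUE-s column by column:

colSums(is.na(titanic))  # NA counts per column, same as the sapply call above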

Unfortunately, is.na cannot identify all the missing values: it only identifies values that are coded as the special NA symbol. But there are other ways to mark missing values. Take the variable boat as an example:

titanic$boat %>%
   head(10)
##  [1] "2"  "11" ""   ""   ""   "3"  "10" ""   "D"  ""

We see that while some of the passengers have valid lifeboat numbers, e.g. “2” and “11”, there are also many empty strings. The empty string denotes a missing value of the boat variable. Sadly, these passengers literally “missed the boat” and died. Empty strings are a common way to denote missing character values. You may also encounter whitespace-only records, or values like "unknown", "-", and others.
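
To see how widespread the problem is, we can count the empty strings in boat directly:

## number of passengers with no recorded boat
sum(titanic$boat == "", na.rm=TRUE)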

In the case of numeric data, missing values are often coded as zeros, or as some other sort of invalid value. For instance, what is the smallest ticket price paid on the Titanic?

min(titanic$fare, na.rm=TRUE)
## [1] 0

Did some people (17 in fact) really pay nothing for the 2-week trip? Note that they are probably not stowaways (it is unlikely the investigators were able to collect stowaways’ data after the disaster). We probably have to treat these values as missings. In certain datasets, such as sociological surveys, it is customary to code missings as otherwise invalid values. Fortunately, good surveys document these codes thoroughly in the codebook.
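
A quick count should reproduce those 17 zero fares (the zeros are valid numbers, so is.na does not flag them):

## how many passengers have a recorded fare of exactly 0?
sum(titanic$fare == 0, na.rm=TRUE)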

Hence another important step in identifying missings is to review the values of the variables of interest and try to identify values that are clearly invalid or out of range. For numeric values we can use the handy function range, which reports both the minimum and the maximum value. What is the recorded age range of Titanic passengers?

range(titanic$age, na.rm=TRUE)
## [1]  0.1667 80.0000

The good news is that all the ages are perfectly in the feasible range, between 2 months and 80 years of age. For character variables we may use unique or table instead:

unique(titanic$boat)
##  [1] "2"       "11"      ""        "3"       "10"      "D"       "4"      
##  [8] "9"       "6"       "B"       "8"       "A"       "5"       "7"      
## [15] "C"       "14"      "5 9"     "13"      "1"       "15"      "5 7"    
## [22] "8 10"    "12"      "16"      "13 15 B" "C D"     "15 16"   "13 15"

But we may have a hard time understanding what/where is missing if there is a large number of different values. For instance, in the case of boats, it is not immediately obvious what double values, such as “8 10”, mean. Does it mean the passenger was in one of these boats but we don’t know which one? Or moved from one boat to another? Or is it just a data entry error? We don’t know.

How to identify missings:

  • Count NA values with summary or sum(is.na(.))
  • Find invalid numeric values with range
  • Review the documentation: does it list any special codes for missing values?
  • Find suspicious character values with unique or table
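
None of these checks alone catches every problem, so it often pays to combine them. Here is a small sketch that counts empty strings in every character column, assuming, as in this dataset, that the empty string is the relevant non-NA missing marker:

## count empty strings in each character column; numeric columns report 0
sapply(titanic, function(x) {
   if (is.character(x)) sum(x == "", na.rm=TRUE) else 0
})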

6.1.2 Removing missings

After we have identified the suspicious and missing values, the next step is to either remove them or replace them with other (imputed) values. First we discuss how to remove missings.

But before we get our hands dirty we should step back and ask what question we are trying to answer. There is no universally correct way to prepare data; we always have to keep the task in mind. Let us try to answer the following question: how many passengers were there in the boats? Were some boats overloaded while other boats had empty seats?4

One may naively use na.omit to remove the missings. However, this removes every row that contains a missing value in any column, including rows we would like to keep. For instance,

na.omit(titanic) %>%
   nrow()
## [1] 119

will leave us with only 119 observations. This is because na.omit removes all missings, including missings in variables where NA does not hinder the analysis. For instance, if we also removed cases with missing body (the passenger’s body pulled out of the water and identified), we would be left with essentially no observations, because no-one got into a boat while leaving a body floating behind at the same time… So we should only remove missing cases from the relevant variables!

We can achieve this using dplyr’s filtering:

boated <- titanic %>%
   filter(!is.na(boat),  # no NA-s in boat
          boat != "")  # no empty strings in 'boat'

which results in 486 passengers.
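
As always, a quick consistency check does not hurt:

nrow(boated)  # should print 486, matching the count above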

Exercise 6.1 Check out the function dplyr::na_if and use that function combined with filtering based on NA only to achieve the same result.

See the solution

We can try to answer our question using table:

table(boated$boat) %>%
   sort(decreasing=TRUE)
## 
##      13       C      15      14       4      10       5       3      11       9 
##      39      38      37      33      31      29      27      26      25      25 
##      16       7       8       6       D      12       2       A       B       1 
##      23      23      23      20      20      19      13      11       9       5 
##   13 15     5 7     C D 13 15 B   15 16     5 9    8 10 
##       2       2       2       1       1       1       1

But we are left with the other problem unsolved: how to treat the cases with double boat numbers like “15 16”? The simplest approach is to remove those as well; after all, there are only 10 such cases. All of these cases contain a space in the boat variable, while none of the valid boat names do, so we can simply also remove the cases whose boat value contains a space. The filtering may look like

boated <- titanic %>%
   filter(!is.na(boat),  # no NA-s in boat
          boat != "",  # no empty strings in 'boat'
          !grepl(" ", boat))  # 'boat' contains no space

We are left with 476 passengers, and the number of passengers by boat is

table(boated$boat) %>%
   sort(decreasing=TRUE)
## 
## 13  C 15 14  4 10  5  3 11  9 16  7  8  6  D 12  2  A  B  1 
## 39 38 37 33 31 29 27 26 25 25 23 23 23 20 20 19 13 11  9  5

We can see that most boats had between 20 and 40 passengers, but a few boats had fewer than 10 people in them.

Unfortunately, our data cleaning most likely changed the results. When removing the cases with unclear boat numbers we, in a way, removed passengers from boats. So if we are interested in the number of passengers in boats, then the estimates we get here are too low. But it is unclear how much better we can do: we do not know whether the unclear codes all refer to a single boat, to multiple boats, or perhaps to the boats “B” and “1” where there were only a few passengers. To get an idea of the magnitude of the problem, we can count the number of such unclear cases:

sum(grepl(" ", titanic$boat))
## [1] 10

So we have 10 such cases. They probably do not change our conclusions much.

6.1.3 Recoding missings

Another common strategy, instead of removing missings, is to recode them to a certain value. This is a form of imputation, i.e. inferring the value of a missing entry from the observable data. There is no general way of doing this right: missing values are, by definition, missing, and because we do not know their value, we also cannot say whether such an imputation is correct.

However, sometimes such replacement is very much desirable. Imagine we have employee data as

employees <- data.frame(name=c("Yan Shao", "Cao Cao"),
                        phone=c("123-456", NA))
employees
##       name   phone
## 1 Yan Shao 123-456
## 2  Cao Cao    <NA>

The table works, but “NA” in place of the phone may look undesirable.

The tidyr package offers tools for replacing missings, including fill, which fills a missing value with the previous or next non-missing value, and replace_na, which replaces NA-s with given values; dplyr adds na_if, which replaces a given value with NA, and recode, which replaces certain values with other values. Let us use replace_na to replace NA with “-”. replace_na takes a data frame as its first argument and a named list of values, where each component gives the replacement value for the column of that name:

library(tidyr)
employees %>%
   replace_na(list(phone="-"))
##       name   phone
## 1 Yan Shao 123-456
## 2  Cao Cao       -
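
fill, listed above, works differently: instead of a fixed replacement value it copies the previous (or next) non-missing value along the column. A minimal sketch on made-up data (the column names here are invented for the example):

## made-up measurement series with a gap
readings <- data.frame(day=1:4, value=c(2.1, NA, NA, 3.7))
readings %>%
   fill(value)  # copy the last non-missing value downwards
## use fill(value, .direction="up") to take the next value instead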

Exercise 6.2 Consider the following data (Northern ice extent 1987):

extent <- data.frame(month=9:12, extent=c(7.28, 9.05, 11.22, -9999))
extent
##   month   extent
## 1     9     7.28
## 2    10     9.05
## 3    11    11.22
## 4    12 -9999.00
  • How is the missing value coded?
  • use dplyr::na_if to convert it to NA
  • thereafter use tidyr::fill to replace it with the value for November.

See the solution

6.2 Recoding variables

There are multiple reasons to change the way the variables are coded. A common case is where we want to collapse several categories into a single one, or a continuous range of values into a few categories. Alternatively, we may want to use more suitable names, e.g. “male” and “female” instead of “1” and “2” while leaving the categories otherwise intact.
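
For the simple renaming case, dplyr::recode (mentioned in the previous section) is often all we need. A minimal sketch on a made-up vector:

sex <- c("1", "2", "2", "1")
dplyr::recode(sex, "1" = "male", "2" = "female")  # rename the categories, keep everything else intact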

We look at two different tasks below:

6.2.1 Creating groups of continuous variables

When analyzing how human income or health changes over the life cycle, we usually want to create simple and easily understandable age groups, for instance “below 25”, “25-35”, “35-50”, and “over 50”. This can be achieved with the cut function:

## Demonstrate on a few random age values.
age <- runif(10, 20, 60)
ageGroup <- cut(age,
                breaks=c(-Inf, 25, 35, 50, Inf))
data.frame(age, ageGroup)
##         age  ageGroup
## 1  28.86406   (25,35]
## 2  20.96936 (-Inf,25]
## 3  28.28476   (25,35]
## 4  28.62934   (25,35]
## 5  37.74894   (35,50]
## 6  25.36305   (25,35]
## 7  35.62631   (35,50]
## 8  34.77263   (25,35]
## 9  46.74772   (35,50]
## 10 59.68440 (50, Inf]

We create four different age brackets by specifying the corresponding boundaries. The first group has an infinite lower bound; this is a way to say “below 25”.

This example is simple, but the simplicity is a bit misleading. First, we may want to use somewhat more conventional age group names, and second, age is hardly ever coded as a continuous variable: almost all datasets contain age as an integer number of years. Let’s make another example that demonstrates custom group names and the boundary decisions. Say, this time we do not want to pour all below-25-year-olds into a single category; we only want to include people aged 15 and above. We also hand-craft the age variable as integers:

age <- c(14,15,18,25,30,35)
ageGroup <- cut(age,
                breaks=c(15, 25, 35, 50, Inf),
                labels=c("15-25", "25-35", "35-50", "50+"))
data.frame(age, ageGroup)
##   age ageGroup
## 1  14     <NA>
## 2  15     <NA>
## 3  18    15-25
## 4  25    15-25
## 5  30    25-35
## 6  35    25-35

Now the age group names are good but the example reveals two other problems:

  • the lowest age of interest, 15, is not included in any of the groups and has turned into NA, exactly like age 14. The latter is correct but the former is not.
  • the boundary ages, here 25 and 35, are assigned to the lower age category, e.g. 25 goes to “15-25”, not “25-35”. The problem is partly of our own making: with these labels, even humans cannot tell which age group “25” should belong to.

In order to overcome these problems, we have to amend the example. First, let’s decide that we want age 25 to belong to the upper group, “25-35”, and the same for the other boundary values. Second, let’s unambiguously rename the age groups to “15-24”, “25-34” and “35-49”. Third, we tell cut that the lowest age we consider, “15”, in fact belongs to the first group by specifying include.lowest=TRUE. And finally, we specify right=FALSE so that the right-hand boundary of each interval does not belong to that interval:

ageGroup <- cut(age,
                breaks=c(15, 25, 35, 50, Inf),
                labels=c("15-24", "25-34", "35-49", "50+"),
                include.lowest=TRUE,
                           # include 15 in the '15-24' group 
                right=FALSE
                           # the boundary (e.g. 25) will belong to
                           # 'upper' group (25-34), not to the
                           # 'lower' group (15-24)
                )
data.frame(age, ageGroup)
##   age ageGroup
## 1  14     <NA>
## 2  15    15-24
## 3  18    15-24
## 4  25    25-34
## 5  30    25-34
## 6  35    35-49

While the breaks and labels arguments are easy to remember, include.lowest and right are much more confusing. This makes using cut a somewhat less convenient option.
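
If the cut interface feels error-prone, one common alternative, not otherwise used in this section, is dplyr::case_when, where every bracket is written out as an explicit condition. A sketch that reproduces the grouping above, except that the result is a plain character vector instead of a factor:

ageGroup <- dplyr::case_when(age < 15 ~ NA_character_,  # below the ages of interest
                             age < 25 ~ "15-24",
                             age < 35 ~ "25-34",
                             age < 50 ~ "35-49",
                             TRUE ~ "50+")              # everything else
data.frame(age, ageGroup)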

TBD: exercise

6.2.2 Rearranging existing groups

Often we want to rearrange existing groups. For instance, we may have district data that we want to collapse into provinces. Say, the data includes the following districts (Afghanistan district/province names): Gulran (Hirat province), Koshk (Hirat), Sanda Jan (Hirat), Spin Boldak (Kandahar), Arghistan (Kandahar), Sari Pul (Sari Pul) and Sayyad (Sari Pul). Such a task requires, in one form or another, creating a lookup table. First we create such a table explicitly, and later we just use lists of districts by province to do the grouping.

6.2.2.1 Lookup table: named vector

R does not have general lookup tables, but one can get fairly far with named vectors. Namely, indexing vectors by character strings performs vectorized name lookup (see Indexing and Named Vectors), so one can easily create key-value structures as long as the keys are character strings (vector names). Below is an example:

## We need to convert these districts to provinces
districts <- c("Sari Pul", "Spin Boldak", "Arghistan", "Gulran", "Koshk")
## create lookup table
provinceTable <- c(Gulran = "Hirat",
                   Koshk = "Hirat",
                   "Sanda Jan" = "Hirat",
                   "Spin Boldak" = "Kandahar",
                   Arghistan = "Kandahar",
                   "Sari Pul" = "Sari Pul",
                   "Sayyad" = "Sari Pul")
## and use character indexing:
provinceTable[districts]
##    Sari Pul Spin Boldak   Arghistan      Gulran       Koshk 
##  "Sari Pul"  "Kandahar"  "Kandahar"     "Hirat"     "Hirat"

We create the lookup table as a named vector: the first element has value “Hirat” and name “Gulran”, the second value is also “Hirat” but its name is “Koshk”, and so on. Note that we do not have to quote the names if they do not contain a space and do not begin with a number. The conversion, i.e. the table lookup, is the last line of the example code: it simply picks the elements whose names are stored in the variable districts. Note that the output contains two lines: the second line holds the values, the first line the corresponding (district) names. The names are usually harmless, but we can discard them if needed.
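
For instance, wrapping the lookup in unname drops the district names and keeps only the province values:

unname(provinceTable[districts])  # provinces only, no district names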

The named vector approach is often very convenient, in particular when there are many values to categorize. The same lookup can also be performed in other ways, e.g. by using match instead of named vectors, or fastmatch::fmatch if one wants to speed up the lookup with hash tables.
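
For completeness, here is roughly how the same lookup looks when done with match: it finds the position of each district among the table’s names and uses those positions as an index (using the objects defined above):

idx <- match(districts, names(provinceTable))  # positions of the districts in the lookup table
unname(provinceTable)[idx]                     # provinces in the same order as 'districts'
## fastmatch::fmatch() is a drop-in replacement for match() that uses a hash table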

6.2.2.2 Manually replacing the values

Another handy alternative to lookup tables is just to use logical indexing to manually fill in the province vector. We can achieve this as follows:

## create an empty character vector of provinces, one for each district
provinces <- character(length(districts))
## 'Hirat' for all Hirat districts
provinces[districts %in% c("Gulran", "Koshk", "Sanda Jan")] <- "Hirat"
## 'Kandahar' for all respective districts
provinces[districts %in% c("Spin Boldak", "Arghistan")] <- "Kandahar"
## 'Sari Pul' for the Sari Pul districts
provinces[districts %in% c("Sari Pul", "Sayyad")] <- "Sari Pul"
## we are done; any unmatched districts remain empty strings
provinces
## [1] "Sari Pul" "Kandahar" "Kandahar" "Hirat"    "Hirat"

We start the example by creating an empty character vector with one element for each district, and thereafter use the %in% operator to check whether the districts belong to the district list of each province. We then fill in the province name for those districts.

The final result is the same as with the lookup table.


  2. Note that this is a distinct value from NaN, “not a number”. NaN is the result of undefined mathematical operations, such as 0/0, and does not mean that we do not know the result.↩︎

  3. An example where we can safely ignore missings is when someone randomly deleted some of the values, for instance when for confidentiality reasons we are only given a subset of a larger dataset.↩︎

  4. We know from the survivors’ accounts that the boats weren’t overloaded; some were near capacity while others were half-empty.↩︎