Chapter 6 Cleaning and preparing data
The section Pipes and dplyr: the easy way of data manipulation introduced a popular data manipulation framework through pipelines. Here we look at certain application of data cleaning, including handling missing values and information from character columns.
6.1 Missing values
Unfortunately, almost all datasets contain many missing values (often
called missings).
Missing value (R special value NA
–not available)2
means that we do not know the
actual value for whatever reason. It is just not there in the dataset
we are working with. So we have to live with missings, and there is
little hope that the overall quality of data will ever improve.
It is important to realize that whatever we do with the missings, we may introduce bias in our results. The missing cases may or may not be similar to other cases, and if they are not, we necessarily base our inference on a subset of data that is not representative. In practice, it is hardly ever the case that we know why/how some cases are missing.3
Below, we are working with the Titanic data where various values are missing:
read.csv("../data/titanic.csv.bz2")
titanic <-dim(titanic) # remember the consistency check!
## [1] 1309 14
What are the main strategies to handle missing values?
6.1.1 Counting and identifying missings
As the first step we probably want to know the distribution of missing
values across the variables. There are several ways to do
it. e.g. summary
of a dataset reports the number of missings for
each variables:
summary(titanic)
## pclass survived name sex
## Min. :1.000 Min. :0.000 Length:1309 Length:1309
## 1st Qu.:2.000 1st Qu.:0.000 Class :character Class :character
## Median :3.000 Median :0.000 Mode :character Mode :character
## Mean :2.295 Mean :0.382
## 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :1.000
##
## age sibsp parch ticket
## Min. : 0.1667 Min. :0.0000 Min. :0.000 Length:1309
## 1st Qu.:21.0000 1st Qu.:0.0000 1st Qu.:0.000 Class :character
## Median :28.0000 Median :0.0000 Median :0.000 Mode :character
## Mean :29.8811 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.0000 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.0000 Max. :8.0000 Max. :9.000
## NA's :263
## fare cabin embarked boat
## Min. : 0.000 Length:1309 Length:1309 Length:1309
## 1st Qu.: 7.896 Class :character Class :character Class :character
## Median : 14.454 Mode :character Mode :character Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## body home.dest
## Min. : 1.0 Length:1309
## 1st Qu.: 72.0 Class :character
## Median :155.0 Mode :character
## Mean :160.8
## 3rd Qu.:256.0
## Max. :328.0
## NA's :1188
The table indicates that we do not have any missings for pclass and survived but for instance age has 263 missing values and body is missing for 1188 cases in total.
For individual variables we can use e.g. sum(is.na(.))
:
sum(is.na(titanic$fare))
## [1] 1
to reveal that there is just a single missing fare information. This
approach can also be applied to the whole dataset using sapply
:
sapply(titanic, function(x) sum(is.na(x)))
## pclass survived name sex age sibsp parch ticket
## 0 0 0 0 263 0 0 0
## fare cabin embarked boat body home.dest
## 1 0 0 0 1188 0
We get a named vector of counts of NA
-s for each variable.
Unfortunately, is.na
cannot identify all the missing values. It
only identifies values that are coded as the special NA
symbol. But
there are other ways to mark missing values. Take an example of the
variable boat:
$boat %>%
titanic head(10)
## [1] "2" "11" "" "" "" "3" "10" "" "D" ""
We see that while some of the passengers have valid lifeboat numbers,
e.g. “2” and “11”, there are also many empty strings. The
empty string denotes missing value of the boat variable. Sadly,
these passenger literally “missed the boat”, and died.
Empty strings are a common way to denote missing character values.
You may also encounter just whitespace records, or values
"unknown"
, "-"
, and others.
In case of numeric data, the missing values are often coded as zeros, or some sort of invalid values. For instance, what is the smallest ticket price paid on titanic?
min(titanic$fare, na.rm=TRUE)
## [1] 0
Did some people (17 in fact) really pay nothing for the 2-week trip? Note that they are probably not stowaways—it is unlikely the investigators were able to collect stowaways’ data after the disaster. We probably have to treat these values as missings. In certain datasets, like sociological surveys, it is customary to code missings as invalid values. Fortunately good surveys document the values thoroughly in the codebook.
Hence another important step in identifying missings is to review of
the values of the variables of interest, and try to identify values
that are clearly invalid or out of range. For numeric values, we can
use a handy function range
that reports both minimum and maximum
values.
What is the recorded age range of Titanic passengers?
range(titanic$age, na.rm=TRUE)
## [1] 0.1667 80.0000
The good news is that all the ages are perfectly in the feasible range,
between 2 months and 80 years of age. For character variables we may
use unique
or table
instead:
unique(titanic$boat)
## [1] "2" "11" "" "3" "10" "D" "4"
## [8] "9" "6" "B" "8" "A" "5" "7"
## [15] "C" "14" "5 9" "13" "1" "15" "5 7"
## [22] "8 10" "12" "16" "13 15 B" "C D" "15 16" "13 15"
But we may have hard time to understand what/where is missing if there is a large number of different values. For instance, in case of boats, it is not immediately obvious what do double values, such as “8 10” mean. Does it mean the passenger was in one of these boats but we don’t know in which one? Or moved from one boat to another? Or just a data entry error? We don’t know.
How to identify missings:
- Count
NA
values withsummary
orsum(is.na(.))
- Find invalid numeric values with
range
- Review the documentation: does it list any special codes for missing values?
- Find suspicious character values with
unique
ortable
6.1.2 Removing missings
After we have identified the suspicious and missing values, the next step is to either remove those, or replace those with other values (imputed values). First we discuss how to remove missings.
But before we get our hands dirty we should look a bit further and ask what is the question we try to answer. There is no universally correct way to prepare data, we have always to keep the task in mind. Let us try to answer the following question: how many passengers were there in the boats? Were some boats overloaded while other boats had empty seats?4
One may naively use na.omit
in order to remove the missings.
However, this will remove all rows that contain a missing value,
even for cases where we do want to preserve those. For instance,
na.omit(titanic) %>%
nrow()
## [1] 119
will leave us with 119 observations only. This is because na.omit
removes all
missings, including missings in variables where NA
does not
hinder the analysis. For instance, if we also remove missing values
for body (passenger’s body pulled out of water and identified), we
should be left with no observations at all because no-one got into a boat
while leaving a body floating at the same time… So we should only
remove missing cases from the relevant variables!
We can achieve this using dplyr’s filtering:
titanic %>%
boated <- filter(!is.na(boat), # no NA-s in boat
!= "") # no empty strings in 'boat' boat
which results in 486 passengers.
Exercise 6.1 Check out the function dplyr::na_if
and use that function combined
with filtering based on NA
only to achieve the same result.
See the solution
We can try to answer our question using table
:
table(boated$boat) %>%
sort(decreasing=TRUE)
##
## 13 C 15 14 4 10 5 3 11 9
## 39 38 37 33 31 29 27 26 25 25
## 16 7 8 6 D 12 2 A B 1
## 23 23 23 20 20 19 13 11 9 5
## 13 15 5 7 C D 13 15 B 15 16 5 9 8 10
## 2 2 2 1 1 1 1
But we are left with the other problem unsolved: how to treat the cases with double boat numbers like “15 16”? The simplest approach is to remove those as well, after all there are only 10 such cases. All these cases contain a space in the boat variable while none of the valid boat names do, so we can just also remove the cases with space. The filtering may look like
titanic %>%
boated <- filter(!is.na(boat), # no NA-s in boat
!= "", # no empty strings in 'boat'
boat !grepl(" ", boat)) # 'boat' contains no space
We are left with 476 passengers, and the number of passengers by boat is
table(boated$boat) %>%
sort(decreasing=TRUE)
##
## 13 C 15 14 4 10 5 3 11 9 16 7 8 6 D 12 2 A B 1
## 39 38 37 33 31 29 27 26 25 25 23 23 23 20 20 19 13 11 9 5
We can see that the most boats had between 20 and 40 passengers, but there were a few boats with less than 10 people in.
Unfortunately, our data cleaning most likely changed the results. When removing the cases with unclear boat number, we in a way removed passengers from boats. So if we are interested in the number of passengers in boats then the estimates we get here are too low. But it is unclear how much better can we do—we do not know if the unclear codes were all from a single boat, from multiple boats, or perhaps from the boats “B” and “1” where there were only few passengers. To have an idea of the magnitude of the problem, we can just count the number of such unclear cases:
sum(grepl(" ", titanic$boat))
## [1] 10
So we have 10 such cases. It is probably not changing our conclusion much.
6.1.3 Recoding missings
Another common strategy, instead of removing missings, is to recode those to a certain value. This is a form of imputing, i.e. inferring the value for a missing from the observable data. There is no general way of doing this right, as by definition missing values are missing, and because we do not know their value, we also cannot say if such imputation is correct.
However, sometimes such replacement is very much desirable. Imagine we have employee data as
data.frame(name=c("Yan Shao", "Cao Cao"),
employees <-phone=c("123-456", NA))
employees
## name phone
## 1 Yan Shao 123-456
## 2 Cao Cao <NA>
The table works, but “NA” in place of the phone may look undesirable.
tidyr package offers various tools for replacing missings, including
fill
to fill a missing value with previous or next non-missing
value, replace_na
to replace NA
-s with given values, na_if
to
replace a given value with NA
, and recode
to replace certain
values with other values. Let us use replace_na
to replace NA
with “-”. Replace takes a data frame as the first argument, and a
named list of values, where the component corresponds to the
replacement value in that column:
library(tidyr)
%>%
employees replace_na(list(phone="-"))
## name phone
## 1 Yan Shao 123-456
## 2 Cao Cao -
Exercise 6.2 Consider the following data (Northern ice extent 1987):
data.frame(month=9:12, extent=c(7.28, 9.05, 11.22, -9999))
extent <- extent
## month extent
## 1 9 7.28
## 2 10 9.05
## 3 11 11.22
## 4 12 -9999.00
- How is the missing value coded?
- use
dplyr::na_if
to convert it toNA
- thereafter use
tidyr::fill
to replace it with the value for November.
See the solution
6.2 Recoding variables
There are multiple reasons to change the way the variables are coded. A common case is where we want to collapse several categories into a single one, or a continuous range of values into a few categories. Alternatively, we may want to use more suitable names, e.g. “male” and “female” instead of “1” and “2” while leaving the categories otherwise intact.
We look at two different tasks below:
- splitting a continous variable into categories
- combining existing categories into more general categories
6.2.1 Creating groups of continuous variables
When analyzing how human income or health changes through the
life cycle we usually want to create simple and easily understandable
age groups. For instance, “below 25”, “25-35”, “35-50”, and “over
50”. This can be achieved with cut
function:
## Demonstrate on a few random age values.
runif(10, 20, 60)
age <- cut(age,
ageGroup <-breaks=c(-Inf, 25, 35, 50, Inf))
data.frame(age, ageGroup)
## age ageGroup
## 1 28.86406 (25,35]
## 2 20.96936 (-Inf,25]
## 3 28.28476 (25,35]
## 4 28.62934 (25,35]
## 5 37.74894 (35,50]
## 6 25.36305 (25,35]
## 7 35.62631 (35,50]
## 8 34.77263 (25,35]
## 9 46.74772 (35,50]
## 10 59.68440 (50, Inf]
We create four different age brackets by specifying the corresponding boundaries. The first group has infinite lower bound, this is a way to tell “below 25”.
This example is simple but the simplicity is a bit misleading. First, we may want to use somewhat more conventional age group names, and second–age is hardly ever coded as a continuous variable. Almost all dataset contain age as integer number of years. Let’s make another example where we demonstrate custom group names and the boundary decision. Say, this time we do not want to pour all below-25-year-olds into a single category, we only want to include people 15 and above. We also hand-carve the age variable as integers:
c(14,15,18,25,30,35)
age <- cut(age,
ageGroup <-breaks=c(15, 25, 35, 50, Inf),
labels=c("15-25", "25-35", "35-50", "50+"))
data.frame(age, ageGroup)
## age ageGroup
## 1 14 <NA>
## 2 15 <NA>
## 3 18 15-25
## 4 25 15-25
## 5 30 25-35
## 6 35 25-35
Now the age group names are good but the example reveals two other problems:
- the lowest age of interest–15–is not included into the groups, and has turned into NA, exactly as age 14. The latter is correct but the former is not.
- the boundary ages, here 25 and 35 are assigned into the lower age category, e.g. 25 goes to “15-25”, not “25-35”. The problem is partly of our own making: even humans cannot tell in exactly which age groups should “25” belong.
In order to overcome these problems, we have to amend the example.
First, let’s decide we want age 25 to belong to the upper group,
“25-35”, and the same for other boundary age values. Second, let’s
unambiguously rename the age groups to “15-24”, “25-34” and “35-49”.
Third, we want to tell cut
that the lowest age we consider,
“15”, in fact belongs to the first group by specifying
include.lowest=TRUE
. And finally, we say that the right-hand
boundary does not belong to this group by specifying right=FALSE
:
cut(age,
ageGroup <-breaks=c(15, 25, 35, 50, Inf),
labels=c("15-24", "25-34", "35-49", "50+"),
include.lowest=TRUE,
# include 15 in the '15-24' group
right=FALSE
# the boundary (e.g. 25) will belong to
# 'upper' group (25-34), no to the
# 'lower' group (15-24)
)data.frame(age, ageGroup)
## age ageGroup
## 1 14 <NA>
## 2 15 15-24
## 3 18 15-24
## 4 25 25-34
## 5 30 25-34
## 6 35 35-49
While the breaks
and labels
arguments are easy to remember,
include.lowest
and right
are much more confusing. This makes
using cut
a somewhat less convenient option.
TBD: exercise
6.2.2 Rearranging existing groups
Often we want to re-arrange existing groups. For instance, we may have district data that we want to collapse into provinces. Say, the data includes the following districts (Afghanistan district/province names): Gulran (Hirat province), Koshk (Hirat), Sanda Jan (Hirat), Spin Boldak (Kandahar), Arghistan (Kandahar), Sari Pul (Sari Pul) and Sayyad (Sari Pul). Such a task requires, in one for or another, to create a lookup table. First we create such table explicitly, and later we use just lists of districts by province to do the grouping.
6.2.2.1 Lookup table: named vector
R does not have general lookup tables, but one can get fairly far with named vectors. Namely, indexing vectors by character strings performs vectorized name lookup (see Indexing and Named Vectors), so one can easily create key-value structures as far as the keys are character strings (vector names). Below is an example:
## We need to convert these districts to provinces
c("Sari Pul", "Spin Boldak", "Arghistan", "Gulran", "Koshk")
districts <-## create lookup table
c(Gulran = "Hirat",
provinceTable <-Koshk = "Hirat",
"Sanda Jan" = "Hirat",
"Spin Boldak" = "Kandahar",
Arghistan = "Kandahar",
"Sari Pul" = "Sari Pul",
"Sayyad" = "Sari Pul")
## and use character indexing:
provinceTable[districts]
## Sari Pul Spin Boldak Arghistan Gulran Koshk
## "Sari Pul" "Kandahar" "Kandahar" "Hirat" "Hirat"
We create the lookup table as named vector, e.g. the first element has value “Hirat” and name “Gulran”, second value is also “Hirat” but name is “Koshk”, and so on. Note that we do not have to quote the names if they do not contain a space, or do not begin with a number. The conversion–table lookup–is the last row in the example code, it just find elements that have names, stored in the variable districts. Note the output contains two lines–the second line are the values, the first line the corresponding (district) names. Names are usually harmless but we can discard those if needed.
Named vector approach is often very convenient, in particular where
there are many values to categorize. It can be performed in different
ways, e.g. by using match
instead of named vectors, or
fastmatch::fmatch
if one wants to speed up the lookup with hash
tables.
6.2.2.2 Manually replacing the values
Another handy alternative to lookup tables it just to use logical indexing to manually fill out the province table. We can achieve this as follows:
## create empty list of provinces
character(length(districts))
provinces <-## 'Herat' for all Herat districts
%in% c("Gulran", "Koshk", "Sanda Jan")] <- "Hirat"
provinces[districts ## 'Kandahar' for all respective districts
%in% c("Spin Boldak", "Arghistan")] <- "Kandahar"
provinces[districts ##
%in% c("Sari Pul", "Sayyad")] <- "Sari Pul"
provinces[districts ## We are done, undefined will remain empty strings
provinces
## [1] "Sari Pul" "Kandahar" "Kandahar" "Hirat" "Hirat"
We start the example by creating and empty character vector for each
district, and thereafter use %in%
operator to check if districts
belong to the district list for each province. We fill in the
province name for those districts.
The final result is the same as in case of the lookup tables.
Note that this is distinct value from
NaN
–not a number.NaN
is the result of undefined mathematical operations, such as 0/0, and does not mean that we do not know the result.↩︎An example where we can safely ignore missings is when someone randomly deleted some of the values, for instance when for confidentiality reasons we are only given a subset of the larger dataset↩︎
We know from the survivors’ accounts that the boats weren’t overloaded, some were near capacity while the others where half-empty↩︎