Chapter 6 Cleaning and Manipulating Data

This section explains and demonstrates common data cleaning and preparation tasks using pandas. The aim here is mostly to introduce various useful functions and to show how to solve common tasks; we do not discuss fundamental data processing problems in much depth.

We use the following libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

6.1 Missing observations

Most datasets we encounter contain some missing values. Such missing values, often referred to as just missings, tend to propagate through and invalidate the computations we do with data. Obviously, whatever we want to achieve, we cannot do it if the data is not there. We have to learn to handle missings and live with them.

6.1.1 How missing data is coded

Python has a dedicated value for missing data: np.nan. Internally, this corresponds to the IEEE not-a-number code, and if a nan enters computations, the result will in most cases be missing too. Note two implications of Python having just a single type of missing value:

  • As np.nan is numeric, a series with a missing value must be numeric, or of the generic type object. For instance, when adding a missing value to a logical series, the result will be of type object.
  • In languages such as R, where there are both “not-a-number” (NaN) and “not available” (NA) values, NaN is typically used to denote invalid mathematical results (such as the log of a negative number), and NA is used to denote missing data. If such a distinction is needed, one can add a separate variable and encode the type and reason for missingness there.

But not all missing values are coded using np.nan. For instance, in social sciences it is customary to denote missings by negative numbers. The World Values Survey codes all missing responses as -1: don’t know; -2: no answer; -3: not applicable; -4: not asked in survey; -5: missing, unknown. Other surveys may use positive out-of-range values, e.g. missing income may be coded as 90000 or 99999 or something similar. It is important to be aware of the coding of missings in the data at hand.

In practice, one also encounters missings that are not specifically coded. For instance, one may see age “200”. This is clearly wrong (as long as we talk about human age in years), and the documentation may be silent about such values. We typically do not know why such values are there. Perhaps it is a typo for “20”, or perhaps it is some sort of undocumented coding.

Below we walk through the main tools in pandas and numpy that help to identify, remove, or replace missing values. However, as the dedicated tools only work with np.nan codes, we also give examples of how to handle custom codes and data entry errors.

6.1.2 Removing missing observations

6.1.2.1 Handling np.nan-s

We demonstrate working with missings using Titanic data:

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic.shape
## (1309, 14)

So the original data contains 1309 observations.

Before we start removing missings, it is instructive to see which variables are the most problematic ones. We can start with the count method: it tells the number of valid observations, either for every variable of a data frame or for a single variable. “Valid” here means anything but np.nan. So the number of valid observations for each variable is

titanic.count()
## pclass       1309
## survived     1309
## name         1309
## sex          1309
## age          1046
## sibsp        1309
## parch        1309
## ticket       1309
## fare         1308
## cabin         295
## embarked     1307
## boat          486
## body          121
## home.dest     745
## dtype: int64

We can see that some variables, e.g. pclass and survived, have valid values for all 1309 cases, while body has only 121 valid values.

If we want to count missings (i.e. np.nan-s) instead of the valid values, we can combine two methods: isna (or isnull) tells for each value whether it is missing (True or False), and sum adds up these values, counting True as one and False as zero. Like count, this approach produces the count of missings for every variable if applied to a data frame, or for just a single variable if applied to a series. So we can get the number of missings as:

titanic.isna().sum()
## pclass          0
## survived        0
## name            0
## sex             0
## age           263
## sibsp           0
## parch           0
## ticket          0
## fare            1
## cabin        1014
## embarked        2
## boat          823
## body         1188
## home.dest     564
## dtype: int64

Obviously, we can get the same result by 1309 - titanic.count().

These two examples demonstrated count and isna as applied to data frames. If we want these counts for a single variable only, we can select that variable (as a series) first:

titanic.boat.count()
## 486

applies the count method to the series titanic.boat.

A good way to remove missing observations is the dropna method. By default, it removes all rows that contain any missing value. If our analysis only requires certain variables, we should not remove all such rows; we should remove just those rows where the important variables are missing. Imagine we want to analyze the variables age and home.dest. We can remove rows with missings in only those two variables by using the subset argument:

t = titanic.dropna(subset=["age", "home.dest"])
t.shape
## (685, 14)

We are left with 685 rows. However, if we carelessly remove rows that have any missing value

t = titanic.dropna()
t.shape
## (0, 14)

We have no observations left! There is not a single row in the Titanic data with valid values for all variables. As the tables above suggest, the main culprits are cabin, body and boat. In particular, those who made it to a boat (and have a valid value for boat) did not die, and so their body was not pulled out of the water; hence the variable body is missing for them.

While the error here is easy to spot (all our analysis below will fail with no observations), in other cases such an error may lead to suspicious results, or even to results that are wrong without looking suspicious at all. Imagine a case where dropping missings shrinks our dataset by 90% because an irrelevant variable is mostly missing. Unless we check the resulting number of rows, we may complete our analysis on a completely wrong subset.
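
A cheap safeguard is to always check how many rows survive the drop. A small sketch:

t = titanic.dropna(subset=["age", "home.dest"])
# report the surviving share of rows
print("kept", t.shape[0], "of", titanic.shape[0], "rows")
## kept 685 of 1309 rows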

6.1.2.2 Handling other types of missing coding

As explained above, dropna only drops the technical missing values, the special value np.nan. But missings are often coded in many different ways. For instance, missing age may be marked as a negative number, or as a number out of the realistic range (e.g. 900). A missing string may be marked as an empty string, as “N/A”, or in countless other ways. dropna does not remove such cases.

The first step one should take here is to consult the data documentation. Some datasets, such as large sociological surveys, have fairly extensive documentation where the coding of values is very well explained. But many datasets have very sparse documentation, if any at all, and even well-documented high-quality datasets may contain errors.

In such cases one should rely on descriptive methods to spot potential suspicious values. In the case of numeric values, the most obvious starting point is to check the range. For instance, what are the smallest and largest values of age in the Titanic data?

titanic.age.min(), titanic.age.max()
## (0.1667, 80.0)

We see that the youngest passenger is 2 months old and the oldest 80 years old. This is definitely within the realistic age range, so there is no reason to remove cases based on age. Note that this only tells us that all age values are within a reasonable range, not that they are correct. They may well be all wrong!
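
If we do find coded missings, we can convert them to np.nan ourselves so that the dedicated tools apply. A minimal sketch, assuming WVS-style negative codes in a small hypothetical series:

kids = pd.Series([0, 2, -1, 3, -2])  # hypothetical WVS-style coding
kids.where(kids >= 0)  # keep non-negative values, turn the rest into NaN
## 0    0.0
## 1    2.0
## 2    NaN
## 3    3.0
## 4    NaN
## dtype: float64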

Exercise 6.1 Does the variable fare contain any missings (np.nan-s)? What about its values: are all these reasonable, or are there values you think are out of reasonable range?

See the solution

6.1.3 Replacing missing values

Sometimes we want to replace all missings with a given value. This can be done using the ordinary data frame tools. Consider a series

populations = pd.Series([32.7, 267.7, np.nan], index=["MY", "ID", "KH"])
populations
## MY     32.7
## ID    267.7
## KH      NaN
## dtype: float64

We can fill in a desired value, e.g. 0, using

populations[populations.isna()] = 0
populations
## MY     32.7
## ID    267.7
## KH      0.0
## dtype: float64

Alternatively, we can use the fillna method to replace all missings with a given value.
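
For instance, re-creating the series above and filling the missing entry with 0:

pd.Series([32.7, 267.7, np.nan], index=["MY", "ID", "KH"]).fillna(0)
## MY     32.7
## ID    267.7
## KH      0.0
## dtype: float64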

However, it is not a suitable method in all cases. For instance, when working with a time series, it may make more sense to fill a rare missing with the previous (or the following) value, instead of one constant value everywhere. Look at the following example:

populations = pd.Series([23_618_200, np.nan, 23_726_460],
                        index=[2016, 2017, 2018])  # TW population
populations
## 2016    23618200.0
## 2017           NaN
## 2018    23726460.0
## dtype: float64

As the population numbers are rather similar, we may opt for a simple method: use the 2016 population value for year 2017 too, instead of interpolating. This can be achieved with the Series.fillna method:

populations.fillna(method="ffill")
## 2016    23618200.0
## 2017    23618200.0
## 2018    23726460.0
## dtype: float64
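
(In recent pandas versions the method argument of fillna is deprecated; the equivalent populations.ffill() is the preferred spelling of the same operation.)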

However, one must be careful when filling in missing data. Choosing the previous value is justified when the series describes the same object. But sometimes a series describes different objects. Consider the case:

populations = pd.DataFrame({"population": [23_618_200, 23_674_546,
                                           np.nan, 94_600_648],
                            "country": ["TW", "TW", "VN", "VN"]}, 
                        index=[2016, 2017, 2016, 2017])  # TW population

Here the first two rows describe Taiwan and the following two Vietnam. If we just use .fillna() with no consideration of the countries, we get

populations.fillna(method="ffill")
##       population country
## 2016  23618200.0      TW
## 2017  23674546.0      TW
## 2016  23674546.0      VN
## 2017  94600648.0      VN

i.e. the population of Taiwan in 2017 is used as a proxy for the population of Vietnam in 2016! This is clearly nonsense. The solution is to use grouping:

populations.groupby("country").fillna(method="ffill")
##       population
## 2016  23618200.0
## 2017  23674546.0
## 2016         NaN
## 2017  94600648.0

This time the missing value was not filled, because within the Vietnam group there is no earlier population value that could be used for 2016. (In this particular case you may consider method="bfill" instead, to fill the values backward.)
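
For instance, backward filling within the country groups borrows Vietnam’s 2017 value for 2016:

populations.groupby("country").fillna(method="bfill")
##       population
## 2016  23618200.0
## 2017  23674546.0
## 2016  94600648.0
## 2017  94600648.0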

See also more complex methods, e.g. DataFrame.interpolate.

6.1.4 Missing values and mathematical operations

Pandas’ methods .sum and .mean ignore missings. Consider a series of passenger ages

age = pd.Series([5, 25, np.nan])

The average age in this example, when ignoring missings, is 15. If we use .mean, that is indeed what we get:

age.mean()
## 15.0

However, if you sum and divide by the length of age, the answer will be 10:

age.sum()/len(age)
## 10.0

This is because len does not ignore missings, and hence we divide the sum over the valid values (30) by the total number of cases (3).
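
If we want to compute the mean by hand while still ignoring missings, we should divide by count instead, as it skips np.nan-s:

age.sum()/age.count()
## 15.0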

However, when we compare a missing value with a number, the result is not a missing value but False. For instance, when we want to compute the proportion of children in the data, we may do it as

child = age < 14
child
## 0     True
## 1    False
## 2    False
## dtype: bool

We lost the information about a missing value in the process, and hence

(age < 14).mean()
## 0.3333333333333333

is 0.333, not 0.5 as we might have thought.
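
If we want the proportion of children among the valid ages only, we can again divide by count:

(age < 14).sum()/age.count()
## 0.5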

This stresses the importance of addressing missing values in data, and not just relying on pandas default functionality.

6.2 Converting variables

6.2.1 Converting Categorical Variables

Sometimes we encounter situations where the categories provided in the data do not correspond well to the needs of our analysis. Two common cases are original data that is too granular (e.g. we are interested in a few broad economic sectors but the data lists 27 industries), and categories that are too verbose (e.g. occupation may be listed as “Craftsmen, Foremen and kindred” while we may prefer the shorter “skilled”). In these cases we have to convert the original categories into new ones, possibly merging several original categories into a single new one. There are several ways to proceed. We demonstrate the options using a case where the data is numerically coded, but not all codes correspond to real numeric values.

Imagine we are interested in the number of children in a family. This question is often present in various surveys. For instance, the World Values Survey provides the number of children as an integer between 0 and 8, where 8 means “eight or more”. In addition, negative numbers mean various types of missing data (-1: don’t know, -2: no answer, -3: not applicable, -4: not asked, -5: inappropriate response). But assume we just want a 3-category response, either “0”, “1”, or “2 or more”, with all missings lumped together into NA.

Consider this test data:

wvs = pd.DataFrame({"kids": [0,2,4,8,-1,-2]})
wvs
##    kids
## 0     0
## 1     2
## 2     4
## 3     8
## 4    -1
## 5    -2

6.2.1.1 Using pd.cut

pd.cut is a dedicated pandas function for cutting continuous numeric data into pre-defined intervals. It is called as

pd.cut(x, bins, right=True, labels=None)

where x is the variable to be cut, bins are the interval boundaries, labels are the interval names, and right tells whether the right boundary value is included in the interval (right=True means the intervals are closed on the right, i.e. of the form (a, b]). It is designed primarily for cutting continuous values, but it can easily be used for discrete values too. Let’s split this data into the desired intervals:

wvs["kids1"] = pd.cut(wvs.kids,
                      bins = [-np.inf, 0, 1, 2, np.inf],
                      labels = ["NA", "0", "1", "2 or more"],
                      right=False)
wvs
##    kids      kids1
## 0     0          0
## 1     2  2 or more
## 2     4  2 or more
## 3     8  2 or more
## 4    -1         NA
## 5    -2         NA

We set the boundaries at 0, 1, 2, and keep the edges open by setting the first and last boundary to infinity. We give the intervals explicit labels; as there are no string-type missing values, we include an explicit “NA” label for the missings. Finally, we tell cut to exclude the right boundary value from the interval (right=False); otherwise the value 0 would fall into category “NA”, value 1 into “0”, and so on.

Exercise 6.2 Load the males dataset. Use pd.cut to convert variable school (years of schooling) into categories “less than HS” (less than 12), “HS” (12), “Some college” (13-15) and “College” (16).

See the solution

6.2.1.2 Using np.where

Another option is np.where, a vectorized version of the if-else statement. np.where takes three arguments: a vectorized logical condition, the value if the condition is true, and the value if it is false. So we can transform the data into the desired form in two steps, one step for each transformation:

# replace all >2 kids with "2"
wvs["kids2"] = np.where(wvs.kids > 2, 2, wvs.kids)
# replace all negative values with NaN
wvs["kids2"] = np.where(wvs.kids < 0, np.nan, wvs.kids2)
wvs
##    kids      kids1  kids2
## 0     0          0    0.0
## 1     2  2 or more    2.0
## 2     4  2 or more    2.0
## 3     8  2 or more    2.0
## 4    -1         NA    NaN
## 5    -2         NA    NaN

As we can see, all values of “kids” above 2 are converted to 2, and all negative values are converted to missings. (Note that as numpy does not have integer missing values, the kids2 variable is a float.) This solution may be sufficient in many cases. However, we may also want to convert the categories to text, as we cannot encode “2 or more” as a number. This can be achieved by adding the .astype(str) conversion inside np.where:

# replace, and convert to string
wvs["kids2"] = np.where(wvs.kids >= 2, "2 or more", wvs.kids.astype(str))
wvs["kids2"] = np.where(wvs.kids < 0, np.nan, wvs.kids2)
wvs
##    kids      kids1      kids2
## 0     0          0          0
## 1     2  2 or more  2 or more
## 2     4  2 or more  2 or more
## 3     8  2 or more  2 or more
## 4    -1         NA        NaN
## 5    -2         NA        NaN

Note that as there are no string-type missings, the “kids2” column is now object, a mixed type.

6.2.1.3 Replacing selected values in dataframe

The final approach we discuss here is to modify selected observations in the data frame based on a logical condition. We create a new variable as a copy of the existing one, and thereafter modify it according to our needs. Note that this requires a mixed indexing approach: we index the rows by a logical vector (the condition) and the column by its name. This can be achieved with .loc:

wvs["kids4"] = wvs.kids.astype(str)
wvs.loc[wvs.kids >= 2, "kids4"] = "2 or more"
wvs.loc[wvs.kids < 0, "kids4"] = np.nan
wvs
##    kids      kids1      kids2      kids4
## 0     0          0          0          0
## 1     2  2 or more  2 or more  2 or more
## 2     4  2 or more  2 or more  2 or more
## 3     8  2 or more  2 or more  2 or more
## 4    -1         NA        NaN        NaN
## 5    -2         NA        NaN        NaN

6.2.2 Converting categorical variables to dummies

In the previous section we looked at converting variables to categoricals, or converting one set of categories to another. But in many cases (e.g. when inserting categorical variables into statistical models) we need to do the opposite: convert categoricals to vectors of numbers. Here we demonstrate this by using pd.get_dummies to convert categories to dummies.

Imagine we work on income data (dataset males, dataset Males in R package Ecdat):

males = pd.read_csv("../data/males.csv.bz2", sep="\t")\
    [["wage", "ethn", "maried"]].sample(10)
males
##           wage   ethn maried
## 456   2.236457  other     no
## 3349  0.984302  black    yes
## 2032  1.459484  other     no
## 973   1.943643  other    yes
## 2349  1.432757  other    yes
## 4121  1.634756  other     no
## 498   1.161928  other     no
## 1351  1.381506  black     no
## 4065  1.053912  other    yes
## 3830  0.898706   hisp    yes

The subset of 10 cases contains three variables: wage (log hourly wage in the 1980s), ethn (race: black, hisp, other) and maried (marital status, yes or no). Our task is to combine all these variables into a model. As log wage is already numeric, we do not have to do anything with it. But both ethn and maried are categoricals, so we convert them to dummies first.

The function pd.get_dummies converts a series into a data frame of dummies. In its simplest form, we can use it like this:

pd.get_dummies(males.maried)
##       no  yes
## 456    1    0
## 3349   0    1
## 2032   1    0
## 973    0    1
## 2349   0    1
## 4121   1    0
## 498    1    0
## 1351   1    0
## 4065   0    1
## 3830   0    1

It returns a data frame of dummies where the column names correspond to the names of the categories. Note also that it preserves the original index, visible as the first column in the example table.

One can specify a name prefix if needed, e.g. we can do

race = pd.get_dummies(males.ethn, prefix="race")

Finally, for a number of statistical models we want to include all dummies but one: the omitted one serves as the reference category. One can drop it manually, or alternatively use the drop_first option:

race.drop("race_other", axis=1)  # 'other' as reference
##       race_black  race_hisp
## 456            0          0
## 3349           1          0
## 2032           0          0
## 973            0          0
## 2349           0          0
## 4121           0          0
## 498            0          0
## 1351           1          0
## 4065           0          0
## 3830           0          1
pd.get_dummies(males.ethn, prefix="race", drop_first=True)
# 'race_black' as reference category
##       race_hisp  race_other
## 456           0           1
## 3349          0           0
## 2032          0           1
## 973           0           1
## 2349          0           1
## 4121          0           1
## 498           0           1
## 1351          0           0
## 4065          0           1
## 3830          1           0

These data frames of dummies can now be combined into the design matrix with pd.concat.
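
For instance, a minimal sketch (assuming the race data frame from above, and using pd.concat, discussed in Section 6.3 below):

# combine wage with the dummy sets; 'other' and 'no' are the reference categories
X = pd.concat((males[["wage"]],
               race.drop("race_other", axis=1),
               pd.get_dummies(males.maried, prefix="maried", drop_first=True)),
              axis=1)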

You can also convert a whole data frame to dummies; in that case all non-numeric variables will be converted:

pd.get_dummies(males, drop_first=True)
##           wage  ethn_hisp  ethn_other  maried_yes
## 456   2.236457          0           1           0
## 3349  0.984302          0           0           1
## 2032  1.459484          0           1           0
## 973   1.943643          0           1           1
## 2349  1.432757          0           1           1
## 4121  1.634756          0           1           0
## 498   1.161928          0           1           0
## 1351  1.381506          0           0           0
## 4065  1.053912          0           1           1
## 3830  0.898706          1           0           1

Note that the name of the original variable is used as the prefix, and the result is ready to be used as a design matrix for statistical models.

Exercise 6.3 Re-load the males dataset and preserve the variable residence. What categories does it contain? Convert residence to dummies labeled like Rsouth, where you make south the reference category.

Hint: consult the documentation for pd.get_dummies to figure out how to create such column names.

See the solution

pd.get_dummies can also convert numeric variables to dummies. Imagine a dataset

df = pd.DataFrame({"name":["Cao Cao", "Guan Yu", "Yuan Shao", "Yan Liang"],
                   "education":[1, 3, 2, 2],
                   "role":["commander", "hero", "commander", "commander"]
})
df
##         name  education       role
## 0    Cao Cao          1  commander
## 1    Guan Yu          3       hero
## 2  Yuan Shao          2  commander
## 3  Yan Liang          2  commander

Here education is coded as 1: less than high school, 2: high school, 3: college. We can re-code just this variable to dummies as

pd.get_dummies(df.education)
##    1  2  3
## 0  1  0  0
## 1  0  0  1
## 2  0  1  0
## 3  0  1  0

and later combine it with other dummies (see Section Combining data into data frames). Alternatively, we can tell pd.get_dummies() that we want this variable to be converted into dummies too:

pd.get_dummies(df, columns=["education", "role"])
##         name  education_1  education_2  education_3  role_commander  role_hero
## 0    Cao Cao            1            0            0               1          0
## 1    Guan Yu            0            0            1               0          1
## 2  Yuan Shao            0            1            0               1          0
## 3  Yan Liang            0            1            0               1          0

Exercise 6.4 Load Titanic data.

  • Convert age into three categories: 0-13, 14-50, 50- (use pd.cut).
  • Convert the age categories you just created, sex, and pclass to dummies. Note: pclass is coded as a number!

See the solution

6.3 Combining data into data frames

A common task is to create a data frame from series, add new columns to a data frame, or combine data frames into a wider data frame. Some of these tasks can be achieved by creating a new variable, or alternatively by creating a data frame from existing series. However, those methods fail if you want to combine data frames instead of single series. Below we discuss pd.concat, a tool that is designed exactly with this task in mind.

6.3.1 Concatenating data with pd.concat

pd.concat can combine series and data frames in multiple ways. Let us demonstrate this through a few examples.

First we create three series, for country, capital, and population:

countries = pd.Series(["MY", "ID", "KH"])
capitals = pd.Series(["Kuala Lumpur", "Jakarta", "Phnom Penh"])
populations = pd.Series([32.7, 267.7, 15.3])

We can combine all three series into a data frame

pd.concat((countries, capitals, populations), axis=1)
##     0             1      2
## 0  MY  Kuala Lumpur   32.7
## 1  ID       Jakarta  267.7
## 2  KH    Phnom Penh   15.3

This example demonstrates the most important arguments of pd.concat. The first argument is a tuple of the data we want to concatenate (a list works too), in this case a three-tuple of all three series; we can merge as many series and data frames as we wish in one function call. We also set axis=1 to specify that we want to concatenate the series column-wise, not underneath each other.

We can also concatenate data frames. For instance, here we first concatenate the first two series into a data frame, and thereafter add an additional column to it. We use pd.concat for both steps:

df = pd.concat((countries, capitals), axis=1)
df
##     0             1
## 0  MY  Kuala Lumpur
## 1  ID       Jakarta
## 2  KH    Phnom Penh
pd.concat((df, populations), axis=1)
##     0             1      0
## 0  MY  Kuala Lumpur   32.7
## 1  ID       Jakarta  267.7
## 2  KH    Phnom Penh   15.3

If the data frames are compatible, then we can also combine them row-wise, underneath each other, by specifying axis=0.
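
For instance, a small sketch appending a hypothetical extra row to the data frame df from above:

extra = pd.DataFrame({0: ["VN"], 1: ["Hanoi"]})  # hypothetical extra country
pd.concat((df, extra), axis=0, ignore_index=True)  # ignore_index renumbers the rows
##     0             1
## 0  MY  Kuala Lumpur
## 1  ID       Jakarta
## 2  KH    Phnom Penh
## 3  VN         Hanoi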

It is important to realize that when concatenating data horizontally, pd.concat does not combine observations by row position but by index. For instance, if we pick different indices for countries and for capitals, then countries will not be aligned with capitals:

countries = pd.Series(["MY", "ID", "KH"], index=range(3))
capitals = pd.Series(["Kuala Lumpur", "Jakarta", "Phnom Penh"],
                     index=["MY", "ID", "KH"])
pd.concat((countries, capitals), axis=1)
##       0             1
## 0    MY           NaN
## 1    ID           NaN
## 2    KH           NaN
## ID  NaN       Jakarta
## KH  NaN    Phnom Penh
## MY  NaN  Kuala Lumpur

As the capitals series does not have elements with index 0, 1 and 2, those capitals will be missing. In a similar fashion, the countries for indices MY, ID and KH are missing, and hence we have missing country codes for these capitals. Never mind that these index values match the values of the countries series; alignment happens by index only. The index is visible as the first column of the data frame display and indicates clearly what the problem is.

A simple solution to get such data aligned is the reset_index method. This replaces the existing index with a new one, an integer sequence from 0 to the number of cases minus one:

pd.concat((countries, capitals.reset_index()), axis=1)
##     0 index             0
## 0  MY    MY  Kuala Lumpur
## 1  ID    ID       Jakarta
## 2  KH    KH    Phnom Penh

capitals.reset_index() forces the index of capitals to be 0, 1, 2, just like that of countries. Obviously, if you discard the original index then you have to ensure the observations in both data frames are in the correct order.
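
Note the extra index column in the result above: reset_index turns the old index into an ordinary column. If we do not need it, the drop=True argument discards it:

pd.concat((countries, capitals.reset_index(drop=True)), axis=1)
##     0             1
## 0  MY  Kuala Lumpur
## 1  ID       Jakarta
## 2  KH    Phnom Penh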

Obviously, there are other ways to achieve the same result, e.g. directly assigning a matching index to one of the series (capitals.index = countries.index). A more general way is to use merge, which does not require the order of the data to be the same in each series.

Exercise 6.5 Load the males dataset and preserve the variables wage, residence, ethn.

  • Convert just the variable residence to dummies with “North East” as the reference category.
  • Convert ethn to dummies with “other” as the reference category.
  • Combine wage and the two sets of dummies you created into a new data frame using pd.concat.

See the solution