Chapter 6 Cleaning and Manipulating Data

This section explains and demonstrates certain data cleaning and preparation tasks using pandas. The task here is mostly to introduce you to various useful functions and show how to solve common task. We do not talk much about any fundamental data processing problem.

We use the following libraries:

import numpy as np
import pandas as pd
## ModuleNotFoundError: No module named 'pandas'
import matplotlib.pyplot as plt
## ModuleNotFoundError: No module named 'matplotlib'

6.1 Missing observations

In most datasets we encounter some values are missing. Such missing values, often referred to as just missings, will destroy all computations we do with data. Obviously, whatever we want to achieve, we cannot do this if the data is not there. We have to learn to handle missings and live with missings.

6.1.1 How is missing data coded

Python has a dedicated data type for missing–np.nan. Internally, this corresponds to IEEE not-a-number code, and if a nan enters computations, the result will in most cases be missing too. Note the two implications of python having just a single type of missing value:

  • As np.nan is numeric, a series with missing value must be numeric, or of generic type object. For instance, when adding a missing value to a logical series, the result will be of type _object.
  • In languages, such as R, there where there are both “not-a-number” (NaN) and “not available” (NA) values, NaN is typically used to denote invalid mathematical results (such as log of negative number), and NA is used to denote for missing data. If such a distinction is needed, one can add a separate variable and encode the type and reason for missingness there.

But not all missing values are coded using np.nan. For instance, in social sciences, it is customary to denote missings as negative numbers. World Value Survey codes all missing responses as -1: don’t know; -2: no answer; -3: not applicable; -4 not asked in survey; -5: missing, unknown. Other surveys may use positive out-of-range values, e.g. missing income may be coded as 90000 or 99999 or something similar. It is important to be aware of the coding of missings in the current data.

In practice, one also encounters missings that are not specifically coded. For instance, one may see age “200”. This is clearly wrong (as long as we talk about human age in years), and the documentation may be quiet about such values. We typically do not know why such values are there. Perhaps it is typo in “20”, or perhaps it is some sort of undocumented coding.

Below we walk through the main tools in pandas and numpy that help to identify, remove, or replace missing values. However, as the dedicated tools only work with np.nan codes, we also give examples about how to handle custom codes and data entry errors.

6.1.2 Removing missing observations

6.1.2.1 Handling np.nan-s

We demonstrate working with missings using Titanic data:

titanic = pd.read_csv("../data/titanic.csv.bz2")
## NameError: name 'pd' is not defined
titanic.shape
## NameError: name 'titanic' is not defined