Chapter 6 Cleaning and Manipulating Data
This section explains and demonstrates certain data cleaning and preparation tasks using pandas. The task here is mostly to introduce you to various useful functions and show how to solve common task. We do not talk much about any fundamental data processing problem.
We use the following libraries:
## ModuleNotFoundError: No module named 'pandas'
## ModuleNotFoundError: No module named 'matplotlib'
6.1 Missing observations
In most datasets we encounter some values are missing. Such missing values, often referred to as just missings, will destroy all computations we do with data. Obviously, whatever we want to achieve, we cannot do this if the data is not there. We have to learn to handle missings and live with missings.
6.1.1 How is missing data coded
Python has a dedicated data type for missing–np.nan. Internally,
this corresponds to IEEE not-a-number code, and if a nan enters
computations, the result will in most cases be missing too. Note the
two implications of python having just a single type of missing value:
- As
np.nanis numeric, a series with missing value must be numeric, or of generic type object. For instance, when adding a missing value to a logical series, the result will be of type _object. - In languages, such as R, there where there are both “not-a-number” (NaN) and “not available” (NA) values, NaN is typically used to denote invalid mathematical results (such as log of negative number), and NA is used to denote for missing data. If such a distinction is needed, one can add a separate variable and encode the type and reason for missingness there.
But not all missing values are coded using np.nan. For instance, in
social sciences, it is customary to denote missings as negative
numbers. World Value Survey codes all missing responses as -1: don’t
know; -2: no answer; -3: not applicable; -4 not asked in survey; -5:
missing, unknown. Other surveys may use positive out-of-range values,
e.g. missing income may be coded as 90000 or 99999 or something
similar. It is important to be aware of the coding of missings in the
current data.
In practice, one also encounters missings that are not specifically coded. For instance, one may see age “200”. This is clearly wrong (as long as we talk about human age in years), and the documentation may be quiet about such values. We typically do not know why such values are there. Perhaps it is typo in “20”, or perhaps it is some sort of undocumented coding.
Below we walk through the main tools in pandas and numpy that help to
identify, remove, or replace missing values. However, as the
dedicated tools only work with np.nan codes, we also give examples
about how to handle custom codes and data entry errors.