Chapter 5 Descriptive Analysis with Pandas
Here we focus on basic analysis, concerned with evaluating data quality and missing values. For descriptive statistics, see Section Descriptive Statistics.
We assume you have loaded pandas and numpy as
import numpy as np
import pandas as pd
5.1 First steps: know your data
Next, we work through some examples of exploratory data analysis with pandas. Let's load the Titanic data here, as it allows us to demonstrate more features. Note that pd.read_csv can read compressed data formats directly; there is no need to decompress them first.
titanic = pd.read_csv("../data/titanic.csv.bz2")
We read the compressed titanic data file directly into pandas. As the file is comma-separated, we do not have to specify a separator: comma is the default.
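If the file used a different separator, we would pass it explicitly via the sep argument. A minimal sketch, assuming a hypothetical tab-separated file titanic.tsv:
## hypothetical tab-separated file; sep overrides the default comma
titanic_tab = pd.read_csv("../data/titanic.tsv", sep="\t")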
As a first step, we may query the size of the dataset. Data frames have a .shape attribute, exactly as numpy arrays:
titanic.shape
## (1309, 14)
Shape tells us that the dataset contains 1309 rows and 14 columns.
Next, we may want to take a quick look at a few rows of the data. The .head() and .tail() methods are useful for looking at the first and last few rows, while .sample() extracts a few random rows:
titanic.head()
## pclass survived name ... boat body home.dest
## 0 1 1 Allen, Miss. Elisabeth Walton ... 2 NaN St Louis, MO
## 1 1 1 Allison, Master. Hudson Trevor ... 11 NaN Montreal, PQ / Chesterville, ON
## 2 1 0 Allison, Miss. Helen Loraine ... NaN NaN Montreal, PQ / Chesterville, ON
## 3 1 0 Allison, Mr. Hudson Joshua Creighton ... NaN 135.0 Montreal, PQ / Chesterville, ON
## 4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) ... NaN NaN Montreal, PQ / Chesterville, ON
##
## [5 rows x 14 columns]
titanic.tail(3)
## pclass survived name sex age sibsp ... fare cabin embarked boat body home.dest
## 1306 3 0 Zakarian, Mr. Mapriededer male 26.5 0 ... 7.225 NaN C NaN 304.0 NaN
## 1307 3 0 Zakarian, Mr. Ortin male 27.0 0 ... 7.225 NaN C NaN NaN NaN
## 1308 3 0 Zimmerman, Mr. Leo male 29.0 0 ... 7.875 NaN S NaN NaN NaN
##
## [3 rows x 14 columns]
titanic.sample(4)
## pclass survived name sex age ... cabin embarked boat body home.dest
## 1218 3 0 Somerton, Mr. Francis William male 30.0 ... NaN S NaN NaN NaN
## 609 3 0 Adams, Mr. John male 26.0 ... NaN S NaN 103.0 Bournemouth, England
## 138 1 0 Graham, Mr. George Edward male 38.0 ... C91 S NaN 147.0 Winnipeg, MB
## 225 1 0 Payne, Mr. Vivian Ponsonby male 23.0 ... B24 S NaN NaN Montreal, PQ
##
## [4 rows x 14 columns]
Exercise 5.1 Extract the 1000th row from the data. What is the name of the person? Did (s)he survive? See the solution
We may also want to know all the variable names in the data:
titanic.columns
## Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
## 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
## dtype='object')
The attribute .columns lists the variable names, but in the form of a special Index data structure. This may be enough for e.g. looping over variables, but we may also convert it to a list as
list(titanic.columns)
## ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
Now is the time to consult the documentation, if it exists. Below we work with the following variables:
- pclass: passenger class (1, 2, 3)
- survived: whether the passenger survived (1) or not (0)
- sex: passenger sex ("female" or "male")
- age: age in years
- boat: boat number where the passenger was rescued.
We also check if the actual coding corresponds to the documentation.
Next, let us check the data types of the data. Although some variables, e.g. pclass (passenger class) and survived, look numeric, they may actually be strings. This can be queried with the .dtypes attribute:
titanic.dtypes
## pclass int64
## survived int64
## name object
## sex object
## age float64
## sibsp int64
## parch int64
## ticket object
## fare float64
## cabin object
## embarked object
## boat object
## body float64
## home.dest object
## dtype: object
The data types tell us that both pclass and survived are integers while name and sex are "objects". Here this means strings; in principle they may also be more complex objects.
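If in doubt, we can check the python type of an individual value; a quick sketch (the value is taken from the head of the data shown above):
## 'object' columns typically contain python strings
type(titanic.sex[0])
## <class 'str'>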
5.2 What are the values?
One of the first steps to take with a new dataset is to check what the values of the relevant variables are. The pandas library contains a lot of tools for descriptive data analysis. For categorical variables we usually want to see the explicit values; for numeric ones we may check the minimum and maximum values.
5.2.1 Different discrete values
What are the possible values for sex? A quick look at the data suggests these are male and female, but are there any others? We can query that with the .unique() method:
titanic.sex.unique()
## array(['female', 'male'], dtype=object)
The result tells us that all passengers in this data are categorized as “male” or “female”. This looks perfectly plausible. We can also see a technical detail, namely that the result is returned as a numpy array.
A word about methods and method chaining. Let's revisit the previous command, titanic.sex.unique(). This contains three components, which are applied in sequential order:
- titanic is the dataset. This is what we start with.
- .sex is an attribute of titanic; here it means to extract the sex variable from the dataset. It results in a series.
- Finally, .unique() is a method: it takes whatever is at its left side (here the series sex), and works with that (here it returns an array of unique values).
Some of the methods can be applied either to the whole data frame or to individual variables. If applied to the whole data frame, they apply to every single variable separately, pretty much as if looping over the individual variables and applying the method to each of them.
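As an illustration (using .count, a method not discussed above, which counts non-missing entries), the same call works on a single variable and on the whole data frame:
## number of non-missing values in one variable
titanic.age.count()
## number of non-missing values in every variable separately
titanic.count()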
But what about boats? ("boat" means the lifeboat in which the passenger was found.)
titanic.boat.unique()
## array(['2', '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A', '5',
## '7', 'C', '14', '5 9', '13', '1', '15', '5 7', '8 10', '12', '16',
## '13 15 B', 'C D', '15 16', '13 15'], dtype=object)
This list gives a much more complex picture. There is a plethora of boat ids, including numbers and letters. For a number of cases the data also contains multiple boat ids, perhaps because the rescuers out on the sea had more important things to do than to record the boat numbers. And nan tells us that not everyone was fortunate enough to get to a boat.
We can also easily query the number of unique values (using the .nunique method):
titanic.boat.nunique()
## 27
We can see that there are 27 different boat codes.
But note that this is not the same as the number of distinct values returned by .unique:
len(titanic.boat.unique())
## 28
The discrepancy is there because .unique also includes missings, while .nunique does not count them.
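If desired, .nunique can be told to include missings in the count through its dropna argument; in that case the two numbers agree:
## count distinct values, treating NaN as a value of its own
titanic.boat.nunique(dropna=False)
## 28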
We can get a frequency table with .value_counts():
titanic.sex.value_counts()
## male 843
## female 466
## Name: sex, dtype: int64
We see that there are almost twice as many males as females in the data. This is plausible as well. (See also Section 7.1.)
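If we prefer shares instead of counts, .value_counts also accepts a normalize argument; a small sketch:
## relative frequencies (proportions) instead of counts
titanic.sex.value_counts(normalize=True)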
5.2.2 Continuous values
For continuous variables, we do not want to count and print individual values. Instead, we may choose to just check the minimum and maximum value. This gives a quick overview of whether the values are feasible, e.g. whether we see negative ages, or implausibly large age values:
titanic.age.min(), titanic.age.max()
## (0.1667, 80.0)
We can see that the youngest passenger was 0.167 years (= 2 months) and the oldest passenger was 80 years old. Both of these values are plausible and hence we are not worried about any too large or too small values. See Section 6.1 below for how to detect and analyze missing values.
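Both numbers can also be requested in a single call through .agg; a minimal sketch:
## minimum and maximum of age in one go
titanic.age.agg(["min", "max"])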
Another common way to get an idea about a numeric variable is to compute its average. Let us compute the mean survival rate. Remember: survived = 1 means someone survived, and survived = 0 means the person went down with the Titanic. We can compute the average with the .mean() method:
titanic.survived.mean()
## 0.3819709702062643
We see that less than 40% of passengers in this dataset survived.
This figure may not be what you want because of missing values. See Section 6.1.4 below for explanations, and Section 6.1 below for how to count missing values.
Exercise 5.2 Consider methods .min, .mean, .unique and .nunique. Which of these can be applied to the whole data frame? See the tip in Section 5.2.1.
See the solution
5.2.3 Counts and proportions
Another common task we face is to count, or to compute proportions of, certain values, such as the number of women, or the percentage of children in the data. The .value_counts method (see Section 5.2.1 above) can be used to get counts of discrete values, e.g. the number of women. But you cannot easily use .value_counts to compute the number of children: there is no such category as "children", and hence there are no values to count. Instead, we should create a logical vector that describes the condition, and sum it. Let's find the number of children on the ship using this approach.
Assume children are passengers who are less than 14 years old. Counting can be done explicitly as:
## create a logical variable 'children'
children = titanic.age < 14
## count children
children.sum()
## 99
So there were 99 children on the Titanic. Let's explain these steps in more detail. First, the line
children = titanic.age < 14
creates a logical variable children, containing Trues and Falses, telling for each passenger whether they were a child or not. A sample of this variable looks like
children.head(5)
## 0 False
## 1 True
## 2 True
## 3 False
## 4 False
## Name: age, dtype: bool
So the second and third passengers were children, while the others in the sample here were not.
Next, children.sum() adds these values together. Remember: as soon as we start doing math with logical values, True will be converted to "1" and False to "0", and the latter will not matter when adding. So we are effectively counting children here. In practice, you do not need to create a separate variable,
(titanic.age < 14).sum()
## 99
will usually suffice.
Computing proportions can be done in a similar way. Just instead of using .sum, we'll use .mean():
(titanic.age < 14).mean()
## 0.07563025210084033
Hence 7.56 percent of the passengers were children.
This figure may not be what you mean by the proportion of children because of missing values. See Section 6.1.4 below for explanations.
What happens here is similar to counting: when computing a proportion, we need to divide the count by the total number of cases. You can do this explicitly, but .mean() is easier, and it is also easier to read.
5.3 Filtering data
Finally, let's get the subset of passengers who were in either boat "A" or boat "B". This can be done using the .isin() method to check if a string is in a given list:
ab = titanic[titanic.boat.isin(["A", "B"])]
ab.shape
## (20, 14)
ab.head(10)
## pclass survived name ... boat body home.dest
## 14 1 1 Barkworth, Mr. Algernon Henry Wilson ... B NaN Hessle, Yorks
## 19 1 0 Beattie, Mr. Thomson ... A NaN Winnipeg, MN
## 136 1 1 Gracie, Col. Archibald IV ... B NaN Washington, DC
## 235 1 1 Rheims, Mr. George Alexander Lucien ... A NaN Paris / New York, NY
## 295 1 1 Thayer, Mr. John Borland Jr ... B NaN Haverford, PA
## 317 1 1 Williams, Mr. Richard Norris II ... A NaN Geneva, Switzerland / Radnor, PA
## 503 2 1 Mellors, Mr. William John ... B NaN Chelsea, London
## 603 3 1 Abbott, Mrs. Stanton (Rosa Hunt) ... A NaN East Providence, RI
## 605 3 1 Abelseth, Mr. Olaus Jorgensen ... A NaN Perkins County, SD
## 630 3 1 Andersson, Mr. August Edvard ("Wennerstrom") ... A NaN NaN
##
## [10 rows x 14 columns]
One can see that these two boats together only took 20 passengers…
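The same subset can be selected without .isin, using an explicit element-wise comparison; a sketch of the equivalent condition (its .shape should again be (20, 14)):
## equivalent filter with '==' and the element-wise 'or' operator '|'
ab2 = titanic[(titanic.boat == "A") | (titanic.boat == "B")]
ab2.shape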
5.4 Selecting variables
Typical datasets we encounter contain variables that are irrelevant for the current analysis. In such cases we may want to either select only the relevant ones (if there are only a few of them), or remove the irrelevant ones (if there are not too many of those). Removing irrelevant information helps us to see the data better when printing it, to understand it better, and hence to make fewer coding errors. It also helps with certain transformations where we may now want to operate on the whole narrowed-down data frame instead of selected variables only. Finally, it also helps to conserve memory and speed up processing.
5.4.1 Selecting desired variables
We discussed how to select the desired variables in Section 3.3.1, but we briefly review it here too. You can select only the desired variables with dataframe[["var1", "var2", ...]]. For instance, let's focus on only sex, pclass, and survived:
"sex", "pclass", "survived"]].sample(4) titanic[[
## sex pclass survived
## 1074 male 3 0
## 556 male 2 0
## 1150 male 3 0
## 30 male 1 0
Note that the columns are returned in the order they are listed in the selection, not in the original order.
Instead of loading the complete dataset and selecting the desired variables later, we can also specify which columns to read with pd.read_csv:
"../data/titanic.csv.bz2",
pd.read_csv(=["sex", "pclass", "survived"]).sample(4) usecols
## pclass survived sex
## 274 1 1 male
## 747 3 0 male
## 1185 3 0 male
## 950 3 0 male
This achieves a similar result, although the columns now are in the original order, not in the specified order.
5.4.2 Removing undesired variables
Alternatively, if we want to keep most of the variables, we may instead specify which ones to remove. This can be done with the .drop method. Below we drop a large number of variables:
"name", "sibsp", "boat", "parch", "cabin", "embarked",
titanic.drop(["home.dest", "ticket", "body"],
=1).sample(4) axis
## pclass survived sex age fare
## 106 1 0 male NaN 221.7792
## 822 3 0 male 41.0 7.8500
## 910 3 0 male 17.0 7.1250
## 187 1 1 female 16.0 39.4000
Note that we need to tell .drop that we are talking about removing columns with these names, not about rows with this index. This is what the argument axis=1 does. Otherwise we get an error:
"name", "sibsp", "boat", "parch", "embarked",
titanic.drop(["home.dest"]).sample(4)
## KeyError: "['name' 'sibsp' 'boat' 'parch' 'embarked' 'home.dest'] not found in axis"
In particular, this means that .drop will cause an error if we want to remove a non-existent column. This may cause problems when overwriting data in notebooks: in the first pass you remove a variable, and when you re-run the same cell again, the variable is gone and you get an error. As always, either rename the data, avoid re-running the data cleaning step, or incorporate error handling.
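One way to incorporate such error handling is the errors argument of .drop, which skips names that are not present; a minimal sketch (the second column name is made up for illustration):
## errors="ignore" silently skips columns that do not exist (any more)
titanic.drop(["body", "no-such-column"], axis=1, errors="ignore").head(2)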
Exercise 5.3 You want to plot a histogram of the age distribution for Titanic passengers, separately for males and females.
- What are the relevant variable names?
- How will you select the relevant columns?
See the solution
5.5 Grouped operations
Like other major dataframe libraries, pandas supports grouped operations. Grouped operations are performed essentially like this: first, the data is split into groups by the values of the grouping variable. Second, the requested operations are performed on each group individually, independently of the other groups. And finally, all the results are put together again, together with the group indicators.
Grouping in pandas can be done with the .groupby method. It expects a list of grouping variables. We demonstrate this by computing the survival rate by gender (gender is in variable sex and survival is in survived). By itself, .groupby just creates a grouped data frame that unfortunately does not have any useful printing method:
"sex") titanic.groupby(
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x75e083542ce0>
But we can use the grouped data frame for other operations; in this case we compute the mean of survived:
"sex").survived.mean() titanic.groupby(
## sex
## female 0.727468
## male 0.190985
## Name: survived, dtype: float64
The result is a series with two values (the numeric survival rates), with the index being the group values, female and male. In this example we see that the female survival rate was about 73% while only 19% of men survived the disaster.
We can also group by more than one variable. In that case we have to supply those as a list, e.g. in order to compute the survival rate by gender and passenger class (variable pclass) we can do
"sex", "pclass"]).survived.mean() titanic.groupby([
## sex pclass
## female 1 0.965278
## 2 0.886792
## 3 0.490741
## male 1 0.340782
## 2 0.146199
## 3 0.152130
## Name: survived, dtype: float64
We see that passenger class was also a very important determinant of survival.
.groupby supports more options, e.g. one can keep the group indicators as variables instead of the index.
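For instance, the as_index argument keeps the grouping variables as ordinary columns instead of moving them into the index; a minimal sketch:
## keep 'sex' and 'pclass' as columns in the result
titanic.groupby(["sex", "pclass"], as_index=False).survived.mean()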
5.6 String operations
Pandas opens string variables up to a large list of string functions through the .str attribute. These largely replicate the re module, but the syntax is different, and often the function names are different too.
We walk through a number of examples here, namely:
- str.contains to search for patterns in strings
- str.match to find whether a string begins with a given pattern
- str.replace to replace parts of strings
Many of the string functions take advantage of regular expressions, so it is useful to have an idea of regular expression basics.
Let’s use Titanic data and analyze whether there is a regular pattern of cabin codes and passenger class. First, what are the existing cabin numbers?
titanic.cabin.unique()
## array(['B5', 'C22 C26', 'E12', 'D7', 'A36', 'C101', nan, 'C62 C64', 'B35',
## 'A23', 'B58 B60', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99',
## 'C52', 'T', 'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4',
## 'E40', 'B38', 'E24', 'B51 B53 B55', 'B96 B98', 'C46', 'E31', 'E8',
## 'B61', 'B77', 'A9', 'C89', 'A14', 'E58', 'E49', 'E52', 'E45',
## 'B22', 'B26', 'C85', 'E17', 'B71', 'B20', 'A34', 'C86', 'A16',
## 'A20', 'A18', 'C54', 'C45', 'D20', 'A29', 'C95', 'E25', 'C111',
## 'C23 C25 C27', 'E36', 'D34', 'D40', 'B39', 'B41', 'B102', 'C123',
## 'E63', 'C130', 'B86', 'C92', 'A5', 'C51', 'B42', 'C91', 'C125',
## 'D10 D12', 'B82 B84', 'E50', 'D33', 'C83', 'B94', 'D49', 'D45',
## 'B69', 'B11', 'E46', 'C39', 'B18', 'D11', 'C93', 'B28', 'C49',
## 'B52 B54 B56', 'E60', 'C132', 'B37', 'D21', 'D19', 'C124', 'D17',
## 'B101', 'D28', 'D6', 'D9', 'B80', 'C106', 'B79', 'C47', 'D30',
## 'C90', 'E38', 'C78', 'C30', 'C118', 'D36', 'D48', 'D47', 'C105',
## 'B36', 'B30', 'D43', 'B24', 'C2', 'C65', 'B73', 'C104', 'C110',
## 'C50', 'B3', 'A24', 'A32', 'A11', 'A10', 'B57 B59 B63 B66', 'C28',
## 'E44', 'A26', 'A6', 'A7', 'C31', 'A19', 'B45', 'E34', 'B78', 'B50',
## 'C87', 'C116', 'C55 C57', 'D50', 'E68', 'E67', 'C126', 'C68',
## 'C70', 'C53', 'B19', 'D46', 'D37', 'D26', 'C32', 'C80', 'C82',
## 'C128', 'E39 E41', 'D', 'F4', 'D56', 'F33', 'E101', 'E77', 'F2',
## 'D38', 'F', 'F G63', 'F E57', 'F E46', 'F G73', 'E121', 'F E69',
## 'E10', 'G6', 'F38'], dtype=object)
We can see that the cabin code is typically a letter, followed by two or three digits. We can guess that the letter denotes the deck, and the number is the cabin number on that deck. In several cases, however, there are apparently certain data entry errors, manifested by just a single-letter code, or multiple codes for a single person. (Although passengers may also have booked more than one cabin.)
Assume everyone only has a single cabin. We’ll assign a single cabin code to each passenger. This will be the existing single cabin if coded in that way. If the value of cabin contains multiple codes, then we take the first valid code (i.e. the one in the form of letter + digits).
5.6.1 Find patterns in strings
As the first step, we find the problematic cabin codes. In most cases these contain a space, but there are also single-letter codes which are presumably wrong. We can identify patterns in strings using str.contains(pattern). This function returns a vector of Trues/Falses, depending on whether the original string contains the pattern:
str.contains(" ", na=False, regex=False).head(6) titanic.cabin.
## 0 False
## 1 True
## 2 True
## 3 True
## 4 True
## 5 False
## Name: cabin, dtype: bool
The argument na tells pandas what to do with missing values; here we force them to return False. regex=False means that the pattern we supply is not a regular expression. Plain patterns are simpler and faster to use than regular expressions, but the latter are much more powerful.
We see that neither the first nor the sixth cabin contains a space, but the second through fifth cases do. This can be confirmed by looking at the cabin values:
titanic.cabin.head(6)
## 0 B5
## 1 C22 C26
## 2 C22 C26
## 3 C22 C26
## 4 C22 C26
## 5 E12
## Name: cabin, dtype: object
We can also check how many such problematic cases we have:
str.contains(" ", na=False, regex=False).value_counts() titanic.cabin.
## False 1268
## True 41
## Name: cabin, dtype: int64
We can see we have 41 incorrectly filled cabin numbers.
Note that:
- First, you have to use the .str attribute to open a column up for string operations; plain .cabin.contains() does not work.
- .contains returns a logical value, or NA in case of a missing entry. If we want to use this for indexing later, we want to exclude the missing entries, and hence mark the missings as False (na=False).
- The first argument (pat) is normally a regular expression. However, here we can do with a simple pattern and no regular expression is needed. Hence we say regex=False.
5.6.2 Match a pattern
Next, we use a regular expression that identifies both types of errors: a space in the code, and a missing number. Instead of looking for strings that contain a space, we look for strings that match the valid pattern: a single letter followed by digits. We demonstrate this using str.match, a function similar to str.contains, except that it checks whether the beginning of the string matches the pattern (str.contains looks for the pattern anywhere in the string). This is fortunately easy to do using regular expressions; the pattern can be coded as r"\w\d+$":
str.match(r"\w\d+$", na=False).value_counts() titanic.cabin.
## False 1061
## True 248
## Name: cabin, dtype: int64
str.match always assumes we have a regular expression, so there is no regex argument. The "r" in front of the expression means it is a "raw" string, i.e. the backslashes in the string are interpreted literally, not as special codes. Inside the expression we have \w for a letter (word character), followed by \d+ for one or more digits. The final $ means that the string must end here, i.e. it must not contain extra spaces or other symbols (such as another cabin code).
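To see the pattern in isolation, we can try it on a few hand-picked codes (a toy series built just for illustration; only the first one should match):
## a valid code, a double code, and a single-letter code
pd.Series(["B5", "C22 C26", "T"]).str.match(r"\w\d+$")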
Returning to the value_counts result above: it looks quite different from the one in the previous section, with 248 matching cases compared to the 41 cases we flagged earlier. Note, however, an important difference between the two approaches: when looking for codes that contain a space, we were flagging wrong codes, whereas now we are flagging valid codes. By specifying na=False above, we told pandas to treat missings as non-matching, i.e. as invalid codes. As we are only interested in invalid codes, not in missings, we count missings as valid (na=True) in the example below.
Here is an example of
mismatched cabin codes:
i = titanic.cabin.str.match(r"\w\d+$", na=True)
titanic.cabin[~i].sample(6)
## 140 D10 D12
## 50 B51 B53 B55
## 994 F E46
## 56 B96 B98
## 941 F E57
## 51 B51 B53 B55
## Name: cabin, dtype: object
5.6.3 Replacing strings
Finally, let’s replace the full cabin number with only the deck code.
This can be done by replacing everything that follows the first letter
by an empty string. str.replace
takes two arguments: pattern and
replacement. We specify the former as r"(\w).*"
, i.e. a letter,
followed by an unspecified number of other symbols, and replace it
with just the same letter:
str.replace(r"(\w).*", r"\1").sample(6) titanic.cabin.
## <string>:1: FutureWarning: The default value of regex will change from True to False in a future version.
## 333 D
## 863 NaN
## 1228 NaN
## 1047 NaN
## 953 NaN
## 766 NaN
## Name: cabin, dtype: object
The result contains valid deck codes (and many missings).
In practical terms, it is often useful to print the original and modified codes side by side to see that the coding was done right. We can achieve this by creating a new variable in the data frame, and printing out a sample of the old and new variables (see the section on concatenating data with pd.concat for another approach):
"deck"] = titanic.cabin.str.replace(r"(\w).*", r"\1")
titanic["cabin", "deck"]].dropna().sample(6) titanic[[
## cabin deck
## 145 D33 D
## 514 F2 F
## 222 D43 D
## 183 B101 B
## 21 D35 D
## 1024 F G73 F
We can see that the deck codes (the ones that are not missing) are done correctly.
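As an aside, because we only keep the first character, the same deck code can be extracted with plain string indexing through .str, avoiding the regular expression (and the FutureWarning) altogether; a sketch of the alternative:
## first character of each cabin code; missing cabins stay missing
titanic["deck"] = titanic.cabin.str[0]
## or keep the regex but silence the warning by being explicit
titanic["deck"] = titanic.cabin.str.replace(r"(\w).*", r"\1", regex=True)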