Chapter 5 Descriptive Analysis with Pandas
Here we focus on basic analysis, concerned with evaluating data quality and missing values. For descriptive statistics, see Section Descriptive Statistics.
We assume you have loaded pandas and numpy as
import numpy as np
import pandas as pd
5.1 First steps: know your data
Next, we work through some examples of exploratory data analysis with pandas. Let's load the Titanic data here, as it allows us to demonstrate more features. Note that pd.read_csv can read compressed data formats directly; there is no need to decompress them first.
titanic = pd.read_csv("../data/titanic.csv.bz2")
We read the compressed titanic data file directly into pandas. As the file is comma-separated, we do not have to specify a separator: comma is the default.
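If the file used a different separator, we would pass it explicitly via the sep argument. A minimal sketch, assuming a hypothetical tab-separated file titanic.tsv:
## hypothetical tab-separated file; sep overrides the default comma
titanic_tab = pd.read_csv("../data/titanic.tsv", sep="\t")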
As a first step, we may query the size of the dataset. Data frames have a .shape attribute, exactly as numpy arrays:
titanic.shape
## (1309, 14)
Shape tells us that the dataset contains 1309 rows and 14 columns.
Next, we may want to take a quick look at a few rows of the data. The .head() and .tail() methods are useful for looking at the first and last few rows, while .sample() extracts a few random rows:
titanic.head()
## pclass survived name ... boat body home.dest
## 0 1 1 Allen, Miss. Elisabeth Walton ... 2 NaN St Louis, MO
## 1 1 1 Allison, Master. Hudson Trevor ... 11 NaN Montreal, PQ / Chesterville, ON
## 2 1 0 Allison, Miss. Helen Loraine ... NaN NaN Montreal, PQ / Chesterville, ON
## 3 1 0 Allison, Mr. Hudson Joshua Creighton ... NaN 135.0 Montreal, PQ / Chesterville, ON
## 4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) ... NaN NaN Montreal, PQ / Chesterville, ON
##
## [5 rows x 14 columns]
titanic.tail(3)
## pclass survived name sex age sibsp ... fare cabin embarked boat body home.dest
## 1306 3 0 Zakarian, Mr. Mapriededer male 26.5 0 ... 7.225 NaN C NaN 304.0 NaN
## 1307 3 0 Zakarian, Mr. Ortin male 27.0 0 ... 7.225 NaN C NaN NaN NaN
## 1308 3 0 Zimmerman, Mr. Leo male 29.0 0 ... 7.875 NaN S NaN NaN NaN
##
## [3 rows x 14 columns]
titanic.sample(4)
## pclass survived name sex age ... cabin embarked boat body home.dest
## 1218 3 0 Somerton, Mr. Francis William male 30.0 ... NaN S NaN NaN NaN
## 609 3 0 Adams, Mr. John male 26.0 ... NaN S NaN 103.0 Bournemouth, England
## 138 1 0 Graham, Mr. George Edward male 38.0 ... C91 S NaN 147.0 Winnipeg, MB
## 225 1 0 Payne, Mr. Vivian Ponsonby male 23.0 ... B24 S NaN NaN Montreal, PQ
##
## [4 rows x 14 columns]
Exercise 5.1 Extract the 1000th row from the data. What is the name of the person? Did (s)he survive? See the solution
We may also want to know all the variable names in the data:
titanic.columns
## Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
## 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
## dtype='object')
The attribute .columns lists the variable names, but in the form of a special Index data structure. This may be enough for e.g. looping over variables, but we may also convert it to a list as
list(titanic.columns)
## ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
Now is the time to consult the documentation, if it exists. Below we work with the following variables:
- pclass: passenger class (1, 2, 3)
- survived: whether the passenger survived (1) or not (0)
- sex: passenger sex ("female" or "male")
- age: age in years
- boat: boat number where the passenger was rescued.
We also check if the actual coding corresponds to the documentation.
Next, let us check the data types of the data. Although some variables, e.g. pclass (passenger class) and survived, look numeric, they may actually be strings. This can be queried with the .dtypes attribute:
titanic.dtypes
## pclass int64
## survived int64
## name object
## sex object
## age float64
## sibsp int64
## parch int64
## ticket object
## fare float64
## cabin object
## embarked object
## boat object
## body float64
## home.dest object
## dtype: object
The data types tell us that both pclass and survived are integers while name and sex are "objects". Here this means strings; in principle they may also be more complex objects.
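If in doubt, we can check the python type of an individual value; a quick sketch (the value is taken from the head of the data shown above):
## 'object' columns typically contain python strings
type(titanic.sex[0])
## <class 'str'>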
5.2 What are the values?
One of the first steps to take with a new dataset is to check what the values of the relevant variables are. The pandas library contains a lot of tools for descriptive data analysis. For categorical variables we usually want to see the explicit values; for numeric ones we may check the minimum and maximum values.
5.2.1 Different discrete values
What are the possible values for sex? A quick look at the data suggests these are male and female, but are there any others? We can query that with the .unique() method:
titanic.sex.unique()
## array(['female', 'male'], dtype=object)
The result tells us that all passengers in this data are categorized as “male” or “female”. This looks perfectly plausible. We can also see a technical detail, namely that the result is returned as a numpy array.
A word about methods and method chaining. Let's revisit the previous command, titanic.sex.unique(). This contains three components, which are applied in sequential order:
- titanic is the dataset. This is what we start with.
- .sex is an attribute of titanic; here it means to extract the sex variable from the dataset. It results in a series.
- Finally, .unique() is a method: it takes whatever is at its left side (here the series sex), and works with that (here it returns an array of unique values).
Some of the methods can be applied either to the whole data frame or to individual variables. If applied to the whole data frame, they apply to every single variable separately, pretty much as if looping over the individual variables and applying the method to each of them.
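As an illustration (using .count, a method not discussed above, which counts non-missing entries), the same call works on a single variable and on the whole data frame:
## number of non-missing values in one variable
titanic.age.count()
## number of non-missing values in every variable separately
titanic.count()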
But what about boats? ("boat" means the lifeboat in which the passenger was found.)
titanic.boat.unique()
## array(['2', '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A', '5',
## '7', 'C', '14', '5 9', '13', '1', '15', '5 7', '8 10', '12', '16',
## '13 15 B', 'C D', '15 16', '13 15'], dtype=object)
This list gives a much more complex picture. There is a plethora of boat ids, including numbers and letters. For a number of cases the data also contains multiple boat ids, perhaps because the rescuers out on the sea had more important things to do than to record the boat numbers. And nan tells us that not everyone was fortunate enough to get to a boat.
We can also easily query the number of unique values (using the .nunique method):
titanic.boat.nunique()
## 27
We can see that there are 27 different boat codes.
But note that this is not the same as the number of distinct values returned by .unique:
len(titanic.boat.unique())
## 28
The discrepancy is there because .unique also includes missings, while .nunique does not count them.
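If desired, .nunique can be told to include missings in the count through its dropna argument; in that case the two numbers agree:
## count distinct values, treating NaN as a value of its own
titanic.boat.nunique(dropna=False)
## 28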
We can get a frequency table with .value_counts():
titanic.sex.value_counts()
## male 843
## female 466
## Name: sex, dtype: int64
We see that there are almost twice as many males as females in the data. This is plausible as well. (See also Section 7.1.)
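If we prefer shares instead of counts, .value_counts also accepts a normalize argument; a small sketch:
## relative frequencies (proportions) instead of counts
titanic.sex.value_counts(normalize=True)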
5.2.2 Continuous values
For continuous variables, we do not want to count and print individual values. Instead, we may choose to just check the minimum and maximum value. This gives a quick overview of whether the values are feasible, e.g. whether we see negative ages, or implausibly large age values:
titanic.age.min(), titanic.age.max()
## (0.1667, 80.0)
We can see that the youngest passenger was 0.167 years (= 2 months) and the oldest passenger was 80 years old. Both of these values are plausible and hence we are not worried about any too large or too small values. See Section 6.1 below for how to detect and analyze missing values.
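Both numbers can also be requested in a single call through .agg; a minimal sketch:
## minimum and maximum of age in one go
titanic.age.agg(["min", "max"])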
Another common way to get an idea about a numeric variable is to compute its average. Let us compute the mean survival rate. Remember: survived = 1 means someone survived, and survived = 0 means the person went down with the Titanic. We can compute the average with the .mean() method:
titanic.survived.mean()
## 0.3819709702062643
We see that less than 40% of passengers in this dataset survived.
This figure may not be what you want because of missing values. See Section 6.1.4 below for explanations, and Section 6.1 below for how to count missing values.
Exercise 5.2 Consider methods .min, .mean, .unique and .nunique. Which of these can be applied to the whole data frame? See the tip in Section 5.2.1.
See the solution
5.2.3 Counts and proportions
Another common task we face is to count, or to compute proportions of, certain values, such as the number of women, or the percentage of children in the data. The .value_counts method (see Section 5.2.1 above) can be used to get counts of discrete values, e.g. the number of women. But you cannot easily use .value_counts to compute the number of children: there is no such category as "children", and hence there are no values to count. Instead, we should create a logical vector that describes the condition, and sum it. Let's find the number of children on the ship using this approach.
Assume children are passengers who are less than 14 years old. Counting can be done explicitly as:
## create a logical variable 'children'
children = titanic.age < 14
## count children
children.sum()
## 99
So there were 99 children on the Titanic. Let's explain these steps in more detail. First, the line
children = titanic.age < 14
creates a logical variable children, containing Trues and Falses, telling for each passenger whether they were a child or not. A sample of this variable looks like
children.head(5)
## 0 False
## 1 True
## 2 True
## 3 False
## 4 False
## Name: age, dtype: bool
So the second and third passengers were children, while the others in the sample here were not.
Next, children.sum() adds these values together. Remember: as soon as we start doing math with logical values, True will be converted to "1" and False to "0", and the latter will not matter when adding. So we are effectively counting children here. In practice, you do not need to create a separate variable,
(titanic.age < 14).sum()
## 99
will usually suffice.
Computing proportions can be done in a similar way. Just instead of using .sum, we'll use .mean():
(titanic.age < 14).mean()
## 0.07563025210084033
Hence 7.56 percent of the passengers were children.
This figure may not be what you mean by the proportion of children because of missing values. See Section 6.1.4 below for explanations.
What happens here is similar to counting: when computing a proportion, we need to divide the count by the total number of cases. You can do this explicitly, but .mean() is easier, and it is also easier to read.
5.3 Filtering data
Finally, let's get the subset of passengers who were in either boat "A" or boat "B". This can be done using the .isin() method to check if a string is in a given list:
ab = titanic[titanic.boat.isin(["A", "B"])]
ab.shape
## (20, 14)
ab.head(10)
## pclass survived name ... boat body home.dest
## 14 1 1 Barkworth, Mr. Algernon Henry Wilson ... B NaN Hessle, Yorks
## 19 1 0 Beattie, Mr. Thomson ... A NaN Winnipeg, MN
## 136 1 1 Gracie, Col. Archibald IV ... B NaN Washington, DC
## 235 1 1 Rheims, Mr. George Alexander Lucien ... A NaN Paris / New York, NY
## 295 1 1 Thayer, Mr. John Borland Jr ... B NaN Haverford, PA
## 317 1 1 Williams, Mr. Richard Norris II ... A NaN Geneva, Switzerland / Radnor, PA
## 503 2 1 Mellors, Mr. William John ... B NaN Chelsea, London
## 603 3 1 Abbott, Mrs. Stanton (Rosa Hunt) ... A NaN East Providence, RI
## 605 3 1 Abelseth, Mr. Olaus Jorgensen ... A NaN Perkins County, SD
## 630 3 1 Andersson, Mr. August Edvard ("Wennerstrom") ... A NaN NaN
##
## [10 rows x 14 columns]
One can see that these two boats together only took 20 passengers…
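The same subset can be selected without .isin, using an explicit element-wise comparison; a sketch of the equivalent condition (its .shape should again be (20, 14)):
## equivalent filter with '==' and the element-wise 'or' operator '|'
ab2 = titanic[(titanic.boat == "A") | (titanic.boat == "B")]
ab2.shape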
5.4 Selecting variables
Typical datasets we encounter contain variables that are irrelevant for the current analysis. In such cases we may want to either select only the relevant ones (if there are only a few of them), or remove the irrelevant ones (if there are not too many of those). Removing irrelevant information helps us to see the data better when printing it, to understand it better, and hence to make fewer coding errors. It also helps with certain transformations where we may now want to operate on the whole narrowed-down data frame instead of selected variables only. Finally, it also helps to conserve memory and speed up processing.
5.4.1 Selecting desired variables
We discussed how to select the desired variables in Section 3.3.1, but we briefly review it here too. You can select only the desired variables with dataframe[["var1", "var2", ...]]. For instance, let's focus on only sex, pclass, and survived:
"sex", "pclass", "survived"]].sample(4) titanic[[
## sex pclass survived
## 1074 male 3 0
## 556 male 2 0
## 1150 male 3 0
## 30 male 1 0
Note that the columns are returned in the order they are listed in the selection, not in the original order.
Instead of loading the complete dataset and selecting the desired variables later, we can also specify which columns to read with pd.read_csv:
"../data/titanic.csv.bz2",
pd.read_csv(=["sex", "pclass", "survived"]).sample(4) usecols
## pclass survived sex
## 274 1 1 male
## 747 3 0 male
## 1185 3 0 male
## 950 3 0 male
This achieves a similar result, although the columns now are in the original order, not in the specified order.
5.4.2 Removing undesired variables
Alternatively, if we want to keep most of the variables, we may instead specify which ones to remove. This can be done with the .drop method. Below we drop a large number of variables:
"name", "sibsp", "boat", "parch", "cabin", "embarked",
titanic.drop(["home.dest", "ticket", "body"],
=1).sample(4) axis
## pclass survived sex age fare
## 106 1 0 male NaN 221.7792
## 822 3 0 male 41.0 7.8500
## 910 3 0 male 17.0 7.1250
## 187 1 1 female 16.0 39.4000
Note that we need to tell .drop that we are talking about removing columns with these names, not about rows with this index. This is what the argument axis=1 does. Otherwise we get an error:
"name", "sibsp", "boat", "parch", "embarked",
titanic.drop(["home.dest"]).sample(4)
## KeyError: "['name' 'sibsp' 'boat' 'parch' 'embarked' 'home.dest'] not found in axis"
In particular, this means that .drop will cause an error if we want to remove a non-existent column. This may cause problems when overwriting data in notebooks: in the first pass you remove a variable, and when you re-run the same cell again, the variable is gone and you get an error. As always, either rename the data, avoid re-running the data cleaning step, or incorporate error handling.
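One way to incorporate such error handling is the errors argument of .drop, which skips names that are not present; a minimal sketch (the second column name is made up for illustration):
## errors="ignore" silently skips columns that do not exist (any more)
titanic.drop(["body", "no-such-column"], axis=1, errors="ignore").head(2)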
Exercise 5.3 You want to plot a histogram of the age distribution for Titanic passengers, separately for males and females.
- What are the relevant variable names?
- How will you select the relevant columns?
See the solution
5.5 Grouped operations
Like other major dataframe libraries, pandas supports grouped operations. Grouped operations are performed essentially like this: first, the data is split into groups by the values of the grouping variable. Second, the requested operations are performed on each group individually, independently of the other groups. And finally, all the results are put together again, together with the group indicators.
Grouping in pandas can be done with the .groupby method. It expects a list of grouping variables. We demonstrate this by computing the survival rate by gender (gender is in variable sex and survival is in survived). By itself, .groupby just creates a grouped data frame that unfortunately does not have any useful printing method:
"sex") titanic.groupby(
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x75e083542ce0>
But we can use the grouped data frame for other operations; in this case we compute the mean of survived:
"sex").survived.mean() titanic.groupby(
## sex
## female 0.727468
## male 0.190985
## Name: survived, dtype: float64
The result is a series with two values (the numeric survival rates), with the index being the group values, female and male. In this example we see that the female survival rate was about 73% while only 19% of men survived the disaster.
We can also group by more than one variable. In that case we have to supply those as a list, e.g. in order to compute the survival rate by gender and passenger class (variable pclass) we can do
"sex", "pclass"]).survived.mean() titanic.groupby([
## sex pclass
## female 1 0.965278
## 2 0.886792
## 3 0.490741
## male 1 0.340782
## 2 0.146199
## 3 0.152130
## Name: survived, dtype: float64
We see that passenger class was also a very important determinant of survival.
.groupby supports more options, e.g. one can keep the group indicators as variables instead of the index.
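For instance, the as_index argument keeps the grouping variables as ordinary columns instead of moving them into the index; a minimal sketch:
## keep 'sex' and 'pclass' as columns in the result
titanic.groupby(["sex", "pclass"], as_index=False).survived.mean()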
5.6 String operations
Pandas opens string variables up to a large list of string functions through the .str attribute. These largely replicate the re module, but the syntax is different, and often the function names are different too.
We walk through a number of examples here, namely:
- str.contains to search for patterns in strings
- str.match to find whether a string begins with a given pattern
- str.replace to replace parts of strings
Many of the string functions take advantage of regular expressions, so it is useful to have an idea of regular expression basics.
Let’s use Titanic data and analyze whether there is a regular pattern of cabin codes and passenger class. First, what are the existing cabin numbers?
titanic.cabin.unique()
## array(['B5', 'C22 C26', 'E12', 'D7', 'A36', 'C101', nan, 'C62 C64', 'B35',
## 'A23', 'B58 B60', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99',
## 'C52', 'T', 'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4',
## 'E40', 'B38', 'E24', 'B51 B53 B55', 'B96 B98', 'C46', 'E31', 'E8',
## 'B61', 'B77', 'A9', 'C89', 'A14', 'E58', 'E49', 'E52', 'E45',
## 'B22', 'B26', 'C85', 'E17', 'B71', 'B20', 'A34', 'C86', 'A16',
## 'A20', 'A18', 'C54', 'C45', 'D20', 'A29', 'C95', 'E25', 'C111',
## 'C23 C25 C27', 'E36', 'D34', 'D40', 'B39', 'B41', 'B102', 'C123',
## 'E63', 'C130', 'B86', 'C92', 'A5', 'C51', 'B42', 'C91', 'C125',
## 'D10 D12', 'B82 B84', 'E50', 'D33', 'C83', 'B94', 'D49', 'D45',
## 'B69', 'B11', 'E46', 'C39', 'B18', 'D11', 'C93', 'B28', 'C49',
## 'B52 B54 B56', 'E60', 'C132', 'B37', 'D21', 'D19', 'C124', 'D17',
## 'B101', 'D28', 'D6', 'D9', 'B80', 'C106', 'B79', 'C47', 'D30',
## 'C90', 'E38', 'C78', 'C30', 'C118', 'D36', 'D48', 'D47', 'C105',
## 'B36', 'B30', 'D43', 'B24', 'C2', 'C65', 'B73', 'C104', 'C110',
## 'C50', 'B3', 'A24', 'A32', 'A11', 'A10', 'B57 B59 B63 B66', 'C28',
## 'E44', 'A26', 'A6', 'A7', 'C31', 'A19', 'B45', 'E34', 'B78', 'B50',
## 'C87', 'C116', 'C55 C57', 'D50', 'E68', 'E67', 'C126', 'C68',
## 'C70', 'C53', 'B19', 'D46', 'D37', 'D26', 'C32', 'C80', 'C82',
## 'C128', 'E39 E41', 'D', 'F4', 'D56', 'F33', 'E101', 'E77', 'F2',
## 'D38', 'F', 'F G63', 'F E57', 'F E46', 'F G73', 'E121', 'F E69',
## 'E10', 'G6', 'F38'], dtype=object)
We can see that the cabin code is typically a letter, followed by two or three digits. We can guess that the letter denotes the deck, and the number is the cabin number on that deck. In several cases, however, there are apparently certain data entry errors, manifested by just a single-letter code, or multiple codes for a single person. (Although passengers may also have booked more than one cabin.)
Assume everyone only has a single cabin. We’ll assign a single cabin code to each passenger. This will be the existing single cabin if coded in that way. If the value of cabin contains multiple codes, then we take the first valid code (i.e. the one in the form of letter + digits).
5.6.1 Find patterns in strings
As the first step, we find the problematic cabin codes. In most cases these contain a space, but there are also single-letter codes which are presumably wrong. We can identify patterns in strings using str.contains(pattern). This function returns a vector of Trues/Falses, depending on whether the original string contains the pattern:
str.contains(" ", na=False, regex=False).head(6) titanic.cabin.
## 0 False
## 1 True
## 2 True
## 3 True
## 4 True
## 5 False
## Name: cabin, dtype: bool
The argument na tells pandas what to do with missing values; here we force them to return False. regex=False means that the pattern we supply is not a regular expression. Plain patterns are simpler and faster to use than regular expressions, but the latter are much more powerful.
We see that neither the first nor the sixth cabin contains a space, but the second through fifth cases do. This can be confirmed by looking at the cabin values:
titanic.cabin.head(6)
## 0 B5
## 1 C22 C26
## 2 C22 C26
## 3 C22 C26
## 4 C22 C26
## 5 E12
## Name: cabin, dtype: object
We can also check how many such problematic cases we have:
str.contains(" ", na=False, regex=False).value_counts() titanic.cabin.
## False 1268
## True 41
## Name: cabin, dtype: int64
We can see we have 41 incorrectly filled cabin numbers.
Note that:
- First, you have to use the .str attribute to open a column up for string operations; plain .cabin.contains() does not work.
- .contains returns a logical value, or NA in case of a missing entry. If we want to use this for indexing later, we want to exclude the missing entries, and hence mark the missings as False (na=False).
- The first argument (pat) is normally a regular expression. However, here we can do with a simple pattern and no regular expression is needed. Hence we say regex=False.
5.6.2 Match a pattern
Next, we use a regular expression that identifies both types of errors: a space in the code, and a missing number. Instead of looking for strings that contain a space, we look for strings that match the valid pattern: a single letter followed by digits. We demonstrate this using str.match, a function similar to str.contains, except that it checks whether the beginning of the string matches the pattern (str.contains looks for the pattern anywhere in the string). This is fortunately easy to do using regular expressions; the pattern can be coded as r"\w\d+$":
str.match(r"\w\d+$", na=False).value_counts() titanic.cabin.
## False 1061
## True 248
## Name: cabin, dtype: int64
str.match always assumes we have a regular expression, so there is no regex argument. The "r" in front of the expression means it is a "raw" string, i.e. the backslashes in the string are interpreted literally, not as special codes. Inside the expression we have \w for a letter (word character), followed by \d+ for one or more digits. The final $ means that the string must end here, i.e. it must not contain extra spaces or other symbols (such as another cabin code).
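To see the pattern in isolation, we can try it on a few hand-picked codes (a toy series built just for illustration; only the first one should match):
## a valid code, a double code, and a single-letter code
pd.Series(["B5", "C22 C26", "T"]).str.match(r"\w\d+$")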
Returning to the value_counts result above: it looks quite different from the one in the previous section, with 248 matching cases compared to the 41 cases we flagged earlier. Note, however, an important difference between the two approaches: when looking for codes that contain a space, we were flagging wrong codes, whereas now we are flagging valid codes. By specifying na=False above, we told pandas to treat missings as non-matching, i.e. as invalid codes. As we are only interested in invalid codes, not in missings, we count missings as valid (na=True) in the example below.
Here is an example of
mismatched cabin codes:
i = titanic.cabin.str.match(r"\w\d+$", na=True)
titanic.cabin[~i].sample(6)
## 140 D10 D12
## 50 B51 B53 B55
## 994 F E46
## 56 B96 B98
## 941 F E57
## 51 B51 B53 B55
## Name: cabin, dtype: object
5.6.3 Replacing strings
Finally, let’s replace the full cabin number with only the deck code.
This can be done by replacing everything that follows the first letter
by an empty string. str.replace
takes two arguments: pattern and
replacement. We specify the former as r"(\w).*"
, i.e. a letter,
followed by an unspecified number of other symbols, and replace it
with just the same letter:
str.replace(r"(\w).*", r"\1").sample(6) titanic.cabin.
## <string>:1: FutureWarning: The default value of regex will change from True to False in a future version.
## 333 D
## 863 NaN
## 1228 NaN
## 1047 NaN
## 953 NaN
## 766 NaN
## Name: cabin, dtype: object
The result contains valid deck codes (and many missings).
In practical terms, it is often useful to print the original and modified codes side by side to see that the coding was done right. We can achieve this by creating a new variable in the data frame, and printing out a sample of the old and new variables (see the section on concatenating data with pd.concat for another approach):
"deck"] = titanic.cabin.str.replace(r"(\w).*", r"\1")
titanic["cabin", "deck"]].dropna().sample(6) titanic[[
## cabin deck
## 145 D33 D
## 514 F2 F
## 222 D43 D
## 183 B101 B
## 21 D35 D
## 1024 F G73 F
We can see that the deck codes (the ones that are not missing) are done correctly.
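As an aside, because we only keep the first character, the same deck code can be extracted with plain string indexing through .str, avoiding the regular expression (and the FutureWarning) altogether; a sketch of the alternative:
## first character of each cabin code; missing cabins stay missing
titanic["deck"] = titanic.cabin.str[0]
## or keep the regex but silence the warning by being explicit
titanic["deck"] = titanic.cabin.str.replace(r"(\w).*", r"\1", regex=True)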