Chapter 7 Descriptive Statistics
This section focuses on describing the data from a statistical viewpoint. We discuss computing means and variances and plotting histograms. However, we do not do inferential statistics. We rely on the Titanic data in these demonstrations.
import numpy as np
import pandas as pd
titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic.shape
## (1309, 14)
7.1 Central Tendency
Central tendency broadly describes the “typical” values in the data, out of all possible ones. Depending on the measure type, we can compute different kinds of statistics.
If we are just handling categorical measures, we cannot really assess centrality. However, we can still assess what values there are in the data, and how common they are. The pandas methods of choice are nunique, unique and value_counts (see Section 5.2.1). The first of these methods tells how many different values there are, the second one lists all the different values, and the third one also lists their frequencies. Hence value_counts is the method we can use to see the distribution.
Let’s look at the embarkation harbor. This is obviously a categorical variable, and it has the following values:
titanic.embarked.unique()
## array(['S', 'C', nan, 'Q'], dtype=object)
(“C” is Cherbourg, “S” is Southampton and “Q” is Queenstown; there are also missing entries.)
titanic.embarked.value_counts()
## S 914
## C 270
## Q 123
## Name: embarked, dtype: int64
Out of these three cities, Southampton was by far the most common one with over 900 embarkations, and Queenstown was the least popular.
As one can see, .unique includes missings in the list of returned values, while .nunique and .value_counts ignore them.
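The different treatment of missings can be checked on a small example. The sketch below uses a made-up series `harbor` (a toy stand-in, not the Titanic data) and assumes the `dropna` argument of `nunique` and `value_counts` is what we want when missings should be counted too:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the embarkation column (hypothetical values, not the Titanic data)
harbor = pd.Series(["S", "C", np.nan, "Q", "S", "S"])

n_default = harbor.nunique()                 # missing value is not counted
n_with_na = harbor.nunique(dropna=False)     # missing value counts as its own category
counts = harbor.value_counts(dropna=False)   # frequency table that includes NaN

print(n_default, n_with_na)
print(counts)
```

With the default settings the missing entry is silently left out of both the count and the frequency table, so it is worth checking for missings separately before reading the table as the full distribution.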
If data can be ordered, we can also ask for the smallest (.min), largest (.max) and middle (.median) values. In the Titanic data, passenger class (pclass) is essentially an ordered measure, as the classes are clearly ranked. But arithmetic with classes does not make much sense, despite the variable being coded as numeric. First, let’s look at the possible values of pclass:
titanic.pclass.unique()
## array([1, 2, 3])
The range is already visible in the values above (codes 1 to 3), so we only need to compute the median:
titanic.pclass.median()
## 3.0
The results show that the smallest class code is 1 and the largest is 3, and the median of 3 tells us that at least 50% of passengers were in the 3rd class.
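Why a median of 3 implies that at least half of the cases are in class 3 can be seen from the cumulative shares. The following is a minimal sketch on a made-up series of class codes (hypothetical values, not the Titanic data):

```python
import pandas as pd

# Hypothetical class codes, not the Titanic data
pclass = pd.Series([1, 1, 2, 3, 3, 3, 3])

lowest, highest = pclass.min(), pclass.max()   # range of the ordered codes
median = pclass.median()                       # middle value

# Cumulative share of observations at or below each class code
cum_share = pclass.value_counts(normalize=True).sort_index().cumsum()

print(lowest, highest, median)
print(cum_share)
```

In this toy example, classes 1 and 2 together hold less than half of the observations, so the middle value must fall in class 3.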
Finally, for interval/ratio measures we can also add the mean to the toolbox. The mean age of passengers was
titanic.age.mean()
## 29.8811345124283
Numpy and pandas statistical functionality handles missings differently for numpy arrays and pandas series. In particular, for series, missings are ignored, but for arrays, a single missing value makes the whole result missing. For instance, we create an array and a series with a missing value:
a = np.array([1, 2, np.nan])
s = pd.Series(a)
np.mean(a)
## nan
np.mean(s)
## 1.5
a.mean()
## nan
s.mean()
## 1.5
This is true not just for mean but also for other similar functions, such as median, var or min.
However, this behavior is not completely consistent. For instance, it is not true for percentile:
np.percentile(s, 50)  # compute median
## nan
np.percentile(s.dropna(), 50)
## 1.5
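When working with numpy arrays directly, one option is to use numpy's NaN-aware counterparts instead of removing missings by hand. The sketch below uses `np.nanmean`, `np.nanmedian` and `np.nanpercentile` on the same kind of small array as above; whether to skip missings or let them propagate is a judgment call that depends on the analysis:

```python
import numpy as np

a = np.array([1, 2, np.nan])

plain_mean = np.mean(a)             # the missing value propagates: result is nan
nan_mean = np.nanmean(a)            # missing value is skipped
nan_median = np.nanmedian(a)        # same idea for the median
nan_pct = np.nanpercentile(a, 50)   # and for percentiles

print(plain_mean, nan_mean, nan_median, nan_pct)
```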
7.2 Variability
Common ways to describe variability include the value table for categorical values, the range for ordered values, and the variance/standard deviation for interval measures. The table of values can be obtained with .unique, see above. The range can be computed with .min and .max:
titanic.age.min(), titanic.age.max()  # range (as a tuple)
## (0.1667, 80.0)
We see that the youngest passenger was 2 months old, and the oldest 80
years old.
Variance and standard deviation can be computed with .var and .std:
titanic.age.var()
## 207.74897359969756
titanic.age.std()
## 14.413499699923594
While the variance is hard to interpret, twice the standard deviation, about 29 years, roughly corresponds to the “typical age range” of the passengers.
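One detail worth keeping in mind is that pandas and numpy use different default denominators: the series methods `.var` and `.std` divide by n − 1 (`ddof=1`), while `np.var` and `np.std` on arrays divide by n (`ddof=0`). A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5 and squared deviations summing to 32
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
s = pd.Series(x)

var_numpy = np.var(x)              # divides by n:     32 / 8
var_pandas = s.var()               # divides by n - 1: 32 / 7
var_numpy_ddof1 = np.var(x, ddof=1)  # numpy, forced to divide by n - 1
std_pandas_ddof0 = s.std(ddof=0)     # pandas, forced to divide by n

print(var_numpy, var_pandas, var_numpy_ddof1, std_pandas_ddof0)
```

For a sample as large as the Titanic data the two versions differ only marginally, but for small samples the gap is noticeable.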
7.3 Distributions
The standard way to plot distributions is to use matplotlib’s hist function. Below we plot the histogram of age. The main argument of hist is the data, which can be in various forms, e.g. a series or a numpy array. plt.hist gives warnings if it encounters missings in the data, so it is advisable to remove missings:
import matplotlib.pyplot as plt
_ = plt.hist(titanic.age.dropna(), bins=30, edgecolor="k")
_ = plt.show()
The histogram shows that 20-40-year-olds were the dominant age group, but there is also a notable bump for young children. One can also adjust various properties, e.g. the number of bins with the argument bins, colors, and more.
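To see what the choice of bins does to the data, the bin counts can also be computed without plotting. The sketch below uses `np.histogram` on a made-up age vector (hypothetical values, not the Titanic data); the `bins` and `range` values are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical ages including one missing value
ages = np.array([2, 4, 19, 22, 25, 28, 31, 35, 38, 47, 61, np.nan])
ages = ages[~np.isnan(ages)]   # remove missings before binning

# Four equal-width bins covering 0 to 80 years
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

print(counts)   # number of observations in each bin
print(edges)    # bin boundaries; there is one more edge than there are bins
```

Each bin includes its left edge but not its right one, except for the last bin, which includes both.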
Alternatively, one can choose seaborn to plot density:
import seaborn as sns
_ = sns.kdeplot(titanic.age)
_ = plt.show()
The sample quantiles can be computed with the .quantile method, or with the np.quantile and np.percentile functions. (Note: there is no .percentile method for series!)
titanic.age.quantile(0.25)  # first quartile
## 21.0
np.percentile(titanic.age.dropna(), [25, 75])  # first and third quartile
## array([21., 39.])
Note that while the series methods automatically remove missings, one has to do this manually for np.percentile.
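Several quantiles can also be requested in one call. The following sketch, on a made-up series with one missing value, compares the series method `.quantile` with numpy's NaN-aware `np.nanquantile`; both use fractions between 0 and 1, unlike `np.percentile`, which expects values between 0 and 100:

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

# Series method: missings are skipped automatically
quartiles = s.quantile([0.25, 0.5, 0.75])

# numpy on the raw array: use the NaN-aware variant
np_quartiles = np.nanquantile(s.to_numpy(), [0.25, 0.5, 0.75])

print(quartiles)
print(np_quartiles)
```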
7.4 Inequality
7.4.1 Pareto ratio
The Pareto ratio can be computed by adding up all the “wealth” of the upper x% of the cases and expressing it as a percentage of the total wealth. For the ratio to match, the wealth share of the upper part must correspond to the matching sample quantile: the upper x% of cases should hold (100 - x)% of the total. Below, we demonstrate some of the computations with the treatment data. This dataset contains information about participation in a labor market training program, but most importantly it also includes income data (re78). The income variable looks like
treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.re78.head(5)
## 0 9930.05
## 1 24909.50
## 2 7506.15
## 3 289.79
## 4 4056.49
## Name: re78, dtype: float64
The dataset contains 2675 cases. The total income across all these cases is
total = treatment.re78.sum()
total
## 54843856.0104
Let us compute the share of total income earned by the richest 30%. We can use np.percentile or np.quantile to compute the threshold between the lower 70% and the upper 30%:
threshold = np.percentile(treatment.re78, 70)  # 70th percentile: lower bound of the top 30%
threshold
## 26599.1
And now the income of the richest 30% is
top30 = treatment.re78[treatment.re78 > threshold].sum()
top30
## 30274684.2
As a share of total income, this is
share = top30/total
share
## 0.5520159668251451
So in this data, the richest 30% of individuals earn 55% of all income. This is not the Pareto ratio, as in that case they should earn 70% of all income. One can experiment with different numbers to find the exact ratio, or alternatively write a loop that checks every single percentile.
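The steps above can be wrapped into a small helper so that different percentages are easy to try. This is only a sketch: `top_share` is a hypothetical function name, and the incomes are made-up numbers rather than the treatment data.

```python
import numpy as np

def top_share(income, top_fraction):
    """Share of total income earned by the richest `top_fraction` of cases.

    Hypothetical helper: it follows the same steps as the text, i.e. it finds
    the percentile threshold and sums the incomes strictly above it.
    """
    income = np.asarray(income, dtype=float)
    threshold = np.percentile(income, 100 * (1 - top_fraction))
    return income[income > threshold].sum() / income.sum()

# Made-up incomes, not the treatment data
incomes = [10, 20, 30, 40, 100]

share_top20 = top_share(incomes, 0.2)   # richest 20% of cases
share_top40 = top_share(incomes, 0.4)   # richest 40% of cases
print(share_top20, share_top40)
```

In this toy example the richest 20% earn half of all income and the richest 40% earn 70%, so neither split is a Pareto ratio; the matching point, if any, lies where the top x% earn (100 - x)%.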
Exercise 7.1 Find the exact 80-20 ratio for this income data.
Hint: loop over the different percentiles, and see at which percentile we are close to capturing the corresponding income share
See the solution