Chapter 7 Descriptive Statistics

This section focuses on describing the data from statistical viewpoint. We discuss computing means and variances and plotting histograms. However, we do not do inferential statistics. We rely on Titanic data in these demonstrations.

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic.shape
## (1309, 14)

7.1 Central Tendency

Central tendency broadly describes the “typical” values in the data, out of all possible ones. Depending on the measure types, we can compute different kind of statistics.

If we are just handling categorical measures, we cannot really assess centrality. However, we can still assess what values are there in data, and how common they are. The pandas’ methods of choice are nunique, unique and value_counts (see Section 5.2.1). The first of these methods tells how many different values there is, the second one lists all the different values, and the third one also lists their frequency. Hence value_counts is the method we can use to see the distribution. Let’s look at the embarkation harbor, this is obviously a categorical variable, and it has values

titanic.embarked.unique()
## array(['S', 'C', nan, 'Q'], dtype=object)

(“C” is Cherbourg, “S” is Southampton and “Q” is Queenstown, there are also missing entries.)

titanic.embarked.value_counts()
## S    914
## C    270
## Q    123
## Name: embarked, dtype: int64

Out of these three cities, Southampton was by far the most common one with over 900 embarkations, and Queenstown was the least popular.

As one can see, .unique includes missings in the list of returned values while .nunique and .value_counts ignore those.

If data can be ordered, we can also ask for smallest (.min), largest (.max) and middle (.median) values. In Titanic data, passenger class (pclass) is essentially ordered measure as the classes are clearly ranked. But arithmetic with classes does not make much sense despite the variable being coded as numeric. First, let’s look at the possible values of pclass:

titanic.pclass.unique()
## array([1, 2, 3])

Now we can also compute range and median value:

titanic.pclass.median()
## 3.0

The results show that the smallest class code was 1, the largest 3, and at least 50% of passengers were in the 3rd class.

Finally, in case of interval/ratio measures we can also add mean to the toolbox. The mean age of passengers was

titanic.age.mean()
## 29.8811345124283

Numpy and pandas statistical functionality handles missings differently for numpy arrays and pandas series. In particular, in case of series, missings are ignored, but in case of arrays, a missing value results in the whole result to be missing. For instance, we create an array and a series with a missing value:

a = np.array([1, 2, np.nan])
s = pd.Series(a)
np.mean(a)
## nan
np.mean(s)
## 1.5
a.mean()
## nan
s.mean()
## 1.5

This is true not just for mean but also for other similar functions, such as median, var or min.

However, this is behavior is not completely consistent. For instance, it is not true for percentile:

np.percentile(s, 50)  # compute median
## nan
np.percentile(s.dropna(), 50)
## 1.5

7.2 Variability

The favorite information about variability include the value table in case of categorical values, range for ordered values, and variance/standard deviation for interval measures. The table of values can be obtained with .unique, see above.

Range can be computed with .min and .max:

titanic.age.min(), titanic.age.max()  # range (as a tuple)
## (0.1667, 80.0)

We see that the youngest passenger was 2 months old, and the oldest 80 years old. Variance and standard error can be computed with .var and .std:

titanic.age.var()
## 207.74897359969756
titanic.age.std()
## 14.413499699923594

While variance is hard to interpret, two times of standard deviation, 28 years, roughly corresponds to the “typical age range” of the passengers.

7.3 Distributions

The standard way to plot distributions is to use matplotlib’s hist function. The histogram of age. Its main argument is the data that can be in various forms, e.g. a series or a numpy array. plt.hist gives warnings if it encounters missings in the data, so it is advisable to remove missings:

import matplotlib.pyplot as plt

_ = plt.hist(titanic.age.dropna(), bins=30, edgecolor="k")
_ = plt.show()
plot of chunk unnamed-chunk-12

plot of chunk unnamed-chunk-12

The histogram show that 20-40-year olds were the dominating age group, but there is also a notable increase for young children. One can also adjust various properties, e.g. the number of bins with argument bins, colors, and more.

Alternatively, one can choose seaborn to plot density:

import seaborn as sns
## /usr/lib/python3/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.1
##   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
_ = sns.kdeplot(titanic.age)
_ = plt.show()
plot of chunk unnamed-chunk-13

plot of chunk unnamed-chunk-13

The sample quantiles can be computed with .quantile method, or with np.quantile and np.percentile functions. (Note: there is no .percentile method for series!) methods:

titanic.age.quantile(0.25)  # first quartile
## 21.0
np.percentile(titanic.age.dropna(), [25, 75])  # first and third quartile
## array([21., 39.])

Note that while series’ methods automatically remove missings, one has to do this manually for np.percentile.

7.4 Inequality

7.4.1 Pareto ratio

Pareto ratio can be computed simply by just adding all “wealth” of the upper x% of the cases, and computing this as a percentage of total cases. For the ratio to match, the percentage of the upper part must match to the corresponding sample quantile. Below, we demonstrate some of the computatations with treatment data. This dataset contains information about a labor market training program participation, but most importantly it also includes income data (re78). The income variable looks like

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.re78.head(5)
## 0     9930.05
## 1    24909.50
## 2     7506.15
## 3      289.79
## 4     4056.49
## Name: re78, dtype: float64

The dataset contains 2675 cases. The total income across all these cases is

total = treatment.re78.sum()
total
## 54843856.0104

Let us compute the total income share, earned by the richest 30%. We can use np.percentile or np.quantile to compute the threshold between the lower 70% and upper 30% by

threshold = np.percentile(treatment.re78, 70)  # top 30th percentile
threshold
## 26599.1

And now the income of the richest 30% is

top30 = treatment.re78[treatment.re78 > threshold].sum()
top30
## 30274684.2

As share of the total income this is

share = top30/total
share
## 0.5520159668251451

So in this data, the richest 30% of individuals earn 55% of all income. This is not the Pareto ratio, as in that case they should earn 70% of all income. One can experiment with different numbers to find the exact ratio, or alternatively find a loop that just checks every single percentile.

Exercise 7.1 Find the exact 80-20 ratio for this income data.

Hint: loop over the different percentiles, and see at which percentile we are close to capturing the corresponding income share

See the solution