Chapter 7 Descriptive Statistics

This section focuses on describing the data from a statistical viewpoint. We discuss computing means and variances and plotting histograms. However, we do not do inferential statistics. We rely on the Titanic data in these demonstrations.

import numpy as np
import pandas as pd

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic.shape  # (number of rows, number of columns)

7.1 Central Tendency

Central tendency broadly describes the “typical” values in the data, out of all possible ones. Depending on the measure type, we can compute different kinds of statistics.

If we are just handling categorical measures, we cannot really assess centrality. However, we can still assess which values are present in the data, and how common they are. The pandas methods of choice are nunique, unique and value_counts (see Section 5.2.1). The first of these tells how many different values there are, the second lists all the different values, and the third also lists their frequencies. Hence value_counts is the method we can use to see the distribution. Let’s look at the embarkation harbor. This is obviously a categorical variable, and it has the values

titanic.embarked.unique()

(“C” is Cherbourg, “S” is Southampton and “Q” is Queenstown; there are also missing entries.)

titanic.embarked.value_counts()

Out of these three cities, Southampton was by far the most common one with over 900 embarkations, and Queenstown was the least popular.

As one can see, .unique includes missings in the list of returned values, while .nunique and .value_counts ignore them by default.
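The difference between the three methods is easy to check on a small series. The sketch below uses made-up values (the variable harbor is hypothetical, not the Titanic data) and also shows the dropna=False argument of value_counts, which keeps the missings in the table:

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing entry (not the Titanic file)
harbor = pd.Series(["S", "C", "S", "Q", np.nan, "S"])

values = harbor.unique()        # distinct values, the missing one included
n_values = harbor.nunique()     # number of distinct values, missings excluded
counts = harbor.value_counts()  # frequency table, missings excluded
counts_with_na = harbor.value_counts(dropna=False)  # missings get their own row

print(values)
print(n_values)
print(counts)
print(counts_with_na)
```

Here unique returns four entries (three harbors and the missing value), while nunique reports three.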

If data can be ordered, we can also ask for the smallest (.min), largest (.max) and middle (.median) values. In the Titanic data, passenger class (pclass) is essentially an ordered measure, as the classes are clearly ranked. But arithmetic with classes does not make much sense, despite the variable being coded as numeric. First, let’s look at the possible values of pclass:

titanic.pclass.unique()

Now we can also compute the range and the median value:

titanic.pclass.min(), titanic.pclass.max()  # range (as a tuple)
titanic.pclass.median()

The results show that the smallest class code was 1, the largest 3, and at least 50% of passengers were in the 3rd class.

Finally, for interval/ratio measures we can also add the mean to the toolbox. The mean age of passengers was

titanic.age.mean()

The statistical functionality of numpy and pandas handles missings differently for numpy arrays and pandas series. In particular, for series the missings are ignored, but for arrays a single missing value makes the whole result missing. For instance, we create an array and a series with a missing value:

a = np.array([1, 2, np.nan])
s = pd.Series(a)
np.mean(a)  # array: the result is missing
## np.float64(nan)
np.mean(s)  # series: the missing value is ignored
a.mean()
## np.float64(nan)
s.mean()

This is true not just for mean but also for other similar functions, such as median, var or min.

However, this behavior is not completely consistent. For instance, it is not true for percentile:

np.percentile(s, 50)  # compute median: missing not ignored, even for a series
np.percentile(s.dropna(), 50)  # remove the missing value manually
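When working with arrays, an alternative to removing missings by hand is numpy’s family of nan-aware functions, such as np.nanmean, np.nanmedian and np.nanpercentile. A minimal sketch, using the same three-element array as above:

```python
import numpy as np

a = np.array([1, 2, np.nan])

m = np.nanmean(a)              # mean of the non-missing values
med = np.nanmedian(a)          # median of the non-missing values
p50 = np.nanpercentile(a, 50)  # 50th percentile of the non-missing values

print(m, med, p50)
```

All three calls skip the missing value and work on the two remaining numbers only.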

7.2 Variability

Common ways to describe variability include the value table for categorical values, the range for ordered values, and the variance/standard deviation for interval measures. The table of values can be obtained with .unique or .value_counts, see above.

Range can be computed with .min and .max:

titanic.age.min(), titanic.age.max()  # range (as a tuple)

We see that the youngest passenger was 2 months old, and the oldest 80 years old. Variance and standard deviation can be computed with .var and .std:

titanic.age.var()
titanic.age.std()

While variance is hard to interpret, twice the standard deviation, about 28 years, roughly corresponds to the “typical age range” of the passengers.
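One detail worth keeping in mind when comparing results: the pandas methods .var and .std divide by n - 1 by default (ddof=1), while np.var and np.std applied to an array divide by n (ddof=0). The sketch below illustrates this on made-up numbers:

```python
import numpy as np
import pandas as pd

x = np.array([2.0, 4.0, 4.0, 6.0])  # made-up numbers, mean is 4
s = pd.Series(x)

var_pandas = s.var()                  # pandas default ddof=1: 8 / 3
var_numpy = np.var(x)                 # numpy default ddof=0: 8 / 4
var_numpy_sample = np.var(x, ddof=1)  # same divisor as pandas

print(var_pandas, var_numpy, var_numpy_sample)
```

For large samples the difference is negligible, but it is visible for small ones such as this.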

7.3 Distributions

The standard way to plot distributions is to use matplotlib’s hist function. Its main argument is the data, which can be in various forms, e.g. a series or a numpy array. plt.hist gives warnings if it encounters missings in the data, so it is advisable to remove them. Here is the histogram of age:

import matplotlib.pyplot as plt
_ = plt.hist(titanic.age.dropna(), bins=30, edgecolor="k")
_ = plt.show()

The histogram shows that 20-40 year olds were the dominant age group, but there is also a notable bump for young children. One can also adjust various properties, e.g. the number of bins with the argument bins, colors, and more.
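If only the numbers behind a histogram are needed, not the picture, the bin counts and bin edges can be computed with np.histogram. This is a small sketch on made-up ages (the array ages is hypothetical):

```python
import numpy as np

ages = np.array([2, 5, 21, 24, 29, 33, 38, 45, 61])  # made-up values

# Three equal-width bins between the smallest and largest value
counts, edges = np.histogram(ages, bins=3)

print(counts)  # number of observations in each bin
print(edges)   # bin boundaries, one more than the number of bins
```

With three bins there are four edges, and the counts add up to the number of observations.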

Alternatively, one can choose seaborn to plot density:

import seaborn as sns
_ = sns.kdeplot(titanic.age)
_ = plt.show()

The sample quantiles can be computed with the .quantile method, or with the np.quantile and np.percentile functions. (Note: there is no .percentile method for series!)

titanic.age.quantile(0.25)  # first quartile
np.percentile(titanic.age.dropna(), [25, 75])  # first and third quartile

Note that while series’ methods automatically remove missings, one has to do this manually for np.percentile.
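A common source of confusion is the scale of the argument: .quantile and np.quantile expect fractions between 0 and 1, while np.percentile expects percentages between 0 and 100. The following sketch, on made-up values, shows that the three approaches agree once the missing value is removed for the numpy functions:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, np.nan, 30.0, 40.0])  # made-up values

q_series = s.quantile([0.25, 0.75])                 # fractions; missing skipped
q_numpy = np.quantile(s.dropna(), [0.25, 0.75])     # fractions; missing removed by hand
p_numpy = np.percentile(s.dropna(), [25, 75])       # percentages; missing removed by hand

print(q_series.values, q_numpy, p_numpy)
```

All three return the same first and third quartile.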

7.4 Inequality

7.4.1 Pareto ratio

The Pareto ratio can be computed by adding up all the “wealth” of the upper x% of the cases and expressing this as a percentage of the total wealth. The cutoff for the upper part is the corresponding sample quantile, e.g. the upper 20% of cases are those above the 0.8 quantile. Below, we demonstrate some of the computations with the treatment data. This dataset contains information about participation in a labor market training program, but most importantly it also includes income data (re78). The income variable looks like

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.re78.head(5)
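The idea described above can be sketched on made-up numbers before turning to real data. The array income and the variable names below are hypothetical and are not taken from the treatment dataset. The example picks the upper 20% of cases using the 0.8 sample quantile as the cutoff and computes their share of the total:

```python
import numpy as np

# Made-up incomes for ten cases, total is 100
income = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 55], dtype=float)

top_share_of_cases = 0.2                                 # look at the upper 20%
threshold = np.quantile(income, 1 - top_share_of_cases)  # 0.8 sample quantile
top = income[income > threshold]                         # cases above the cutoff
wealth_share = top.sum() / income.sum()                  # their share of the total

print(f"upper {top_share_of_cases:.0%} of cases hold {wealth_share:.0%} of the total")
```

In this toy example the two richest cases hold 64 of the 100 units, so the result reads as a 64/20 split rather than the textbook 80/20.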