Chapter 7 Descriptive Statistics
This section focuses on describing data from a statistical viewpoint. We discuss computing means and variances and plotting histograms, but we do not cover inferential statistics. We rely on the Titanic data in these demonstrations.
7.1 Central Tendency
Central tendency broadly describes the “typical” values in the data, out of all possible ones. Depending on the measure type, we can compute different kinds of statistics.
If we are just handling categorical measures, we cannot really
assess centrality. However, we can still assess what values are there
in data, and how common they are. The pandas’ methods of choice are
nunique, unique and value_counts
(see Section 5.2.1).
The first of these methods tells how
many different values there are, the second one lists all the different
values, and the third one lists the values together with their frequencies.
Hence value_counts is the method we can use to see the distribution.
Let’s look at the embarkation harbor. This is clearly a categorical
variable, and it takes the values “C” (Cherbourg), “S” (Southampton)
and “Q” (Queenstown); there are also missing entries.
Out of these three cities, Southampton was by far the most common one with over 900 embarkations, and Queenstown was the least popular.
Note that .unique includes missing values in the list of returned
values, while .nunique and .value_counts ignore them by default.
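As a minimal sketch of these three methods, the snippet below uses a small made-up series of embarkation codes rather than the actual Titanic data. The name `embarked` and the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation codes (not the real Titanic data), one missing
embarked = pd.Series(["S", "C", "S", "Q", np.nan, "S", "C"])

n_values = embarked.nunique()     # number of distinct non-missing values
values = embarked.unique()        # distinct values, the missing one included
counts = embarked.value_counts()  # frequency of each non-missing value

print(n_values)
print(values)
print(counts)
# Pass dropna=False to count the missing entries as well
print(embarked.value_counts(dropna=False))
```

The `dropna=False` argument is useful when the number of missing entries is itself of interest.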
If data can be ordered, we can also ask for the smallest (.min), largest
(.max) and middle (.median) values. In the Titanic data, passenger
class (pclass) is essentially an ordered measure, as the classes are
clearly ranked. But arithmetic with classes does not make much sense,
even though the variable is coded as numeric.
We can list the possible values of pclass with .unique and compute its
range and median with .min, .max and .median. For the Titanic data, the
smallest class code is 1, the largest is 3, and the median is 3. Hence at
least 50% of passengers were in the 3rd class.
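The following sketch illustrates these calls on a handful of made-up class codes. The values are hypothetical, chosen only so that the median falls in the 3rd class as it does in the text.

```python
import pandas as pd

# Hypothetical passenger classes (illustrative values, not the real data)
pclass = pd.Series([1, 3, 3, 2, 3, 1, 3])

classes = sorted(pclass.unique().tolist())  # the distinct class codes
lowest, highest = pclass.min(), pclass.max()
middle = pclass.median()

print(classes)
print(lowest, highest)
print(middle)
```

Because the median equals the largest code here, at least half of the values must be in that class.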
Finally, for interval/ratio measures we can also add the mean (.mean) to the toolbox, for instance to compute the mean age of passengers.
Numpy and pandas statistical functions handle missing values differently for numpy arrays and pandas series. In particular, for a series, missing values are ignored, but for an array, a single missing value makes the whole result missing. For instance, the mean of an array that contains a missing value is nan, while the mean of a series with the same values is computed over the non-missing entries only.
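A minimal sketch of this difference, using three illustrative values (the names `a` and `s` are just placeholders for an array and a series):

```python
import numpy as np
import pandas as pd

# Same illustrative values stored as an array and as a series
a = np.array([1.0, np.nan, 3.0])
s = pd.Series(a)

array_mean = np.mean(a)              # nan: the missing value propagates
series_mean = s.mean()               # 2.0: the missing value is skipped
series_strict = s.mean(skipna=False) # nan: skipping switched off
nan_aware = np.nanmean(a)            # 2.0: numpy's NaN-ignoring variant

print(array_mean, series_mean, series_strict, nan_aware)
```

So either behavior is available for both types, only the defaults differ.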
This is true not just for mean but also for other similar functions,
such as median, var or min.
However, this behavior is not completely consistent. For instance, it is not true for np.percentile, which returns a missing result for a series as well.
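The inconsistency can be sketched as follows, again with made-up values. The comparison with `np.nanpercentile` and `.quantile` is added here for illustration and is not necessarily what the original example showed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_via_numpy = np.mean(s)              # hands over to the series' own mean
pct_via_numpy = np.percentile(s, 50)     # treats the series as a plain array
pct_nan_aware = np.nanpercentile(s, 50)  # numpy variant that skips NaN
median_via_pandas = s.quantile(0.5)      # series method, skips NaN

print(mean_via_numpy, pct_via_numpy, pct_nan_aware, median_via_pandas)
```

Only `np.percentile` propagates the missing value here; the other three calls work on the non-missing entries.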
7.2 Variability
Common measures of variability include the table of values for
categorical measures, the range for ordered values, and the
variance/standard deviation for interval measures. The table of
values can be obtained with .unique, see
above.
The range can be computed with .min and .max. For age in the Titanic
data, the youngest passenger was 2 months old and the oldest 80
years old.
Variance and standard deviation can be computed with .var and .std.
While variance is hard to interpret, twice the standard deviation, about 28 years, roughly corresponds to the “typical age range” of the passengers.
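The sketch below shows these calls on a few made-up ages (hypothetical values, not the Titanic data). It also highlights one detail worth keeping in mind: pandas divides by n - 1 by default, while numpy divides by n.

```python
import numpy as np
import pandas as pd

# Hypothetical ages in years (illustrative values only)
age = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

age_range = (age.min(), age.max())
sample_var = age.var()            # pandas default: ddof=1 (divide by n - 1)
sample_sd = age.std()             # square root of the sample variance
pop_var = np.var(age.to_numpy())  # numpy default: ddof=0 (divide by n)

print(age_range, sample_var, sample_sd, pop_var)
```

Passing `ddof=1` to `np.var` (or `ddof=0` to `.var`) makes the two agree.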
7.3 Distributions
The standard way to plot distributions is to use matplotlib’s hist
function, for example to plot a histogram of age. Its main argument is
the data, which can be in various forms, e.g. a series or a numpy array.
plt.hist may give warnings if it encounters missing values in the data,
so it is advisable to remove them first (e.g. with .dropna).
The histogram shows that 20- to 40-year-olds were the dominant age group,
but there is also a notable bump for young children.
One can also adjust various properties, e.g. the number of bins with
the argument bins, colors, and more.
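A minimal sketch of such a histogram, assuming a small made-up series of ages with two missing entries (the values, the number of bins and the axis labels are illustrative choices, not taken from the original example):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical ages with two missing entries (not the real Titanic data)
age = pd.Series([0.5, 2, 4, 22, 24, 25, 28, 30, 31, 35, 38, 45, 52,
                 np.nan, np.nan])
age_clean = age.dropna()  # remove missing values before plotting

# plt.hist returns the bin counts, the bin edges and the drawn patches
counts, edges, _ = plt.hist(age_clean, bins=6, edgecolor="black")
plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.show()
```

The returned `counts` and `edges` can be inspected directly if the numbers behind the plot are needed.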
Alternatively, one can use seaborn to plot the density.
The sample quantiles can be computed with the .quantile method, or with
the np.quantile and np.percentile functions. (Note: there is no
.percentile method for series!)
Note that while series’ methods automatically remove missings, one has
to do this manually for np.percentile.
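The three approaches can be sketched side by side. The ages below are made up for illustration. Note that `.quantile` and `np.quantile` expect fractions between 0 and 1, while `np.percentile` expects percentages between 0 and 100.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
age = pd.Series([10.0, 20.0, np.nan, 30.0, 40.0, 50.0])

# Series method: missing values are dropped automatically
q_pandas = age.quantile([0.25, 0.5, 0.75])

# numpy functions: remove missing values by hand first
clean = age.dropna()
q_numpy = np.quantile(clean, [0.25, 0.5, 0.75])
p_numpy = np.percentile(clean, [25, 50, 75])

print(q_pandas)
print(q_numpy)
print(p_numpy)
```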
7.4 Inequality
7.4.1 Pareto ratio
The Pareto ratio can be computed simply by adding up all the “wealth” of the upper x% of the cases and expressing it as a percentage of the total wealth. For the ratio to match, the percentage of the upper part must match the corresponding sample quantile. Below, we demonstrate some of the computations with the treatment data. This dataset contains information about participation in a labor market training program, but most importantly it also includes an income variable (re78).
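The computation described above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the helper `top_share` is a hypothetical name, and the incomes are made-up numbers rather than the re78 variable.

```python
import numpy as np

def top_share(income, top_fraction):
    """Share of total income held by the upper `top_fraction` of cases."""
    income = np.asarray(income, dtype=float)
    income = income[~np.isnan(income)]  # drop missing values
    # Income level that separates the upper part from the rest
    threshold = np.quantile(income, 1 - top_fraction)
    return income[income >= threshold].sum() / income.sum()

# Hypothetical incomes (not the treatment data); they sum to 100
income = np.array([1, 2, 2, 3, 3, 4, 5, 10, 30, 40])
share = top_share(income, 0.2)
print(f"Top 20% hold {share:.0%} of the total")
```

In this toy example the top 20% of cases hold 70% of the total, so the two percentages do not add up to 100. To find the matching Pareto ratio, one would vary `top_fraction` until the share is approximately `1 - top_fraction`. With many tied values at the threshold, the selected part may cover somewhat more cases than requested.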