Chapter 12 Statistical Inference
In Section 7 we introduced descriptive statistics. This section is devoted to inferential statistics–using statistical tools to say something about the “whole” based on a sample. As you can imagine, using one thing (the sample) to say something about another thing (the “whole”) is more complicated: it requires both more careful data collection and more complex analytical methods.
12.1 Population and sample
As we briefly discussed in Section 7, inferential statistics is concerned with saying something about the “whole” based on a “sample”. For instance, if you are working on election polling, you may want to use your sample of voters (typically around 1000) to say something about the election outcome (typically decided by millions of voters).
But before we start, let’s make clear what the “whole” and the “sample” are. In the most intuitive case, the “whole” is a population, such as all voters in the case of election forecasting. The sample, in turn, is a small subset of the population, for instance a poll conducted among the voters. In national elections, millions of voters cast their ballots, while forecasters typically work with a sample of only 1000 voters. The population is much, much larger than the poll; in fact, we typically think of the population as being of infinite size.

Knucklebones, like this medieval bone from a shipwreck in Northern Europe, are a form of dice, used to play games. Unlike modern coins, they are likely not fair.
Rijksdienst voor het Cultureel Erfgoed, CC BY-SA 4.0, via Wikimedia Commons

But not all “populations” are like voters. For instance, you may try to determine if a coin is biased. You may flip the coin ten times and count heads and tails. This is a sample of size 10. But what is the “whole” or the “population” here? One could think in terms of all other similar coins out there, but that is not quite right: after all, you flip just a single coin, so the other similar coins play no role. Instead, you can imagine that the “whole” is all possible flips of this same coin. There is an infinite number of coin flips “out there”, and you just sampled 10 of them. This population–the infinite number of flips–is not the flips themselves, it is a property of the coin: either the coin is unbiased (it will tend to give 50% of one side and 50% of the other side when flipped), or it may be biased. So in this case the “population” is not really a population but a property of a single object.21 But fortunately, we can analyze such properties of coins in exactly the same way as we analyze voters.
Finally, there are examples where it is not possible to get a sample of more than one. For instance, what sample might you collect to answer the question: “what is the chance that it will be raining tomorrow?” There will be one and only one tomorrow, and in that tomorrow it will either rain or not. We cannot collect a sample of more than one tomorrow. Even more, if we want to answer this question today in order to decide about a picnic in the park, then we cannot collect even a sample of one. We can still think in terms of all possible tomorrows, but we have to answer without any sample at all.
Below, we limit our discussion to the “voter example”, where there exists a large population–a population much larger than any sample we can realistically collect.
12.2 Different ways of sampling data
For many tasks it is extremely important to know which cases from the population end up in the sample. This process–the criteria that determine which cases end up in the sample–is called sampling. There is a plethora of ways to create a sample, and frequently we do not know what exactly the process was. Here we discuss a few common ways of sampling, and the problems related to the different sampling methods.
12.2.1 Complete sample
A complete sample is the case where we can actually measure every single subject of interest. For instance, we can sample every single student who takes a course. This is perhaps the simplest possible sampling method; in a sense it is not sampling at all–we are observing the whole population instead.
But complete sampling has a number of problems.
First, and most obviously, it may not be feasible to sample everyone. Think about election polls, where the pollsters would have to survey hundreds of millions of people. This is prohibitively expensive.
Fireworks are wonderful, but each rocket can only be used once. Hence you have to rely on a sample to test the products.
Seattle, July 4th 2022.

Second, for many tasks, “sampling” also means destroying the object (this is called destructive testing). Imagine you are working in a factory that produces fireworks. You follow the specifications and the safety protocols–but do your rockets actually work as intended? The only way to find out is to “go bang” and try them out! But you only want to do this with a small sample–a complete sample would mean shooting all your rockets and leaving nothing to sell…
Third, even if you sampled everyone, are you sure that you actually observed everyone? This may sound like semantic nitpicking, but it is actually an important question. The answer depends on what exactly you want to do. If you are only interested in the students in that particular class, in that particular quarter, taught by that particular professor, then observing everyone who takes the class is indeed a complete sample. But often we are interested in a more general question–for instance, we want to know something about all students who take that class, including past and future ones. Sampling everyone in the current quarter is then not a complete sample any more.
This question is also discussed in Section 7.1.
Finally, quite often the problem is not that we cannot sample everyone–it may actually be quite easy to take the resulting uncertainty into account. Instead, the problem is that we do not know who exactly ends up in our sample. This may result in biased data (see Section 12.2.4) and wrong results.
12.2.2 Random sample
The case where all subjects in the population have an equal probability of ending up in the sample is commonly called a random sample.22 This is perhaps the simplest sample one can take, and its properties are well understood. But although it is simple, it may be hard to achieve, and in some cases it may not even be desirable.
An example of random sampling might be a household survey. Out of all households in a city, the survey may randomly select a smaller number (say, 1000). Nowadays, this is done using computers and random numbers, but historically one might have used other tools, for instance by picking random numbers out of a hat. This approach is a good choice when there is a known finite number of households.
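In R, drawing such a random sample is straightforward with the sample() function. Here is a minimal sketch with hypothetical numbers (a city of 250,000 numbered households, of which we survey 1000):

```r
## Hypothetical example: 250,000 numbered households, survey 1000 of them
n_households <- 250000
surveyed <- sample(n_households, 1000)  # 1000 distinct household numbers
length(surveyed)         # 1000 households selected
anyDuplicated(surveyed)  # 0: no household is selected twice
```

By default, sample() draws without replacement, which is exactly what we want here: each household can be surveyed at most once.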
Alternatively, one may decide individually for every single product whether to sample it or not. Imagine a robotic “hand” next to your fireworks production line. Based on a random number, the hand decides for every single firecracker whether to pull it aside into the test sample or not. This approach works well when the number of firecrackers is not fixed but new ones are continuously made, and we want to test, say, one out of every 1000 firecrackers.
A slightly modified random sample involves oversampling. For instance, imagine that your firework rockets work well, and you are happy to test only 0.1% of them. But you have had a lot of trouble lately with poppers, and hence you want to test them more, maybe 1% of poppers. Your poppers will thus be oversampled 10 times compared to the rockets. Data that involves oversampling is also fairly straightforward to analyze, given we know how it was done.
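To see why knowing the oversampling scheme matters, consider a sketch with made-up numbers (the defect rates and sampling rates below are assumptions, not from the text): a raw average over the tested items over-represents the poppers, but weighting each tested item by the inverse of its sampling probability recovers the overall defect rate.

```r
## Hypothetical sketch: rockets are tested at a 0.1% rate, poppers at 1%.
set.seed(1)
products <- data.frame(
   type = rep(c("rocket", "popper"), c(900000, 100000)),
   defect = c(rbinom(900000, 1, 0.01),    # assumed 1% defect rate for rockets
              rbinom(100000, 1, 0.05)))   # assumed 5% defect rate for poppers
p <- ifelse(products$type == "rocket", 0.001, 0.01)  # sampling probabilities
tested <- products[runif(nrow(products)) < p, ]      # the test sample
w <- ifelse(tested$type == "rocket", 1/0.001, 1/0.01)  # inverse-probability weights
mean(tested$defect)              # biased upward: too many poppers in the sample
weighted.mean(tested$defect, w)  # close to the true overall rate, 0.014
```

The weights simply say “each tested rocket stands for 1000 rockets, each tested popper for 100 poppers”.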
12.2.3 Stratified sample
A random sample is easy to understand and work with, but it is not always feasible, and not always desirable either. Imagine we are interested in the position of men and women in the labor market. Typically, women take on more domestic responsibilities than men, while men work more outside of the home. If we sample just single individuals, both men and women, then we do not learn much about their partners’ contribution at home or in the market. We should sample households instead of individuals. This is a stratified sample–first we sample strata, here households, and then we conduct a complete sample within each stratum (interview both the husband and wife, or whoever the adult family members are).
Another somewhat similar example is an analysis of school-age children’s behavior. A lot of activities are shared by friends, and friends tend to attend the same school and be in the same class. So we may have a two-level stratified sample here: first sample schools, thereafter classes within schools, and finally students within classes.
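A two-level version of the schools example can be sketched in R with hypothetical data (20 schools of 100 students each; the school names and sizes are made up). First we sample schools, then students within the chosen schools:

```r
library(dplyr)
## Hypothetical data: 20 schools with 100 students each
set.seed(2)
students <- data.frame(school = rep(paste0("school-", 1:20), each = 100),
                       id = 1:2000)
chosen_schools <- sample(unique(students$school), 5)  # level 1: pick 5 schools
pupils <- students %>%
   filter(school %in% chosen_schools) %>%
   group_by(school) %>%   # level 2: pick students within each chosen school
   sample_n(10) %>%
   ungroup()
table(pupils$school)  # 10 students from each of the 5 sampled schools
```

Note that a student in a sampled school has a much higher chance of ending up in the sample than a student in a non-sampled school, so this is not a random sample of students.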
12.2.4 Representative and biased sample
A random sample is usually the easiest sample to work with if we want to do statistical inference–to learn something about the world, not just about the sample. It has a number of well-known properties: for instance, the sample proportion tends to be similar to the population’s proportion, and a larger sample gives more precise and trustworthy results. Such samples are called representative samples–they “represent” everyone in the population in a similar fashion.
The cases with oversampling or with stratified samples are not too complicated either, although you may need to correct the sample averages if you want to compute the population average. If you know how the strata have been chosen, or how the oversampling was done, such a correction is not hard to do.
Unfortunately, we often have to rely on data where we do not know how it was sampled. This may happen for different reasons, e.g. because of missing documentation. Nowadays it is easy to collect various kinds of data, but much harder to understand the sampling. Below we discuss a few examples.
- Product reviews, such as Amazon reviews. It is tempting to count review stars as an indication of product quality, and in a way they are. But which cases are sampled? Is this a representative sample? The answer, unfortunately, is “probably not”. We can guess that the reviews are written by people who are either very unhappy with the product, or very happy with it, or by someone who just loves to express their opinion even if they happen to have none. The majority who are “just happy” may not bother to write. Unfortunately, we do not know how big the “silent majority” is, or whether the considerations above are even correct. We just do not know, and hence we should be very careful when assuming that the reviews reflect the true product quality.
- Population surveys, such as election polls. Although pollsters do their best to ensure the surveys are representative and to document the sampling procedure well, this cannot be done perfectly. They typically do not have access to complete population registries but only to proxies, such as phone books. Different people have different inclinations to fill out surveys; and if they do, they may or may not tell the truth. They may make their decision at the last minute, and they may change their mind after the poll. All of this is probably correlated with their favored candidates. And none of it can easily be taken into account. As a result, despite all the efforts, surveys and polls often produce wrong results.
- Social media results. It is tempting to generalize from your friends and from your social media feed. But this may be grossly misleading–people tend to have friends who are similar to themselves in many ways, and even if none of your friends shares certain political viewpoints, that does not mean that those viewpoints are not well represented in the population.
The previous examples were examples of a biased sample–a sample that is not representative, and where we do not know how to correct for the bias. In all these cases one should be very careful when generalizing from the sample. For instance, in the case of product reviews, one might claim that “Amazon reviewers prefer product X to Y” instead of saying that “X is better than Y”.
Another common trait of biased samples is that larger samples will not necessarily produce better results. Big is not always better.
12.3 Example: election polls
Let us start the analysis of statistical inference with an example of election polling. Elections are quite important events in all democracies, and one can always see many polling results in media. Analysts and politicians frequently make their predictions or policy decisions based on such polls, so polling results are taken seriously.
The actual voting systems, sampling, and preferences are quite complicated, but let’s make it simple for us here. Assume:
- there are only two candidates (call them Chicken (C) and Egg (E)). The candidate who gets more votes will win the post.
- every voter prefers one of these two candidates, there are no undecided voters
- every voter truthfully tells their preference to the pollster
- every voter has an equal chance to be sampled by the polling firm
We would like to demonstrate the following steps using data about all votes cast. Such data is not available–but actually, we do not even need it. This is because we know that everyone voted either C or E, and hence the complete dataset would consist of millions of lines of C-s and E-s. As we do not care about voter identity, the order of these C-s and E-s does not matter. What matters is just the final count–how many voted for Chicken and how many for Egg. And such counts are easily available; after all, the winner is called based on exactly such data.
Now we artificially create such a voter dataset. Assume we have 1M voters–we can imagine a smallish country or state. 1M is a large enough number for what we do below, and a large sample will unnecessarily strain the computer. This will be a data frame with 1M rows and a single variable “vote”. The rows represent voters, and “vote” can have values “C” for Chicken and “E” for Egg. Finally, in order to actually create the data, assume that 60% of voters prefer C.
We create such a data frame randomly and call it “votes”. It can be created as
votes <- data.frame(
   vote = sample(c("C", "E"),     # possible votes
                 size = 1e6,      # how many votes
                 replace = TRUE,  # more than one C/E vote
                 prob = c(0.6, 0.4))  # probability for C/E
)
See Appendix 12.6 for more details about how this is created. Here is a sample of the dataset:
votes %>%
   sample_n(4)
## vote
## 1 C
## 2 C
## 3 C
## 4 C
In this small sample of 4, we see that all voters supported C.
But what will such a small sample tell us? Can we say, based on the sample, that C is going to win? Intuitively, just asking four voters about their preferences is not going to tell us much about the whole electorate. But how much exactly is it telling us?
The how much question is not trivial to answer. To begin with, what would an acceptable answer even look like? Would “C will win” be an acceptable answer? Maybe… But such a small sample will not be able to give such an answer. What about “C may win”? Well, that is probably correct, but not informative… After all, we know anyway that both candidates can win… It turns out that a useful answer–and the only useful answer we can give based on a sample–is something like “we are 95% certain that C will win”. Next, we’ll play with such samples, and show how we can arrive at similar conclusions.
Obviously, no serious polling firm will run a poll of only 4 respondents. (But you may hear claims like “everyone I know is voting for C”.) So let’s take a sample of 100, and compute C’s vote share there:
votes %>%
   sample_n(100) %>%                     # sample of 100
   summarize(pctC = mean(vote == "C"))   # percentage voting C
## pctC
## 1 0.6
(See 12.6 for explanations.)
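A single poll of 100 voters gives just one number. To get a feeling for how much such estimates vary from poll to poll, we can repeat the sampling many times; a sketch, re-creating the votes data so it is self-contained:

```r
## Repeat the 100-voter poll 1000 times and look at the spread of the
## estimated C vote share.
set.seed(3)
votes <- data.frame(vote = sample(c("C", "E"), 1e6, replace = TRUE,
                                  prob = c(0.6, 0.4)))
pctC <- replicate(1000, mean(sample(votes$vote, 100) == "C"))
mean(pctC)                       # centers near the true share 0.6
quantile(pctC, c(0.025, 0.975))  # individual polls spread roughly from 0.5 to 0.7
```

So while the polls are right on average, a single poll of 100 can easily be off by several percentage points.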
Summary
sample: the information we collect about the process we want to analyze. The sample is often the same as “data”–this is what we know about the process.
sampling is the process that describes how certain cases end up in the sample. It is typically a stochastic process, where different cases may have similar or different probabilities of ending up in the sample.
There are different ways of sampling:
- complete sample: include everything
- random sample: everyone has the same probability to be in the sample. This is the easiest sample to work with.
- stratified sample: “multi-level” sampling, for instance first you sample schools, and thereafter students within the schools
- representative sample and biased sample: representative samples tend to give correct results; to get correct results with a biased sample, you need to take the bias into account.
population: often it is only feasible to collect data on (to measure) a small number of the objects we are interested in. All the objects of interest together form the population. But there are processes where the population is not a large number of objects but a certain property instead.
12.6 Appendix: random numbers
R makes it easy to create a variety of random numbers and other random values. Here we discuss just a few: random integers, random values, and uniformly and normally distributed random numbers.
12.6.1 Random integers
Random integers can be created as sample(K, N, replace = TRUE). This creates N random integers between 1 and K. For instance, let’s create 10 integers between 1 and 3:
sample(3, 10, replace = TRUE)
## [1] 2 3 2 1 3 1 1 3 1 3
replace = TRUE means that all these numbers can occur more than once. If you leave this out, then you’ll get an error in this case, because you cannot draw 10 different random numbers out of 1, 2, 3:
sample(3, 10)
## Error in sample.int(x, size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
But at other times you may need to ensure that every number is selected only once. For instance, if you want to allocate people to seats in a random fashion, then you want to assign only one person to each seat!
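The seat allocation can be sketched as follows (the names are made up): sample(5) without replacement returns a random permutation of 1…5, so every seat is used exactly once.

```r
## Assign 5 people to 5 seats at random; sample(5) returns a permutation
people <- c("Ann", "Ben", "Cat", "Dan", "Eve")
seats <- sample(5)  # a random ordering of seats 1..5
data.frame(person = people, seat = seats)
```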
12.6.2 Other random values
sample() has a slightly different version where you can randomly select any values, not just the numbers \(1\dots K\). For instance
sample(c("A", "B"), 10, replace = TRUE)
## [1] "B" "A" "A" "B" "A" "B" "A" "A" "A" "B"
will create a sequence of “A”-s and “B”-s, selected randomly. This is mostly similar to creating random integers; just instead of the largest number \(K\), you need to supply a vector of values, here c("A", "B"). The function c() creates vectors, and "A" and "B" are the values you select from.
sample() also has other arguments; for instance, you can specify different probabilities for different values.
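The probabilities are given through the prob argument, one probability per value. A small sketch: draw 1000 letters where “A” is nine times as likely as “B”:

```r
set.seed(4)
x <- sample(c("A", "B"), 1000, replace = TRUE, prob = c(0.9, 0.1))
table(x)  # roughly 900 "A"-s and 100 "B"-s
```

This is exactly the mechanism used to create the votes dataset above, with prob = c(0.6, 0.4).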
Uniform random numbers are fractions, uniformly distributed between 0 and 1. You can create \(N\) of those with runif(N). Here are 5 uniform random numbers:
runif(5)
## [1] 0.1975168 0.6138803 0.9076867 0.1453091 0.8720293

Histogram of 1000 random uniform numbers. They can have any value between 0 and 1.
If you make a histogram of these values, the result will look like a brick:
hist(runif(1000))
The histogram covers values from 0 to 1, and the bars that are about equal height show that different numbers in this range are equally likely.
See Section 10.2.1 for more about histograms.
12.6.4 Normal random numbers
Normally distributed random numbers are similar to the uniform ones in the sense that they can take any value. But unlike the uniform, they do not have a lower or upper limit. Instead, the values around zero are more common and the values further away increasingly less common.
You can create normal random numbers with rnorm(N); for instance, here are five normal numbers:
rnorm(5)
## [1] 2.0429404 1.0126973 -0.7434544 -0.9344093 -0.4591088

Histogram of 1000 random normal numbers. These can have any value, but values away from zero are increasingly less likely. This results in a well-known bell-shaped histogram.
The histogram of the normal numbers looks like the well-known bell curve:
hist(rnorm(1000))
It clearly reveals that the most common values are near zero, but other values, both positive and negative, are also possible. In this figure, the smallest values are around -3.5 and the largest ones approximately +3.5.
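rnorm() also accepts a mean and a standard deviation (the defaults are mean = 0 and sd = 1). A small sketch, with arbitrarily chosen values: numbers centered around 100 with a spread of 15:

```r
set.seed(5)
x <- rnorm(1000, mean = 100, sd = 15)
mean(x)  # close to 100
sd(x)    # close to 15
```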
12.6.5 Replicable random numbers
Random numbers are, well, random. If you generate them again, they end up being different. For instance, the first batch is
rnorm(4)
## [1] 0.60607002 0.12193154 -0.03561622 0.72571899
but the second batch is clearly different:
rnorm(4)
## [1] -0.09233786 0.18881449 0.81680946 -0.56075238
This is sometimes exactly what you want–after all, there is little reason to make “random” numbers that are exactly the same. But at other times this creates problems. For instance, if you discuss the smallest or the largest values in the text, they will change each time you run your code, and hence you would need to change the text accordingly each time.
As a solution, you can fix the random number seed. This is the “initial value” of the random number generator, and after fixing the seed, the numbers come out exactly the same. The seed can be fixed with set.seed(S), where S is a number–different S values correspond to different random number sequences, but the same S value will always give you the same numbers.
Here is an example where we generate the same sequence of numbers twice. Pick seed 7:
set.seed(7)
rnorm(4)
## [1] 2.2872472 -1.1967717 -0.6942925 -0.4122930
If you now just create a new sequence of numbers without re-setting the seed to seven, they end up different:
rnorm(4)
## [1] -0.9706733 -0.9472799 0.7481393 -0.1169552
But if you want to replicate the first sequence, you can set the seed to seven again:
set.seed(7)
rnorm(4)
## [1] 2.2872472 -1.1967717 -0.6942925 -0.4122930
Now the results are exactly the same as above.
TBD: create data frames
TBD: compute proportions