Chapter 12 Statistical Inference
In Section 7 we introduced descriptive statistics. This section is devoted to inferential statistics–using statistical tools to say something about the “whole” based on a sample.
12.1 Population and sample
As we briefly discussed in Section 7, inferential statistics is concerned with saying something about the “whole” based on a “sample”. For instance, below we demonstrate, based on examples, how you can say something about the election results (the “whole”) based on a single poll (the sample).
But before we start, we need to discuss what the “whole” and the “sample” are. In the most intuitive case, the “whole” is a population, such as all eligible voters. The sample, in turn, is a small subset of the population, for instance the respondents of a poll conducted among the voters. For instance, the U.S. has approximately 200 million eligible voters, while a typical poll asks the preferences of only 1000 respondents. The population is much larger than the poll; in fact, we typically treat the population as if it were of infinite size.
But not all “populations” are like voters. For instance, if you get a biased coin, you can flip it, say, 10 times and count heads and tails. Or imagine that instead of a coin we flip a bottle cap (there are good reasons to think that it may be biased). So you flip the bottle cap 10 times. This is a sample of size 10. But what is the “whole” here? One might think of all similar bottle caps out there, but that is not the right “whole”: you flip just a single cap, so the other caps are not relevant. Instead, you can imagine that the “whole” is all possible flips of the same bottle cap, as if there were an infinite number of flips “out there” and you just sampled 10 of those. This population, the infinite number of flips, is not an actual collection of flips; it is just a property of the bottle cap. Either the cap is unbiased (it tends to land lower side up 50% of the time and upper side up 50% of the time), or it is biased. So in this case the “population” is not really a population but a property of a single object. Fortunately, even though the infinite number of flips does not actually exist, we can analyze the bottle cap in exactly the same way as voters.
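The idea that ten flips are a sample from "all possible flips" can be illustrated with a small simulation. This is only a sketch: the probability `p_up` is a made-up value standing in for the unknown property of the cap, and the names `p_up` and `flips` are chosen here for illustration.

```r
set.seed(1)  # fixed seed only so that the sketch is repeatable

# Hypothetical property of the bottle cap: probability of landing upper side up.
# The value 0.55 is an assumption made for illustration, not a measured number.
p_up <- 0.55

# A sample of size 10: ten flips of the same cap
flips <- sample(c("up", "down"), size = 10, replace = TRUE,
                prob = c(p_up, 1 - p_up))

flips
mean(flips == "up")  # share of "up" in this sample; it need not be close to p_up
```

In real life we only see `flips`; the value of `p_up` is exactly what we would like to learn about.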
Finally, there are cases where it is not possible to get a sample of more than one. For instance, consider the question “what is the chance that it will rain tomorrow?” There will be one and only one tomorrow, and in that tomorrow it will either rain or not. We cannot collect a sample of more than one tomorrow. Sure, there is a new tomorrow after tomorrow, but that is a different day, not tomorrow. Even more, if we want to answer this question today in order to decide about a picnic in a park, then we do not have even a sample of one. We can still think in terms of all possible tomorrows, but we have to answer the question without any sample at all, based only on our weather forecast models.
Below, we limit our discussion to the “voter example”, that is, to the case where there in fact exists a large population, one that is much larger than any feasible sample we can collect.
12.2 Different ways of sampling data
Sampling describes which cases from the population end up in the sample. There is a plethora of ways to sample, and in many cases we do not even know how the data was sampled. Next, we discuss a few common sampling schemes and related problems.
12.2.2 Random sample
The case where all subjects in the population have equal probability of ending up in the sample is commonly called a random sample. This is perhaps the simplest sample one can take, and its properties are well understood. But it may sometimes be hard to achieve, and in other cases it may not even be desirable.
TBD: replacement/no replacement
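As a minimal sketch of the difference between sampling with and without replacement, base R's `sample()` can do both through its `replace` argument. The tiny population and the names `population`, `s_without` and `s_with` are made up for illustration.

```r
set.seed(2)  # fixed seed only so that the sketch is repeatable

population <- 1:10  # a tiny hypothetical population of 10 subjects

# Without replacement: each subject can be selected at most once
s_without <- sample(population, size = 5, replace = FALSE)

# With replacement: the same subject may be selected more than once
s_with <- sample(population, size = 5, replace = TRUE)

s_without
s_with
anyDuplicated(s_without)  # 0 means that no subject appears twice
```

When the population is much larger than the sample, the two schemes give practically the same results, because drawing the same subject twice is very unlikely.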
12.2.3 Stratified sample
But random sampling is not always feasible, nor even desirable. Imagine we are interested in the position of men and women on the labor market. Typically, women take on more domestic responsibilities than men, while men work more outside the home. If we now sample just single individuals, both men and women, then we do not learn much about their partners’ contribution at home or on the market. We should sample families instead of individuals. This is a stratified sample in the sense used here: first we sample the groups, here families, and then we conduct a complete sample within each sampled group (interview both the husband and wife, or whoever the adult family members are). Note that the survey literature usually calls this design a cluster sample, and reserves “stratified” for designs where a separate sample is drawn from every group.
Another similar example is the analysis of school children. A lot of activities are shared by friends, and friends tend to attend the same school and be in the same class. So we may have a multi-level sample here: first sample schools, thereafter classes within schools, and finally students within classes.
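The family-based design described above can be sketched in two steps: sample families, then keep everyone in the sampled families. The data frame `people` below is made-up data, and all names in the sketch are chosen for illustration only.

```r
set.seed(3)  # fixed seed only so that the sketch is repeatable

# Hypothetical population: 6 families with 2 adults each (made-up data)
people <- data.frame(
  family = rep(1:6, each = 2),
  person = 1:12
)

# Step 1: sample families, not individuals
sampled_families <- sample(unique(people$family), size = 3)

# Step 2: keep every member of the sampled families
family_sample <- people[people$family %in% sampled_families, ]
family_sample
```

Every sampled family appears with both of its members, which is what lets us compare partners within the same household.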
12.3 Example: election polls
Let us start the analysis of statistical inference with an example of election polling. Elections are important events in all democracies, and one can always see many polling results in the media. Analysts and politicians frequently base their predictions and policy decisions on such polls, so polling results are taken seriously.
The actual voting systems, sampling, and preferences are quite complicated, but let’s make it simple for us here. Assume:
- there are only two candidates (call them Chicken (C) and Egg (E)). The candidate who gets more votes will win the post.
- every voter prefers one of these two candidates; there are no undecided voters
- every voter truthfully tells their preference to the pollster
- every voter has equal chance to be sampled by the polling firm
We would like to demonstrate the following steps using data about all votes cast, but such data is not available. In fact, we do not even need it, because we know that everyone voted either C or E. Hence the complete dataset would consist of millions of lines of C-s and E-s. As we do not care about voter identity, the order of these C-s and E-s does not matter. What matters is just the final count: how many voted for Chicken and how many for Egg. And such data is easily available; after all, the winner is called based on exactly such data.
Now we artificially create such a voter dataset. Assume we have 1M voters; we can imagine a smallish country or state. 1M is a large enough number for what we do below, and a larger dataset would unnecessarily strain the computer. This will be a data frame with 1M rows and a single variable “vote”. The rows represent voters, and “vote” can have values “C” for Chicken and “E” for Egg. Finally, in order to actually create the data, assume that 60% of voters prefer C.
We create such a data frame randomly and call it “votes”. It can be created as
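The claim that only the final counts matter can be made concrete: given the two counts, one can always write down a complete list of votes in some arbitrary order. A minimal sketch, where the counts `n_C` and `n_E` are made-up numbers and not actual election results:

```r
# Hypothetical final counts (made-up numbers for illustration)
n_C <- 600
n_E <- 400

# One possible "complete dataset": the order of the votes is arbitrary
all_votes <- rep(c("C", "E"), times = c(n_C, n_E))

table(all_votes)        # recovers the two counts
mean(all_votes == "C")  # C's vote share, here 0.6
```

Any reshuffling of `all_votes` gives the same table and the same vote share.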
votes <- data.frame(
  vote = sample(c("C", "E"),        # possible votes
                size = 1e6,          # how many votes
                replace = TRUE,      # more than one C/E vote
                prob = c(0.6, 0.4))  # probability for C/E
)
See Appendix 12.6 for more details about how this is created. Here is a sample of the dataset:
## vote
## 1 C
## 2 C
## 3 C
## 4 C
In this small sample of 4, we see that all voters supported C.
But what will such a small sample tell us? Can we say, based on the sample, that C is going to win? Intuitively, just asking four voters about their preferences is not going to tell us much about the whole electorate. But how much exactly is it telling us?
The “how much” question is not a trivial one to answer. To begin with, what would an acceptable answer even look like? Would “C will win” be an acceptable answer? Maybe… But such a small sample will not be able to give such an answer. What about “C may win”? Well, that is probably correct, but not informative… After all, we know anyway that both candidates can win… It turns out that a useful answer, and the only useful answer we can give based on a sample, is something like “we are 95% certain that C will win”. Next, we’ll play with such samples and show how we can get to similar conclusions.
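A rough way to see how little four answers tell us is to ask how likely it is to get four C supporters in a row under different true vote shares. The sketch below assumes that the four respondents are drawn independently; the second scenario, where only 40% support C, is a hypothetical comparison that does not appear in the data above.

```r
# Probability that all 4 sampled voters support C,
# for two assumed values of C's true vote share
p_if_C_leads <- dbinom(4, size = 4, prob = 0.6)  # 0.6^4 = 0.1296
p_if_E_leads <- dbinom(4, size = 4, prob = 0.4)  # 0.4^4 = 0.0256

p_if_C_leads
p_if_E_leads
```

So four C-s in a row is not a very likely outcome even when C leads 60 to 40, and it can also happen, although more rarely, when C is in fact losing.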
Obviously, no serious polling firm will do polls of only 4 respondents. (But you may hear claims like “everyone I know is voting for C”.) So let’s take a sample of 100, and compute C’s vote share there:
library(dplyr)  # provides %>%, sample_n() and summarize()
votes %>%
  sample_n(100) %>%                    # sample of 100
  summarize(pctC = mean(vote == "C"))  # share (proportion) voting C
## pctC
## 1 0.6
(See 12.6 for explanations.)
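A single poll of 100 will not always give the same share. As a minimal sketch of this variation, we can repeat the poll many times on a population built with the same assumptions as above (1M voters, 60% for C). The names `vote` and `shares`, and the choice of 1000 repetitions, are made up for this illustration.

```r
set.seed(4)  # fixed seed only so that the sketch is repeatable

# Recreate a population like the one in the text: 1M voters, 60% for C
vote <- sample(c("C", "E"), size = 1e6, replace = TRUE, prob = c(0.6, 0.4))

# 1000 independent polls of 100 respondents each; store C's share in every poll
shares <- replicate(1000, mean(sample(vote, 100) == "C"))

range(shares)  # smallest and largest share among the polls
sd(shares)     # typical spread; roughly sqrt(0.6 * 0.4 / 100), about 0.049
```

The shares cluster around 0.6, but individual polls can miss the true value by several percentage points.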