Significance testing
AKA hypothesis testing
can we reject sampling error as an explanation for the
result?
may only be applied to data based on probability samples or random assignment
steps
1) define null and alternative hypotheses
-
null hypothesis - no relationship, no difference, chance,
sampling error
-
alternative hypothesis - some relationship/difference, not
chance, not sampling error
-
difference between men and women in ideal number of children,
1996 GSS
-
null = no difference, alternative = men and women differ
2) summarize data and generate the test statistic
-
quantity used to determine probability of observing sampled
data if null hypothesis true for population (e.g., t-statistic, chi-square
statistic, z-score, etc.)
3) determine probability of observing sample data
if null hypothesis true in population
-
probability value (p-value) - probability of a result at least as
extreme as the one observed if null hypothesis were true
-
does NOT represent probability that alternative hypothesis
is true
-
sex difference in ideal number of children: p = .26
4) make a decision
-
reject or fail to reject null hypothesis based on 3) (the null is never "accepted")
-
requires specifying, before analysis, probability level
that is low enough to reject null hypothesis (alpha level)
-
.05 = social/behavioral science standard
-
if null hypothesis is not rejected, then result is "statistically
nonsignificant"
-
if null hypothesis is rejected, then result is "statistically
significant"
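The four steps above can be sketched in a few lines of Python. This is only an illustration, assuming a large-sample two-sided z-test for a difference in means. The function name and the summary statistics are hypothetical, chosen for illustration, and are NOT the actual 1996 GSS values.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b, alpha=0.05):
    """Large-sample z-test of H0: the two population means are equal."""
    # Step 1 is implicit: H0 = no difference, H1 = some difference (two-sided).
    # Step 2: test statistic = observed difference / its standard error.
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / se
    # Step 3: probability of a difference at least this extreme if H0 is true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 4: compare p with the alpha level chosen before the analysis.
    decision = "reject H0" if p < alpha else "fail to reject H0"
    return z, p, decision

# Hypothetical summary statistics (means, SDs, group sizes), not GSS data.
z, p, decision = two_sample_z_test(2.55, 1.0, 400, 2.47, 1.0, 400)
print(f"z = {z:.2f}, p = {p:.2f}, decision: {decision}")
```

With real data, a t-test or chi-square test may be the better choice; the z-test is used here only because it needs nothing beyond the standard library.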
relationship to confidence intervals
-
if CI for a given level of confidence includes value representing
null hypothesis, then result is statistically nonsignificant at alpha level
of 1 - confidence level
-
if CI does not include value representing the null
hypothesis, then result is statistically significant at same alpha level
-
sex difference in number of children, 1996 GSS
-
95% CI for difference = .15 to .39
-
null hypothesis = no difference
-
0 not included in CI, therefore, reject null hypothesis
-
statistically significant difference, p < .05
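The link between a CI and a significance test can be checked numerically. The sketch below assumes the interval is a symmetric normal-approximation interval (an assumption; the notes do not say how the CI was computed) and backs out the estimate, standard error, and two-sided p-value from the reported limits of .15 and .39. The function name `ci_to_test` is hypothetical.

```python
from statistics import NormalDist

def ci_to_test(lower, upper, confidence=0.95, null_value=0.0):
    """Recover estimate, SE, and two-sided p-value from a symmetric normal CI."""
    nd = NormalDist()
    alpha = 1 - confidence
    z_crit = nd.inv_cdf(1 - alpha / 2)        # about 1.96 for a 95% CI
    estimate = (lower + upper) / 2            # midpoint of the interval
    se = (upper - lower) / (2 * z_crit)       # half-width divided by z_crit
    z = (estimate - null_value) / se
    p = 2 * (1 - nd.cdf(abs(z)))
    excludes_null = not (lower <= null_value <= upper)
    return estimate, se, p, excludes_null

estimate, se, p, excludes_null = ci_to_test(0.15, 0.39)
print(f"estimate = {estimate:.2f}, SE = {se:.3f}, p = {p:.1e}")
print("CI excludes 0:", excludes_null, "| p < .05:", p < 0.05)
```

Under these assumptions the two criteria always agree: the CI excludes the null value exactly when p falls below 1 - confidence level.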
type I error - rejecting null hypothesis when it is true
-
alpha level (not the p-value) = probability of type I error when null hypothesis is true
-
reduce chance of type I error by keeping alpha level low
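A small simulation can illustrate how the alpha level governs the type I error rate. The sketch below assumes two groups drawn from the same normal population (so the null hypothesis is true by construction) and a z-test with known standard deviation. The function name, sample size, number of simulations, and seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(n_sims=2000, n=30, alpha=0.05, seed=0):
    """Share of simulated studies that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    nd = NormalDist()
    rejections = 0
    for _ in range(n_sims):
        # Both groups come from the same population: mean 0, sd 1.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # Known sd = 1, so the SE of the difference in means is sqrt(2/n).
        z = (sum(a) / n - sum(b) / n) / sqrt(2 / n)
        p = 2 * (1 - nd.cdf(abs(z)))
        if p < alpha:
            rejections += 1          # a type I error
    return rejections / n_sims

rate_05 = false_positive_rate(alpha=0.05)
rate_01 = false_positive_rate(alpha=0.01)
print(f"alpha = .05: {rate_05:.3f} of true nulls rejected")
print(f"alpha = .01: {rate_01:.3f} of true nulls rejected")
```

The rejection rates should land close to the chosen alpha levels, which is why lowering alpha lowers the chance of a type I error.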
type II error - failing to reject null hypothesis when
it is false (i.e., when alternative hypothesis is true)
-
can't specify likelihood of type II error without assuming a specific
alternative, because alternative hypothesis really is infinite set of
alternative hypotheses
-
null = no difference (e.g., Gorton/Cantwell tied)
-
alternative = some difference (e.g., Cantwell 50.1%, 50.2%,
50.3%, etc.)
-
power of a test = probability of rejecting the null hypothesis
when the alternative hypothesis is true (1 - probability of type II error)
-
can be increased by increasing sample size
-
precision increases with sample size (CI narrows)
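Power can be computed only after one specific alternative is assumed. The sketch below does this for a two-sided z-test comparing two group means with a known, common standard deviation. The function name, the assumed true difference of 0.2, the standard deviation of 1.0, and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample_z(delta, sd, n_per_group, alpha=0.05):
    """Power of a two-sided z-test when the true mean difference is delta."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = sd * sqrt(2 / n_per_group)   # SE of the difference in means
    shift = delta / se                # true difference in SE units
    # Probability the test statistic falls in either rejection region.
    return nd.cdf(-z_crit + shift) + nd.cdf(-z_crit - shift)

for n in (25, 100, 400):
    power = power_two_sample_z(delta=0.2, sd=1.0, n_per_group=n)
    print(f"n per group = {n:3d}: power = {power:.2f}, "
          f"type II error probability = {1 - power:.2f}")
```

Larger samples shrink the standard error, so the same true difference is detected more often. When delta is 0 the function returns alpha, the type I error rate.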
Factors influencing statistical significance
p-value a function of: sample size x magnitude of difference/relationship
-
very weak relationships can be "statistically significant"
as a result of large sample size
-
GSS, 1972-96: race x living alone
-
% living alone: black = 22.0, white = 20.1
-
relative risk (RR) of living alone for blacks vs. whites = 22.0/20.1 = 1.09
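To see how sample size alone can drive the p-value, the sketch below holds the two percentages above fixed and varies the group size in a pooled two-proportion z-test. The group sizes are hypothetical (the notes do not report the actual GSS counts), as is the function name.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

relative_risk = 0.220 / 0.201   # same weak relationship at every sample size
print(f"RR = {relative_risk:.2f}")

# Hypothetical equal group sizes; only n changes, not the percentages.
for n in (500, 5000, 50000):
    p = two_proportion_p_value(0.220, n, 0.201, n)
    verdict = "significant" if p < 0.05 else "nonsignificant"
    print(f"n per group = {n:5d}: p = {p:.3g} ({verdict} at alpha = .05)")
```

The relative risk stays at about 1.09 throughout; only the verdict changes as n grows.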
-
strong relationships can be "statistically nonsignificant"
as a result of a small sample size
-
hypothetical example - weights of 2 randomly selected
cats and 2 randomly selected dogs from animal shelter
-
cats = 8 and 12 lbs., mean = 10
-
dogs = 30 and 55 lbs., mean = 42.5
-
difference in means = 32.5 lbs., but p = .12
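The cat and dog figures can be checked by hand. The sketch below assumes an equal-variance (pooled) two-sample t-test; the notes do not name the test used, but this choice reproduces the reported p of about .12. With 2 degrees of freedom the t distribution has a closed-form CDF, so no statistics package is needed. Function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

cats = [8.0, 12.0]    # weights in lbs., from the example above
dogs = [30.0, 55.0]

def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(b) - mean(a)) / se, df

def two_sided_p_df2(t):
    """Two-sided p-value for a t statistic; valid ONLY for 2 degrees of freedom."""
    # For df = 2 the CDF is 1/2 + t / (2 * sqrt(t^2 + 2)).
    return 1 - abs(t) / sqrt(t * t + 2)

t, df = pooled_t(cats, dogs)
p = two_sided_p_df2(t)
print(f"difference in means = {mean(dogs) - mean(cats)} lbs.")
print(f"t = {t:.2f}, df = {df}, p = {p:.2f}")
```

A 32.5 lb. difference is large by any practical standard, yet with only 2 degrees of freedom the test cannot rule out sampling error.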
-
more likely to observe statistically significant result
with greater number of significance tests
-
if alpha = .05, then expect 1 out of 20 tests to be significant
by chance even when every null hypothesis is true
-
hypothetical example: randomized experiment to evaluate
a curriculum designed to promote productive, healthy lifestyles
-
students randomly assigned to receive the curriculum or
not
-
25 different outcomes thought to be affected by curriculum
(e.g., criminal behavior, drug use, employment, STD infection, volunteering,
quality of family relationships, etc.)
-
2 of 25 hypothesis tests p < .05 in favor of curriculum
-
significance testing promotes fishing expeditions - conducting
many significance tests and then reporting just the significant ones
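The curriculum example can be put in numbers. The sketch below assumes all 25 null hypotheses are true and that the 25 tests are independent (a simplifying assumption; outcomes such as drug use and criminal behavior are probably correlated), so the count of significant results follows a binomial distribution. The function name is hypothetical.

```python
from math import comb

def prob_at_least(k, m, alpha=0.05):
    """P(at least k of m independent tests reach p < alpha) when every null is true."""
    p_fewer_than_k = sum(
        comb(m, i) * alpha**i * (1 - alpha) ** (m - i) for i in range(k)
    )
    return 1 - p_fewer_than_k

m = 25
expected_false_positives = m * 0.05
p_at_least_1 = prob_at_least(1, m)
p_at_least_2 = prob_at_least(2, m)
print(f"expected significant results by chance: {expected_false_positives:.2f}")
print(f"P(at least 1 of {m} significant) = {p_at_least_1:.2f}")
print(f"P(at least 2 of {m} significant) = {p_at_least_2:.2f}")
```

Under these assumptions, 2 or more significant results out of 25 would occur in roughly a third of studies of a curriculum with no effect at all, so the finding is weak evidence by itself.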
When are significance tests useful and legitimate?
when a single study must guide immediate action (e.g.,
court decision, clinical trial of a drug/procedure for serious disease,
predicting winner of election on election night, etc.)
Major problems with significance testing
1) confuses "statistical significance" with practical
significance
2) reduces quantitative data and analysis
to a yes/no decision
-
nature is not black and white and neither should our conclusions
be
3) virtually all relationships/differences are
nonzero
-
null hypothesis is almost never true
4) one study does not resolve a question of any
scientific or practical importance
bottom line: p-values and significance testing irrelevant
in most cases
appropriate action: focus on descriptive statistics,
CIs, and pattern of results across studies