Regression and Correlation
notes about Utts
- ignore statistical significance for now (goes with inferential statistics)
- measurement variables = interval scale
bivariate relationships between interval scale variables
first step: display data w/ scatterplot
- x-axis = independent variable
- y-axis = dependent variable
- direction of relationship: positive (up to right), negative (down to right)
Simple linear regression
goal: summarize relationship between X and Y -- use X to predict Y
regression equation: Y = a + bX
- y-intercept (a) - value of Y where line crosses Y-axis (i.e., when X = 0)
- slope (b) - rise/run - change in Y with a unit change in X
- regression line minimizes sum of squared errors from predicted values (least squares line)
- sum of squared errors = SSE
- errors in prediction = residuals
- estimating Y given X: plug value of X into equation to predict Y
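A minimal sketch of the least squares line described above, in plain Python. The data are made up for illustration, and the names fit_line and sse are hypothetical helpers, not from the notes.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-products of deviations / sum of squared X deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x  # the line passes through (mean X, mean Y)
    return a, b


def sse(xs, ys, a, b):
    """Sum of squared residuals (observed Y minus predicted Y)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustration data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
predicted_at_6 = a + b * 6  # plug X = 6 into the equation to predict Y

print(round(a, 2), round(b, 2))     # 2.2 0.6
print(round(sse(xs, ys, a, b), 2))  # 2.4
```

Any other intercept or slope gives a larger SSE for these data, which is what "least squares line" means.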
Issues to consider for regression
- nonlinear relationships - linear regression is based on an assumed linear association
- homoscedasticity/heteroscedasticity - errors in prediction roughly constant (homo) or variable (hetero) over range of X values
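One informal way to look for heteroscedasticity is to compare the size of the residuals at low and high values of X. The sketch below is a rough check on made-up data, not a formal test. The split into halves and the helper names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mean_squared_residual(points, a, b):
    """Average squared error in prediction for a set of (x, y) points."""
    return sum((y - (a + b * x)) ** 2 for x, y in points) / len(points)


# Made-up data: scatter around the line grows as X increases
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.1, 3.9, 6.0, 5.0, 9.0, 6.0]

a, b = fit_line(xs, ys)
points = sorted(zip(xs, ys))          # order cases by X
half = len(points) // 2
low_spread = mean_squared_residual(points[:half], a, b)   # low-X half
high_spread = mean_squared_residual(points[half:], a, b)  # high-X half

# A ratio far from 1 suggests the errors are not constant over the range of X
ratio = high_spread / low_spread
print(ratio > 1)
```

In practice a plot of residuals against X is the usual first look; the ratio is only a numeric summary of the same idea.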
r^2 as a PRE measure - strength of association
Proportional Reduction in Error (PRE)
PRE = (E1 - E2) / E1
- E1 = errors in predicting dependent variable (d.v.) when independent variable (i.v.) is ignored
- E2 = errors in predicting d.v. when prediction is based on i.v.
- how much can we reduce errors in predicting d.v. by considering i.v.?
- 100% reduction (1.0) - perfect prediction
- 0% reduction (0.0) - i.v. does not help in predicting
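The PRE formula is simple arithmetic. A small sketch, with error totals chosen only to show the two endpoints and one value in between (the function name pre is hypothetical):

```python
def pre(e1, e2):
    """Proportional reduction in error: (E1 - E2) / E1."""
    return (e1 - e2) / e1


print(pre(100, 0))    # 1.0  -> perfect prediction, all error removed
print(pre(100, 100))  # 0.0  -> i.v. does not help at all
print(pre(100, 25))   # 0.75 -> errors reduced by 75%
```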
r^2:
E1: predict values of Y based on Y bar (mean)
- sum of (Y - Y bar)^2 (deviation of observed Y from mean, squared)
E2: predict Y based on X and regression line
- sum of (Y - Y hat)^2 (deviation of observed Y from regression line, squared)
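To make the two error totals concrete, the sketch below computes E1, E2 and r^2 for a small made-up data set. The function name r_squared is a hypothetical helper; the least squares formulas are the standard ones.

```python
def r_squared(xs, ys):
    """r^2 as a PRE measure: (E1 - E2) / E1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # least squares line Y hat = a + bX
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # E1: errors when every Y is predicted by the mean of Y
    e1 = sum((y - mean_y) ** 2 for y in ys)
    # E2: errors when Y is predicted from X with the regression line
    e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return e1, e2, (e1 - e2) / e1


# Made-up illustration data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

e1, e2, r2 = r_squared(xs, ys)
print(round(e1, 2), round(e2, 2), round(r2, 2))  # 6.0 2.4 0.6
```

Here knowing X removes 60% of the squared error made when predicting from the mean alone.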
Correlation: r
Pearson correlation coefficient or product-moment coefficient
indicates how closely observed values cluster about the regression line
r = square root of r^2 and takes sign of slope
ranges between -1 and 1
- negative r = negative relationship
- positive r = positive relationship
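A sketch of r computed directly from deviations, on made-up data, to show that its sign follows the direction of the relationship and that squaring it gives r^2. The function name pearson_r is a hypothetical helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustration data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_pos = pearson_r(xs, ys)                   # Y rises with X
r_neg = pearson_r(xs, [-y for y in ys])     # same data with Y flipped

print(round(r_pos, 3), round(r_neg, 3))
print(round(r_pos ** 2, 2))                 # 0.6, the r^2 for these data
```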
strength of r
which is stronger: -.2 or +.1? -.5 or +.75?
absolute value of r indicates strength (so -.2 is stronger than +.1, and +.75 is stronger than -.5)
general guide for interpreting strength of r (absolute value)
0 - .2 = weak, slight
.2 - .4 = mild/modest
.4 - .6 = moderate
.6 - .8 = moderately strong
.8 - 1.0 = strong
r standardizes the degree of association, regardless of units of measurement
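A quick check that r does not depend on units: the sketch below converts made-up heights and weights from inches and pounds to centimeters and kilograms and compares the two correlations. The data and the helper pearson_r are for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustration data (not from the notes)
height_in = [60, 62, 65, 68, 70]
weight_lb = [110, 120, 140, 155, 170]

# Same cases, different units of measurement
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.4536 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_cm, weight_kg)

print(abs(r_imperial - r_metric) < 1e-9)  # True
```

The slope b would change with the units; r does not.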
r appropriate for describing linear relationships only
restricted range on one or both variables attenuates correlation
outliers influence correlation, too
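Both effects can be seen on a small made-up data set: keeping only the middle of the X range lowers r, and adding a single far-off case changes r sharply. The data, cutoffs and the helper pearson_r are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: Y tracks X with a little scatter
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_full = pearson_r(xs, ys)

# Restricted range: keep only cases with X between 4 and 7
kept = [(x, y) for x, y in zip(xs, ys) if 4 <= x <= 7]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

# Outlier: one extra case far from the rest of the data
r_outlier = pearson_r(xs + [20], ys + [0])

print(round(r_full, 2), round(r_restricted, 2), round(r_outlier, 2))
```

With these numbers the restricted-range r is somewhat lower than the full-range r, and the one outlier pulls r close to zero.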
Ecological correlation
correlation between rates or averages
units of analysis = some kind of aggregate (e.g., neighborhoods, companies, states, countries)
interpret carefully; may inflate degree of association between underlying conceptual variables
- e.g., 1970 Census data - income x education for individual men, r = .4
- mean income x mean education for each of 9 regions, r = .7
- means/rates for aggregates eliminate spread and thus increase clustering about line
- case of Simpson's paradox - aggregation across natural units changes nature of association
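The inflation from aggregation can be reproduced on a toy example. The sketch below uses made-up cases in three hypothetical groups (not the Census data cited above) and compares r for individuals with r for the group means.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up (x, y) cases in three hypothetical groups
groups = {
    "A": [(1, 3), (2, 1), (3, 2)],
    "B": [(4, 6), (5, 4), (6, 5)],
    "C": [(7, 9), (8, 7), (9, 8)],
}

# Individual-level correlation: every case is a unit of analysis
cases = [p for pts in groups.values() for p in pts]
r_individual = pearson_r([x for x, _ in cases], [y for _, y in cases])

# Ecological correlation: each group mean is a unit of analysis
mean_x = [sum(x for x, _ in pts) / len(pts) for pts in groups.values()]
mean_y = [sum(y for _, y in pts) / len(pts) for pts in groups.values()]
r_aggregate = pearson_r(mean_x, mean_y)

print(round(r_individual, 2), round(r_aggregate, 2))  # 0.85 1.0
```

Averaging within groups removes the within-group scatter, so the group means fall exactly on a line even though the individual cases do not.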
Nonlinear relationships and transformations
one response: transform the data on one or both variables to make relationship more linear
- many different transformations depending on the circumstances - logarithms, square roots, squaring, raising to higher powers, etc.
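As an illustration of a logarithmic transformation, the sketch below uses made-up data in which Y doubles with each unit of X. The raw relationship is curved, but log Y against X is a straight line. The choice of base 2 and the helper pearson_r are assumptions for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: Y doubles with every unit increase in X
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 8, 16, 32, 64]

r_raw = pearson_r(xs, ys)                  # linear fit to a curved pattern

log_ys = [math.log2(y) for y in ys]        # transform the dependent variable
r_log = pearson_r(xs, log_ys)              # relationship is now linear

print(round(r_raw, 2), round(r_log, 2))    # 0.91 1.0
```

Which transformation helps depends on the shape of the curve; a scatterplot of the transformed data is the check.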