Regression and Correlation
bivariate relationships between interval scale variables
goals:
display associations graphically
predict d.v. from i.v.
measure strength and direction of association
use scatterplot to display bivariate interval data
X-axis = i.v., Y-axis = d.v.
- direction: positive (up to the right), negative (down to the right)
using X to predict Y
- line summarizes relationship
simple linear regression equation: Y = a + bX
- a = intercept: value of Y where line crosses Y-axis (i.e., when X = 0)
- b = slope: change in Y with a unit change in X (rise/run)
- regression line minimizes the sum of squared errors (least squares line)
- SSE = sum of squared errors = sum of (Y - Y hat)^2
- errors in prediction = residuals (observed Y minus predicted Y)
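A minimal sketch of the least squares line, assuming the standard closed-form formulas for slope and intercept (the notes do not state them). The data values and the names fit_line and sse are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-products of deviations over sum of squared X deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the least squares line passes through (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b


def sse(xs, ys, a, b):
    """Sum of squared errors (squared residuals) for the line Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not taken from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
best_sse = sse(xs, ys, a, b)

print("intercept a:", round(a, 3))
print("slope b:", round(b, 3))
print("SSE:", round(best_sse, 3))
```

Any other line through the same data, for example one with a slightly different slope or intercept, should give a larger SSE than the fitted line.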
Issues to consider:
influence of outliers - outliers can suppress or drive relationships
nonlinear relationships - linear regression is based on the assumption of linearity
extrapolation - predicting Y from the regression equation for X values outside the observed range of X - be careful!
heteroscedasticity - spread of the errors in prediction varies over the range of X values
Correlation
strength of association - degree of clustering about line
r^2 as a PRE measure - strength of association
Proportional Reduction in Error (PRE)
PRE = (E1 - E2) / E1
- E1 = errors in predicting d.v. based on distribution of d.v.
- E2 = errors in predicting d.v. when prediction is based on i.v.
- how much can we reduce errors in predicting d.v. by considering i.v.?
- 100% reduction (1.0) - perfect prediction
- 0% reduction (0.0) - i.v. does not help in predicting
r^2:
E1: predict values of Y based on Y bar (mean)
- sum of (Y - Y bar)^2 (deviation of observed Y from mean, squared)
E2: predict Y based on X and regression line
- sum of (Y - Y hat)^2 (deviation of observed Y from regression line, squared)
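The PRE logic for r^2 can be sketched as follows. This is an illustration under assumptions: the data are hypothetical, the function name pre_r_squared is made up for this example, and the slope and intercept use the standard least squares formulas.

```python
def pre_r_squared(xs, ys):
    """r^2 computed as a proportional reduction in error: (E1 - E2) / E1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Least squares line Y hat = a + bX.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar

    # E1: squared errors when every case is predicted to be at the mean of Y.
    e1 = sum((y - y_bar) ** 2 for y in ys)
    # E2: squared errors when each case is predicted from the regression line.
    e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (e1 - e2) / e1


# Hypothetical data, not taken from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_squared = pre_r_squared(xs, ys)
print("r^2 as PRE:", round(r_squared, 3))
```

For these made-up values E1 is 6 and E2 is 2.4, so knowing X removes 60% of the squared error in predicting Y.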
Correlation: r
Pearson correlation coefficient or product-moment coefficient
indicates how closely observed values cluster about the regression line, and the direction of association
r = square root of r^2 and takes sign of slope
ranges between -1 and 1
- negative r = negative relationship
- positive r = positive relationship
strength of r
which is stronger:
-.2 or +.1? -.5 or +.75?
absolute value of r indicates strength
general guide for interpreting strength of r (absolute value)
0 - .2 = weak, slight
.2 - .4 = mild/modest
.4 - .6 = moderate
.6 - .8 = moderately strong
.8 - 1.0 = strong
r standardizes the degree of association, regardless of units of measurement
one approach to computing r:
standardize each case's value on X and Y: zX = (X - X bar) / sX, zY = (Y - Y bar) / sY (s = sample standard deviation, computed with n - 1)
for each case, multiply standardized X value by standardized Y value, sum the products, and divide by n - 1
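The z-score approach above can be sketched like this, assuming the sample standard deviation (n - 1 in the denominator) is used to standardize. The data and the function name pearson_r_from_z are hypothetical.

```python
import math


def pearson_r_from_z(xs, ys):
    """Pearson r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Standardize each case's value on X and Y.
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    # Multiply the paired z-scores, sum the products, divide by n - 1.
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


# Hypothetical data, not taken from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r_from_z(xs, ys)
# Changing the units of X (e.g., multiplying by 100) leaves r unchanged.
r_rescaled = pearson_r_from_z([100 * x for x in xs], ys)

print("r:", round(r, 3))
print("r after rescaling X:", round(r_rescaled, 3))
```

Because every value is converted to standard units first, rescaling either variable does not change r, which is what "regardless of units of measurement" refers to.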
r is appropriate for describing linear relationships only
restricted range on one or both variables attenuates correlation
outliers influence correlation, too
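A small illustration of how a single outlier can drive a correlation. The data are hypothetical and the helper pearson_r is written here only for the example; it uses the deviation-score form of r, which is equivalent to the z-score approach.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from sums of deviation cross-products and squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with no linear association.
xs = [1, 2, 3, 4]
ys = [1, 2, 2, 1]
r_without = pearson_r(xs, ys)

# The same data plus one extreme case at (10, 10).
r_with = pearson_r(xs + [10], ys + [10])

print("r without outlier:", round(r_without, 3))
print("r with outlier:", round(r_with, 3))
```

Here one added case turns a correlation of zero into a strong positive one. An outlier placed against the trend can have the opposite effect and suppress a relationship that holds for the other cases.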
Ecological correlation
correlation between rates or averages
units of analysis = some kind of aggregate (e.g., neighborhoods, companies, states, countries)
interpret carefully; may inflate degree of association between underlying conceptual variables
- e.g., 1970 Census data - income x education for individual men, r = .4
- mean income x mean education for each of 9 regions, r = .7
- means/rates for aggregates eliminate spread and thus increase clustering about line
- case of Simpson's paradox - aggregation across natural units changes nature of association
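The inflation that comes from aggregating can be sketched with made-up numbers. The cases below are hypothetical (they are not the Census figures cited above), and the group labels, variable names, and the helper pearson_r are assumptions for illustration only.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson r from sums of deviation cross-products and squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical cases: (group, education, income) in arbitrary units.
cases = [
    ("A", 1, 3), ("A", 3, 1), ("A", 2, 2),
    ("B", 3, 5), ("B", 5, 3), ("B", 4, 4),
    ("C", 5, 7), ("C", 7, 5), ("C", 6, 6),
]

# Correlation with individuals as the units of analysis.
r_individual = pearson_r([c[1] for c in cases], [c[2] for c in cases])

# Correlation with group means as the units of analysis.
by_group = defaultdict(list)
for group, education, income in cases:
    by_group[group].append((education, income))
mean_education = [sum(e for e, _ in rows) / len(rows) for rows in by_group.values()]
mean_income = [sum(i for _, i in rows) / len(rows) for rows in by_group.values()]
r_ecological = pearson_r(mean_education, mean_income)

print("individual-level r:", round(r_individual, 3))
print("group-mean r:", round(r_ecological, 3))
```

Averaging removes the spread within each group, so the group means cluster more tightly about the line than the individual cases do.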
Nonlinear relationships and transformations
one response: transform the data on one or both variables to make relationship more linear
many different transformations depending on the circumstances - logarithms, square roots, squaring, raising to higher powers, etc.
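A sketch of one such transformation, assuming a Y that grows exponentially with X so that a logarithm is the suitable choice. The data are hypothetical and the helper pearson_r is written only for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from sums of deviation cross-products and squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: Y doubles with each unit increase in X.
xs = list(range(8))
ys = [2 ** x for x in xs]

r_raw = pearson_r(xs, ys)

# Taking the logarithm of Y turns the curve into a straight line.
log_ys = [math.log(y) for y in ys]
r_log = pearson_r(xs, log_ys)

print("r with raw Y:", round(r_raw, 3))
print("r with log(Y):", round(r_log, 3))
```

The relationship is perfectly regular in both versions, but r only reflects that once the transformation has made it linear.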
Investigating spurious, intervening, & conditional relationships with correlation
correlation of interest: rAB
- spurious relationship (C --> A, C --> B, no causal link between A & B)
  - rCA & rCB should be > rAB
- intervening relationship (A --> C --> B)
  - rAC & rCB should be > rAB
  - e.g., # bedrooms --> house value --> taxes on house
- in this approach the pattern of correlations is the same in both cases, so theory is used to distinguish spurious from intervening relationships
- conditional/interactive relationship
  - rAB higher for some values of C than other values of C
  - compute rAB for cases defined by particular ranges of C
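Checking for a conditional relationship can be sketched by computing rAB separately within ranges of C. Everything here is hypothetical: the cases, the cutoff of 20 on C, and the helper names pearson_r and r_ab_where.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from sums of deviation cross-products and squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_ab_where(cases, keep):
    """rAB for the cases whose C value satisfies the predicate keep."""
    subset = [(a, b) for a, b, c in cases if keep(c)]
    return pearson_r([a for a, _ in subset], [b for _, b in subset])


# Hypothetical cases: (A, B, C).
cases = [
    (1, 2, 10), (2, 1, 12), (3, 2, 15), (4, 1, 18),   # low C
    (1, 1, 30), (2, 3, 33), (3, 4, 36), (4, 6, 39),   # high C
]

r_all = r_ab_where(cases, lambda c: True)
r_low_c = r_ab_where(cases, lambda c: c < 20)    # assumed cutoff
r_high_c = r_ab_where(cases, lambda c: c >= 20)

print("rAB, all cases:", round(r_all, 3))
print("rAB, low C:", round(r_low_c, 3))
print("rAB, high C:", round(r_high_c, 3))
```

In this made-up example the overall rAB is moderate, but it hides a strong positive association among high-C cases and a weak negative one among low-C cases.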
more sophisticated techniques exist for evaluating these interpretations; these are simple approaches