Regression and Correlation
bivariate relationships between
interval scale variables
goals:
display associations graphically
predict d.v. from i.v.
measure strength and direction
of association
use scatterplot to display
bivariate interval data
X-axis = i.v., Y-axis = d.v.

direction: positive (up to
the right), negative (down to the right)
using X to predict Y

line summarizes relationship
simple linear regression
equation: Y = a + bX

a = intercept: value of Y when line crosses
Y-axis (i.e., when X = 0)

b = slope: change in Y with a unit change
in X (rise/run)

regression line minimizes
squared errors (least squares line)

SSE = sum of squared errors

errors in prediction = residuals
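
As a minimal sketch of the least squares idea above, the following computes a and b for Y = a + bX, then the residuals and SSE. The data values and the function name fit_line are made up for illustration; the slope and intercept formulas are the standard least squares ones, which these notes do not spell out.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-products of deviations over sum of squared X deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the least squares line passes through (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
predicted = [a + b * x for x in xs]                          # Y hat for each case
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]   # errors in prediction
sse = sum(e ** 2 for e in residuals)                         # sum of squared errors

print(f"a = {a:.2f}, b = {b:.2f}, SSE = {sse:.2f}")
```

Any other choice of a and b would give a larger SSE for these data, which is what "least squares line" means.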
Issues to consider:
influence of outliers: outliers
can suppress or drive relationships
nonlinear relationships: linear regression
is based on the assumption of linearity
extrapolation: predicting
y based on regression equation and x outside of observed range of x values;
be careful!
heteroscedasticity: errors in prediction vary over range
of values
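
To illustrate the outlier issue listed above, here is a small sketch with made-up data in which five cases show no linear association at all, and adding one extreme case drives the correlation close to 1. The helper pearson_r and all values are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: no linear relationship among these five cases.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 2, 1, 3]

r_without = pearson_r(xs, ys)
# Add a single outlier at (15, 15), far from the rest of the data.
r_with = pearson_r(xs + [15], ys + [15])

print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
```

A scatterplot would show the problem immediately, which is one reason to plot the data before interpreting a regression line.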
Correlation
strength of association:
degree of clustering about line
r^{2} as a PRE measure
of strength of association
Proportional Reduction
in Error (PRE)
PRE = (E_{1} - E_{2})
/ E_{1}

E_{1} = errors in
predicting d.v. based on distribution of d.v.

E_{2} = errors in
predicting d.v. when prediction is based on i.v.

how much can we reduce errors
in predicting d.v. by considering i.v.?

100% reduction (1.0): perfect
prediction

0% reduction (0.0): i.v.
does not help in predicting
r^{2}:
E_{1}: predict values of Y
based on Y bar (mean)

sum of (Y - Y bar)^{2}
(deviation of observed Y from mean, squared)
E_{2}: predict Y
based on X and regression line

sum of (Y - Y hat)^{2}
(deviation of observed Y from regression line, squared)
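
The PRE logic for r^{2} can be checked numerically. In this sketch (made-up data, hypothetical variable names), E_{1} is the sum of squared deviations from Y bar, E_{2} is the sum of squared deviations from the regression line, and the resulting PRE is compared with the squared Pearson correlation.

```python
import math

# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Least squares line Y hat = a + bX.
b = sxy / sxx
a = y_bar - b * x_bar

# E1: errors when every Y is predicted by the mean of Y.
e1 = sum((y - y_bar) ** 2 for y in ys)
# E2: errors when Y is predicted from X with the regression line.
e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

pre = (e1 - e2) / e1
r = sxy / math.sqrt(sxx * syy)

print(f"E1 = {e1:.2f}, E2 = {e2:.2f}, PRE = {pre:.2f}, r^2 = {r ** 2:.2f}")
```

For these data, using X removes 60% of the squared error made by predicting from the mean alone, and that proportion equals r^{2}.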
Correlation: r
Pearson correlation coefficient
or product-moment coefficient
indicates how closely observed
values fall around regression line/clustering about line & direction
of association
r = square root of r^{2}
and takes sign of slope
ranges between -1 and 1

negative r = negative relationship

positive r = positive relationship
strength of r
which is stronger:
-.2 or +.1? -.5 or +.75?
absolute value of r indicates
strength
general guide for interpreting
strength of r (absolute value)
0 - .2 = weak, slight
.2 - .4 = mild/modest
.4 - .6 = moderate
.6 - .8 = moderately strong
.8 - 1.0 = strong
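
The guide above can be written as a small lookup. This is only a sketch: the function name describe_strength is hypothetical, and because the ranges in the guide share endpoints, it assumes a boundary value (such as .2) belongs to the higher category.

```python
def describe_strength(r):
    """Label the strength of a correlation using the absolute value of r."""
    size = abs(r)  # the sign gives direction only, not strength
    if size > 1:
        raise ValueError("r must lie between -1 and 1")
    if size < 0.2:
        return "weak, slight"
    if size < 0.4:
        return "mild/modest"
    if size < 0.6:
        return "moderate"
    if size < 0.8:
        return "moderately strong"
    return "strong"


# A negative r can be stronger than a positive one.
for r in (-0.2, 0.1, -0.5, 0.75):
    print(r, describe_strength(r))
```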
r standardizes the degree
of association, regardless of units of measurement
one approach to computing
r:
standardize each case's value
on x and y: z = (value - mean) / standard deviation
for each case, multiply standardized
X value by standardized Y value, sum the products, and divide by
n - 1
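
A minimal sketch of the standardizing approach just described, using made-up data. The function names are hypothetical, and the standard deviation is assumed to be the sample version (dividing by n - 1), which is what makes the final division by n - 1 work out.

```python
import math


def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def r_from_z_scores(xs, ys):
    """Sum the products of paired z-scores and divide by n - 1."""
    zx = standardize(xs)
    zy = standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = r_from_z_scores(xs, ys)
# Changing the units of X (e.g., multiplying by 100) leaves r unchanged.
r_rescaled = r_from_z_scores([x * 100 for x in xs], ys)

print(f"r = {r:.3f}, r with rescaled X = {r_rescaled:.3f}")
```

Because every value is converted to standard deviation units first, r does not depend on the original units of measurement.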
r is appropriate for describing
linear relationships only
restricted range on one or
both variables attenuates correlation
outliers influence correlation,
too
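
The effect of a restricted range can be seen in a small sketch with made-up data: Y follows X with a fixed amount of scatter, and r is computed once over the full range of X and once over a narrow slice of it. The helper pearson_r, the data, and the cutoffs 8 and 13 are all assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: Y = X plus scatter of +3 or -3 around the line.
xs = list(range(1, 21))
ys = [x + (3 if x % 2 == 0 else -3) for x in xs]

r_full = pearson_r(xs, ys)

# Keep only cases with X between 8 and 13 (a restricted range).
kept = [(x, y) for x, y in zip(xs, ys) if 8 <= x <= 13]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

The scatter around the line is the same in both cases; only the spread of X shrinks, so the same scatter looks large relative to the trend.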
Ecological correlation
correlation between rates
or averages
units of analysis = some
kind of aggregate (e.g., neighborhoods, companies, states, countries)
interpret carefully; may
inflate degree of association between underlying conceptual variables

e.g., 1970 Census data:
income x education for individual men, r = .4

mean income x mean education
for each of 9 regions, r = .7

means/rates for aggregates
eliminate spread and thus increase clustering about line

case of Simpson's paradox:
aggregation across natural units changes nature of association
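
The way averaging removes spread can be sketched with made-up data (these are not the Census figures cited above). Three hypothetical regions have individuals scattered widely around their region's means; the correlation is computed once across individuals and once across the region means.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (education, income) scores: each region has a center, and
# four individuals scattered around it with no within-region association.
offsets = [(8, -8), (-8, 8), (8, 8), (-8, -8)]
regions = {
    center: [(center + dx, center + dy) for dx, dy in offsets]
    for center in (10, 20, 30)
}

# Correlation across individuals.
individuals = [case for cases in regions.values() for case in cases]
r_individual = pearson_r([e for e, _ in individuals], [i for _, i in individuals])

# Ecological correlation: correlate the region means instead.
mean_education = [sum(e for e, _ in cases) / len(cases) for cases in regions.values()]
mean_income = [sum(i for _, i in cases) / len(cases) for cases in regions.values()]
r_ecological = pearson_r(mean_education, mean_income)

print(f"individuals: r = {r_individual:.2f}, region means: r = {r_ecological:.2f}")
```

The region means fall exactly on a line because the individual scatter averages out, so the ecological correlation overstates how well education predicts income for any one person.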
Nonlinear relationships
and transformations
one response: transform the
data on one or both variables to make relationship more linear
many different transformations
depending on the circumstances: logarithms, square roots, squaring, raising
to higher powers, etc.
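
As a sketch of how a transformation can straighten a relationship, the made-up data below have Y doubling with each unit of X. The correlation is computed on the raw values and again after taking the logarithm of Y. The helper pearson_r and the data are assumptions for illustration; which transformation suits real data depends on the shape of the scatterplot.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: Y doubles every time X increases by 1 (a curved pattern).
xs = list(range(1, 9))
ys = [2 ** x for x in xs]

r_raw = pearson_r(xs, ys)

# Transform Y with a logarithm; log(2^X) = X * log(2), a straight line in X.
log_ys = [math.log(y) for y in ys]
r_log = pearson_r(xs, log_ys)

print(f"raw: r = {r_raw:.3f}, after log transform: r = {r_log:.3f}")
```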
Investigating spurious, intervening, & conditional
relationships with correlation
correlation of interest rAB

spurious relationship (C -> A, C -> B, no causal link
between A & B)

rCA & rCB should be > rAB

intervening relationship (A -> C -> B)

rAC & rCB should be > rAB

# bedrooms -> house value -> taxes on house

in this approach, can use theory to distinguish spurious
and intervening relationships

conditional/interactive relationship

rAB higher for some values of C than other values of C

compute rAB for cases defined by particular ranges of
C
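
The simple check for a conditional relationship can be sketched as follows: split the cases by ranges of C and compute rAB within each range. The data, the helper pearson_r, and the cutoff of 5 on C are all made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical cases as (A, B, C) triples.
cases = [
    (1, 3, 2), (2, 1, 3), (3, 2, 1), (4, 1, 4), (5, 3, 2),   # low values of C
    (1, 2, 7), (2, 4, 8), (3, 5, 6), (4, 4, 9), (5, 5, 7),   # high values of C
]


def r_ab_within(cases, low, high):
    """Compute rAB using only cases with low <= C < high."""
    subset = [(a, b) for a, b, c in cases if low <= c < high]
    return pearson_r([a for a, _ in subset], [b for _, b in subset])


r_low_c = r_ab_within(cases, 0, 5)     # cases with C below the cutoff
r_high_c = r_ab_within(cases, 5, 10)   # cases with C at or above the cutoff

print(f"rAB when C is low:  {r_low_c:.2f}")
print(f"rAB when C is high: {r_high_c:.2f}")
```

Here A and B are unrelated when C is low but clearly related when C is high, which is the pattern a conditional relationship predicts.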
more sophisticated techniques exist for evaluating these
interpretations; these are simple approaches