Regression and Correlation

• bivariate relationships between interval scale variables

goals:

• display associations graphically

• predict d.v. (dependent variable) from i.v. (independent variable)

• measure strength and direction of association

• use scatterplot to display bivariate interval data

• X-axis = i.v., Y-axis = d.v.
• direction: positive (up to the right), negative (down to the right)
• using X to predict Y

• line summarizes relationship
• simple linear regression equation: Y = a + bX
• a = Y-intercept
• value of Y when line crosses Y-axis (i.e., when X = 0)
• b = slope

• change in Y with a unit change in X (rise/run)
• regression line minimizes the sum of squared errors (least squares line)
• SSE = sum of squared errors
• MSE = mean squared error
• errors in prediction = residuals
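
As a rough illustration of the least squares line described above, here is a minimal Python sketch. The data values are made up, the function name fit_line is hypothetical, and dividing SSE by n - 2 to get MSE is an assumption (some texts divide by n).

```python
import numpy as np

def fit_line(x, y):
    """Least squares fit of Y = a + bX; returns a, b, residuals, SSE, MSE."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # slope: co-variation of X and Y divided by variation in X
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # intercept: the least squares line passes through (X bar, Y bar)
    a = y_bar - b * x_bar
    # residuals: observed Y minus predicted Y
    residuals = y - (a + b * x)
    sse = np.sum(residuals ** 2)
    # assumption: MSE = SSE / (n - 2), since two coefficients are estimated
    mse = sse / (len(x) - 2)
    return a, b, residuals, sse, mse

# made-up example data (i.v. = x, d.v. = y)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, residuals, sse, mse = fit_line(x, y)
print(f"Y = {a:.2f} + {b:.2f}X, SSE = {sse:.2f}, MSE = {mse:.2f}")
```

For these made-up values the fitted line is Y = 2.2 + 0.6X, so each one-unit increase in X raises predicted Y by 0.6.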

Issues to consider:

• influence of outliers - outliers can suppress or drive relationships

• nonlinear relationships - linear regression based on assumption of linearity

• extrapolation - predicting y based on regression equation and x outside of observed range of x values - be careful!

• heteroscedasticity - spread of prediction errors varies over the range of X values
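
To make the outlier issue concrete, the sketch below uses made-up data in which X and Y are unrelated, then adds a single extreme case. The numbers are illustrative assumptions only; np.corrcoef is used to get the correlation.

```python
import numpy as np

# made-up data: Y rises then falls, so there is no linear relationship with X
x = np.arange(1, 11, dtype=float)
y = np.array([1, 2, 3, 4, 5, 5, 4, 3, 2, 1], dtype=float)
r_without = np.corrcoef(x, y)[0, 1]

# add one extreme case far from the rest of the data
x_out = np.append(x, 30.0)
y_out = np.append(y, 30.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
```

Here one case is enough to drive an apparently strong positive relationship; an outlier placed off the trend of otherwise well-aligned data can suppress a relationship in the same way.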

Correlation

• strength of association - degree of clustering about line

• r² as a PRE measure - strength of association

Proportional Reduction in Error (PRE)

• PRE = (E1 - E2) / E1

• E1 = errors in predicting d.v. based on distribution of d.v.
• E2 = errors in predicting d.v. when prediction is based on i.v.
• ranges between 0 and 1
• how much can we reduce errors in predicting d.v. by considering i.v.?
• 100% reduction (1.0) - perfect prediction
• 0% reduction (0.0) - i.v. does not help in predicting

r²:

• E1 = predict values of Y based on Y bar (mean)

• sum of (Y - Y bar)²
• E2: predict Y based on X and regression line

• sum of (Y - Y hat)²  (squared deviation of observed Y from regression line)
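
The PRE logic for r² can be sketched in a few lines of Python. This is a minimal illustration on made-up data; np.polyfit is used here only as a convenient way to get the least squares slope and intercept.

```python
import numpy as np

# made-up example data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# least squares line; polyfit returns coefficients highest power first
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# E1: errors when every Y is predicted by the mean of Y
e1 = np.sum((y - y.mean()) ** 2)
# E2: errors when Y is predicted from X with the regression line
e2 = np.sum((y - y_hat) ** 2)

# proportional reduction in error
r_squared = (e1 - e2) / e1
# r is the square root of r squared, with the sign of the slope
r = np.sign(b) * np.sqrt(r_squared)

print(f"E1 = {e1:.2f}, E2 = {e2:.2f}, r squared = {r_squared:.2f}, r = {r:.2f}")
```

With these values E1 = 6.0 and E2 = 2.4, so knowing X reduces prediction error by 60%.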

Correlation: r

• Pearson correlation coefficient or product-moment coefficient

• indicates how closely observed values fall around regression line/clustering about line & direction of association

• r = square root of r² and takes the sign of the slope

• ranges between -1 and 1
• negative r = negative relationship
• positive r = positive relationship
• 0 = no linear relationship
• strength of r

• which is stronger:  -.2 or +.1?  -.5 or +.75?

• absolute value of r indicates strength

• general guide for interpreting strength of r (absolute value)

0 - .2 =  weak, slight

.2 - .4 = mild/modest

.4 - .6 = moderate

.6 - .8 = moderately strong

.8 - 1.0 = strong

• r standardizes the degree of association, regardless of units of measurement

• one approach to computing r:
• standardize each case's value on x and y:
• (X - X bar) / s.d. on X
• (Y - Y bar) / s.d. on Y
• for each case, multiply standardized X value by standardized Y value, sum the products, and divide by n - 1
• r is appropriate for describing linear relationships only
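
The standardized-score approach to computing r can be sketched as follows. The function name pearson_r and the hours/score data are hypothetical; the sample standard deviation (n - 1 in the denominator) is assumed, to match the division by n - 1 in the steps above.

```python
import numpy as np

def pearson_r(x, y):
    """r as the sum of products of standardized scores, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # standardize each case: (value - mean) / sample standard deviation
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (len(x) - 1)

# made-up data: hours studied and a test score
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r_hours = pearson_r(hours, score)
# same data with X expressed in minutes instead of hours
r_minutes = pearson_r([h * 60 for h in hours], score)

print(f"r (hours) = {r_hours:.4f}, r (minutes) = {r_minutes:.4f}")
```

Because each variable is standardized first, changing the units of X from hours to minutes leaves r unchanged.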

• restricted range on one or both variables attenuates correlation

• outliers influence correlation, too
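
Attenuation from a restricted range can be illustrated with made-up data. In this sketch Y follows X with a fixed alternating error of 3 units (an assumption chosen so the example is reproducible), and r is computed first over the full range of X and then over a narrow slice.

```python
import numpy as np

# made-up data: Y tracks X, with an alternating error of +3 / -3
x = np.arange(1, 21, dtype=float)
noise = np.where(x % 2 == 0, 3.0, -3.0)
y = x + noise

# correlation over the full range of X (1 to 20)
r_full = np.corrcoef(x, y)[0, 1]

# correlation when only cases with X between 8 and 12 are observed
keep = (x >= 8) & (x <= 12)
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

The underlying relationship is the same in both cases; only the spread of X changed, yet the restricted sample shows a much weaker correlation.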

Ecological correlation

• correlation between rates or averages

• units of analysis = some kind of aggregate (e.g., neighborhoods, companies, states, countries)

• interpret carefully; may inflate degree of association between underlying conceptual variables

• e.g., 1970 Census data - income x education for individual men, r = .4
• mean income x mean education for each of 9 regions, r = .7
• means/rates for aggregates eliminate spread and thus increase clustering about line
• case of Simpson's paradox - aggregation across natural units changes nature of association
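
The effect of aggregation on r can be sketched with synthetic data. The values below are NOT the Census figures cited above; they are hypothetical numbers built so that individual deviations cancel within each of 9 regions, which makes the contrast easy to see.

```python
import numpy as np

# hypothetical regional means (9 regions); means fall exactly on a line
region_mean_edu = 10 + 0.5 * np.arange(9)
region_mean_inc = 4 * region_mean_edu - 20

# four hypothetical individuals per region; deviations from the regional
# means are unrelated to each other and sum to zero within each region
dev_edu = np.array([2, 2, -2, -2], dtype=float)
dev_inc = np.array([8, -8, 8, -8], dtype=float)
edu = (region_mean_edu[:, None] + dev_edu).ravel()
inc = (region_mean_inc[:, None] + dev_inc).ravel()

# correlation with individuals as units of analysis
r_individuals = np.corrcoef(edu, inc)[0, 1]
# ecological correlation: regions as units of analysis
r_regions = np.corrcoef(region_mean_edu, region_mean_inc)[0, 1]

print(f"individuals: r = {r_individuals:.2f}")
print(f"region means: r = {r_regions:.2f}")
```

Averaging removes the within-region spread, so the regional means cluster far more tightly about the line than the individual cases do.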

Nonlinear relationships and transformations

• one response: transform the data on one or both variables to make relationship more linear

• many different transformations depending on the circumstances - logarithms, square roots, squaring, raising to higher powers, etc.
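
As one hedged example of a transformation, the sketch below uses made-up data in which Y grows exponentially with X, and compares r before and after taking the logarithm of Y. The growth rate of 0.5 is an arbitrary assumption.

```python
import numpy as np

# made-up data: Y grows exponentially with X
x = np.arange(1, 11, dtype=float)
y = np.exp(0.5 * x)

# r on the raw scale understates the (curved) relationship
r_raw = np.corrcoef(x, y)[0, 1]
# after a log transform of Y the relationship is a straight line
r_log = np.corrcoef(x, np.log(y))[0, 1]

print(f"raw Y: r = {r_raw:.3f}")
print(f"log Y: r = {r_log:.3f}")
```

Which transformation helps depends on the shape of the curve; a scatterplot of the transformed data is the usual check.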

Investigating spurious, intervening, & conditional relationships with correlation

• correlation of interest rAB

• spurious relationship (C --> A, C --> B, no causal link between A & B)
• rCA & rCB should be > rAB
• intervening relationship (A --> C --> B)
• rAC & rCB should be > rAB
• 1990 CA Census
• # bedrooms --> house value --> taxes on house
• r's:
• # bedrooms x taxes = .50
• # bedrooms x value = .55
• value x taxes = .81
• in this approach, can use theory to distinguish spurious and intervening relationships
• conditional/interactive relationship
• rAB higher for some values of C than other values of C
• compute rAB for cases defined by particular ranges of C
• more sophisticated techniques exist for evaluating these interpretations; these are simple approaches
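
The simple check for a conditional relationship, computing rAB within ranges of C, might look like the sketch below. The function name r_within_ranges, the cutoff, and all data values are hypothetical and chosen only to make the pattern visible.

```python
import numpy as np

def r_within_ranges(a, b, c, cutoffs):
    """Pearson r between a and b, computed separately within ranges of c."""
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    edges = [-np.inf, *cutoffs, np.inf]
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # select cases whose value on c falls in (lo, hi]
        mask = (c > lo) & (c <= hi)
        r = np.corrcoef(a[mask], b[mask])[0, 1]
        out.append((lo, hi, r))
    return out

# made-up data: 10 cases with low C and 10 cases with high C
a = np.tile(np.arange(1, 11, dtype=float), 2)
# low C: B closely follows A
b_low_c = 2 * a[:10] + np.where(a[:10] % 2 == 0, 1.0, -1.0)
# high C: B rises then falls, unrelated (linearly) to A
b_high_c = np.array([1, 2, 3, 4, 5, 5, 4, 3, 2, 1], dtype=float)
b = np.concatenate([b_low_c, b_high_c])
c = np.array([1.0] * 10 + [2.0] * 10)

results = r_within_ranges(a, b, c, cutoffs=[1.5])
r_low_c, r_high_c = results[0][2], results[1][2]
print(f"rAB for low C = {r_low_c:.2f}, rAB for high C = {r_high_c:.2f}")
```

A large gap between the within-range correlations, as in this made-up case, is what a conditional/interactive relationship looks like under this approach.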