***************************
* Biostatistics 513 * Exercise Set 4, 2002
***************************
1(a) Using Y as the response and NEWIRON as a predictor leads to the logistic
regression model:
logit[ pi(X) ] = b0 + b1*NEWIRON
where pi(X) is the probability that Y=1 for a given covariate value, X.
1(b) For this model the coefficient b1 is a log odds ratio. In fact,
the odds of CHD among high iron (NEWIRON=1) subjects compared to the
odds of CHD among low iron (NEWIRON=0) subjects is given by: exp( b1 ).
Our crude analysis estimates this odds ratio as 1.475 (95% CI = 1.103,
1.973).
1(c) Fitting this model we obtain coefficient estimates, b0 and b1, and their
standard errors:
Logit estimates Number of obs = 908
LR chi2(1) = 6.83
Prob > chi2 = 0.0090
Log likelihood = -595.99383 Pseudo R2 = 0.0057
------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | 1.475 .2187347 2.62 0.009 1.102969 1.972517
------------------------------------------------------------------------------
. logit
Logit estimates Number of obs = 908
LR chi2(1) = 6.83
Prob > chi2 = 0.0090
Log likelihood = -595.99383 Pseudo R2 = 0.0057
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | .388658 .1482947 2.62 0.009 .0980057 .6793102
_cons | -.6418539 .0832934 -7.71 0.000 -.805106 -.4786017
------------------------------------------------------------------------------
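The two tables above are related by exponentiation: the odds ratio column is
just exp() of the coefficient column. A quick numerical check (a Python
sketch, not part of the Stata output):

```python
import math

# The odds-ratio table is the coefficient table exponentiated:
# OR = exp(b1), with CI endpoints exponentiated the same way.
b1, se = 0.388658, 0.1482947            # newiron coef. and std. err.
or_hat = math.exp(b1)
ci = (math.exp(0.0980057), math.exp(0.6793102))  # CI from the log scale

assert abs(or_hat - 1.475) < 0.001
assert abs(ci[0] - 1.102969) < 0.0005
assert abs(ci[1] - 1.972517) < 0.0005
```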
The easiest test of H0: b1=0 is obtained from the Wald statistic in the
second table above. This is (estimate/s.e.) = 0.389/0.148 = 2.62.
Since this test statistic is greater than the alpha=0.05 critical value for a
standard normal (ie. 1.96), we reject the null hypothesis.
This result is also clear from the confidence intervals: the confidence
interval for the odds ratio (first table) does not contain 1.0; and the
confidence interval for the log odds ratio (second table) does not
contain 0.0.
The likelihood ratio test is trickier here -- it is given by the LR chi2(1)
statistic. We find a p-value of 0.009 and therefore would reject H0 at
the 5% significance level.
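As a sketch of how the Wald p-value arises (Python, with the standard normal
tail written via erfc; the numbers are from the coefficient table above):

```python
import math

# Wald test of H0: b1 = 0 using the reported estimate and std. error.
b1, se = 0.388658, 0.1482947
z = b1 / se                             # Wald statistic
p = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value

assert abs(z - 2.62) < 0.005
assert abs(p - 0.009) < 0.001
```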
1(d) Using AGE dummy variables:
AGE(j) = 1 if AGE category=j, and 0 if not AGE category=j
we have the model:
logit[ pi(X) ] = b0 + b1*NEWIRON
+ b2*FEMALE
+ b3*AGE(2)
+ b4*AGE(3)
+ b5*AGE(4)
Here we model the log odds of disease allowing a different baseline
for each age category and each gender, but within age and gender groups
we have the same NEWIRON comparison (ie. there is a "common" NEWIRON odds ratio
across the age and gender groups; alternatively, we can say that this model
assumes no interaction between NEWIRON and either age or gender).
1(e) The coefficient of NEWIRON in this model has the interpretation:
exp( b1 ) is the odds ratio comparing the odds of CHD among the
subjects with high iron (NEWIRON=1) to the odds of a CHD among the
low iron subjects, where the subjects being compared *are of the same
gender and the same age category*. Alternatively, we can say that the
exponentiated NEWIRON coefficient estimate, exp(0.0898)=1.094, is an
estimate of comparison of high iron to low iron controlling for age
and gender.
1(f) The coefficient for the AGE(3) variable in this model is:
exp( b4 ) is the odds ratio comparing the odds of a CHD
among age category 3 subjects to the odds of a CHD among
age category 1 subjects, where the subjects being compared *are of the
same gender and have the same iron consumption (ie. NEWIRON)*. Since
we have NEWIRON and FEMALE in the model, the coefficient for
AGE(3) "controls" for these factors -- that is, compares age=3 to age=1
for fixed values of NEWIRON and FEMALE. This leads to b4 as a comparison
of the log odds of a CHD for:
(NEWIRON=0, FEMALE=0, AGE=3) versus (NEWIRON=0, FEMALE=0, AGE=1), *or*
(NEWIRON=1, FEMALE=0, AGE=3) versus (NEWIRON=1, FEMALE=0, AGE=1), *or*
(NEWIRON=0, FEMALE=1, AGE=3) versus (NEWIRON=0, FEMALE=1, AGE=1), *or*
(NEWIRON=1, FEMALE=1, AGE=3) versus (NEWIRON=1, FEMALE=1, AGE=1).
1(g) (I'll cut-and-paste 1f and substitute AGE(4) for AGE(3)):
exp( b5 ) is the odds ratio comparing the odds of a CHD
among age category 4 subjects to the odds of a CHD among
age category 1 subjects, where the subjects being compared *are of the
same gender and have the same iron consumption (ie. NEWIRON)*. Since
we have NEWIRON and FEMALE in the model, the coefficient for
AGE(4) "controls" for these factors -- that is, compares age=4 to age=1
for fixed values of NEWIRON and FEMALE. This leads to b5 as a comparison
of the log odds of a CHD for:
(NEWIRON=0, FEMALE=0, AGE=4) versus (NEWIRON=0, FEMALE=0, AGE=1), *or*
(NEWIRON=1, FEMALE=0, AGE=4) versus (NEWIRON=1, FEMALE=0, AGE=1), *or*
(NEWIRON=0, FEMALE=1, AGE=4) versus (NEWIRON=0, FEMALE=1, AGE=1), *or*
(NEWIRON=1, FEMALE=1, AGE=4) versus (NEWIRON=1, FEMALE=1, AGE=1).
1(h) We obtain odds ratios after calculating the difference in the log odds
and then exponentiating it:
log odds for group 1 - log odds for group 2 =
log ( odds for group 1 / odds for group 2 ) =
log ( odds ratio comparing group 1 to group 2).
For this example we can proceed as follows:
log odds for (NEWIRON=0,FEMALE=0,AGE=4) = b0 + b5
log odds for (NEWIRON=0,FEMALE=0,AGE=3) = b0 + b4
log odds for (NEWIRON=0,FEMALE=0,AGE=4)
- log odds for (NEWIRON=0,FEMALE=0,AGE=3) = b5 - b4
So that exp( b5 - b4 ) is the odds ratio comparing the odds of disease
among AGE=4 low iron, female subjects to the odds of disease among AGE=3
low iron, female subjects.
The following output is from Stata's "lincom" command, which produces
estimates and confidence intervals for such combinations of parameters:
.
. ***** let's get STATA to calculate the comparison in 1(h)
. lincom _Iage_4 - _Iage_3, or
( 1) - _Iage_3 + _Iage_4 = 0.0
------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .41756 .1144447 -3.19 0.001 .2440179 .7145226
------------------------------------------------------------------------------
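The lincom result can be reproduced by hand from the coefficient estimates of
the adjusted model (shown in 1(i) below); a sketch:

```python
import math

# Odds ratio for AGE=4 vs AGE=3 is exp(b5 - b4), using the
# coefficients from the model with newiron, female, and age dummies.
b4 = 0.8589874    # _Iage_3
b5 = -0.0143397   # _Iage_4
or_43 = math.exp(b5 - b4)

assert abs(or_43 - 0.41756) < 0.0005
```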
1(i) We have the following fitted logistic regression models (here just
showing the "beta" coefficients, not the odds ratios, ie. the exp( beta )'s):
Logit estimates Number of obs = 908
LR chi2(4) = 120.26
Prob > chi2 = 0.0000
Log likelihood = -539.27623 Pseudo R2 = 0.1003
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -1.632189 .1838605 -8.88 0.000 -1.992549 -1.271829
_Iage_2 | .6817876 .2027506 3.36 0.001 .2844036 1.079172
_Iage_3 | .8536424 .1990066 4.29 0.000 .4635965 1.243688
_Iage_4 | -.022819 .2911035 -0.08 0.938 -.5933713 .5477333
_cons | -.6085052 .1579415 -3.85 0.000 -.9180648 -.2989455
------------------------------------------------------------------------------
. lrtest, saving(1)
Logit estimates Number of obs = 908
LR chi2(5) = 120.58
Prob > chi2 = 0.0000
Log likelihood = -539.11767 Pseudo R2 = 0.1006
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | .0898209 .1593785 0.56 0.573 -.2225551 .402197
female | -1.612586 .1870517 -8.62 0.000 -1.979201 -1.245972
_Iage_2 | .6850382 .2028911 3.38 0.001 .2873789 1.082698
_Iage_3 | .8589874 .1993287 4.31 0.000 .4683103 1.249664
_Iage_4 | -.0143397 .2916048 -0.05 0.961 -.5858747 .5571953
_cons | -.6455109 .1712649 -3.77 0.000 -.9811841 -.3098378
------------------------------------------------------------------------------
. lrtest, saving(2)
The likelihood ratio test is obtained by comparing the maximized log
likelihoods:
LR = 2 (logL2 - logL1) = 2( -539.12 - -539.28 ) = 0.32
Since the models being compared only differ by one parameter (one covariate
is excluded -- NEWIRON) the degrees of freedom for this test is df=1.
The null hypothesis is that the coefficient for the variable that is not in the
richer model is 0:
H0: b1=0 (where b1 is the coefficient of NEWIRON)
H1: b1 is not equal to zero.
. lrtest, using(2) model(1)
Logistic: likelihood-ratio test chi2(1) = 0.32
Prob > chi2 = 0.5733
Comparing the LR statistic to a chi2(1) distribution we obtain the p-value,
P[ chi2(1) > 0.32 ] = 0.573. Therefore, we fail to reject H0 and conclude
that there is no significant iron effect, after adjusting for age and gender.
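The arithmetic of the LR test can be sketched as follows (for a chi-square
with 1 df the upper tail probability is erfc(sqrt(x/2))):

```python
import math

# Likelihood ratio test comparing the two nested fits.
logL1 = -539.27623   # model without newiron
logL2 = -539.11767   # model with newiron
lr = 2 * (logL2 - logL1)
p = math.erfc(math.sqrt(lr / 2))   # chi-square(1 df) upper tail

assert abs(lr - 0.32) < 0.005
assert abs(p - 0.5733) < 0.001
```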
1(j) The estimated common odds ratio using logistic regression is exp(.0898)
which is 1.094, 95% CI = ( 0.800, 1.495). Using Mantel-Haenszel we obtain a
similar point estimate and confidence interval: OR estimate = 1.104,
95% CI = (0.809, 1.507). These estimates have exactly the same
interpretation -- the comparison of the odds of disease
for high versus low iron controlling for age and gender.
1(k) A priori we may expect age to be related to disease and may or may not
expect age to be related to iron consumption. If age is not related to iron
then it would be a "precision" variable, while if age were related to both the
outcome (disease) and the exposure (newiron) then it would be a potential
confounder. We do find some evidence for an association between disease and
age (note that 48% of cases are older than 59 while only 42% of controls are
older than 59). Similarly we find some association between age and iron
(46% of low iron subjects are older than 59 while 40% of high iron subjects
are older than 59). However, both associations appear rather small.
To assess the confounding influence we can compare the estimated effect
of interest both adjusting and not adjusting for age:
newiron OR in model with female and age: 1.094
newiron OR in model with female and not age: 1.074
Thus, we see a small change in the estimated odds ratio when we fail to
control for age, indicating that adjusting for age has little qualitative
impact on our conclusions regarding iron.
1(l) Female is clearly important to include in the model. Not only is
gender strongly predictive of the outcome, but it is associated with the
exposure of interest. Furthermore, the estimated crude iron odds ratio
is 1.475 and statistically significant, however as shown in answer to
question 1(k), controlling for gender alone reduces the estimated odds
ratio to only 1.074, and the comparison is no longer statistically
significant. Therefore, since controlling for gender has a large impact
on the estimated iron odds ratio, it is important to include gender in the
logistic regression model.
1(m) In the following model we allow an interaction between newiron
and female:
logit[ pi(X) ] = b0 + b1*NEWIRON
+ b2*FEMALE
+ b3*AGE(2)
+ b4*AGE(3)
+ b5*AGE(4)
+ b6*NEWIRON*FEMALE
log odds (NEWIRON=1, FEMALE=0, AGE=j ) = b0 + b1 + b(j+1)
log odds (NEWIRON=0, FEMALE=0, AGE=j ) = b0 + b(j+1)
log odds (NEWIRON=1, FEMALE=0, AGE=j ) -
log odds (NEWIRON=0, FEMALE=0, AGE=j ) = b1
and
log odds (NEWIRON=1, FEMALE=1, AGE=j ) = b0 + b1 + b2 + b(j+1) + b6
log odds (NEWIRON=0, FEMALE=1, AGE=j ) = b0 + b2 + b(j+1)
log odds (NEWIRON=1, FEMALE=1, AGE=j ) -
log odds (NEWIRON=0, FEMALE=1, AGE=j ) = b1 + b6
Therefore, exp( b1 ) is the odds ratio comparing high iron to low
iron for males (FEMALE=0), controlling for age, and exp( b1+b6 ) is
the odds ratio comparing high iron to low iron for females (FEMALE=1),
controlling for age.
If b6=0 then the OR for males is exp( b1 ) and the OR for females is
then also exp( b1 + 0 ) = exp( b1 ). Therefore, the null hypothesis
of no interaction, or common odds ratio, or homogeneity of odds ratios
is given by:
H0: b6=0
H1: b6 not equal to 0
1(n) Using logistic regression that includes: newiron female i.age
and ironXgen (the product of newiron and female) leads to the
estimate of b6 = 0.451, with a Z statistic (also called Wald statistic)
of 0.451/0.455 = 0.99, p-value=0.321. Therefore, we fail to reject
the null hypothesis that b6=0 and conclude that the age-adjusted OR for
males is not significantly different from the age-adjusted OR for females.
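A sketch of that Wald calculation (the estimate and standard error are as
quoted above):

```python
import math

# Wald test for the interaction coefficient, H0: b6 = 0.
b6, se6 = 0.451, 0.455
z = b6 / se6
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

assert abs(z - 0.99) < 0.005
assert abs(p - 0.321) < 0.005
```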
2(a) Here we have the ordered categorical variable, ALC, that takes the
values 1, 2, 3, 4, representing levels of alcohol use.
The model:
logit[ pi(X) ] = b0 + b1*ALC
can be used to detect a trend in the odds of disease with increasing categories
of ALC.
Recall that earlier our trend test was based on the following:
Let p(j) = the probability that Y=1 (here disease) for category j of a
covariate (here ALC).
H0: p(1)=p(2)=p(3)=p(4)
H1: p(1)<=p(2)<=p(3)<=p(4) with one inequality strictly < , or
p(1)>=p(2)>=p(3)>=p(4) with one inequality strictly >
Notice that the alternative can be viewed in terms of odds ratios:
p(1) <= p(2) <= p(3) <= p(4) implies
p(1)/[1-p(1)] <= p(2)/[1-p(2)] <= p(3)/[1-p(3)] <= p(4)/[1-p(4)]
that is, increasing probabilities implies increasing odds. Then, this
becomes odds ratios if we divide each of these by p(1)/[1-p(1)]:
1 <= OR(2,1) <= OR(3,1) <= OR(4,1) where
OR(j,k) is the odds ratio comparing category j to category k.
Thus, the alternative hypothesis for the trend test is equivalent to
looking for an increasing odds ratio (or increasing log odds ratio).
We obtain the following model estimates: (only showing the "beta" estimates)
Logit estimates Number of obs = 972
LR chi2(1) = 143.74
Prob > chi2 = 0.0000
Log likelihood = -422.18354 Pseudo R2 = 0.1455
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.044229 .0935454 11.163 0.000 .8608836 1.227575
_cons | -3.522645 .228131 -15.441 0.000 -3.969774 -3.075517
------------------------------------------------------------------------------
A trend test in terms of the regression coefficient is:
H0: b1=0
H1: b1 not equal to zero.
We can use either the likelihood ratio test (given above as LR chi2(1),
with a test statistic of 143.74 on df=1, p-value<0.001) or the Wald
statistic (given above as 1.044/0.0935 = 11.163, p-value<0.001). Using
either test we would reject the null hypothesis.
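The Wald statistic for the trend is again just (estimate/s.e.); a quick
check:

```python
# Wald statistic for the linear trend in ALC.
b1, se = 1.044229, 0.0935454
z = b1 / se

assert abs(z - 11.163) < 0.005
```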
2(b) Next we fit a model that uses dummy variables. This can help us to
"see" the increase in the log odds across the categories of the predictor:
logit[ pi(X) ] = b0 + b1*ALC(2) + b2*ALC(3) + b3*ALC(4)
The output from this fit is:
Logit estimates Number of obs = 972
LR chi2(3) = 145.55
Prob > chi2 = 0.0000
Log likelihood = -421.27731 Pseudo R2 = 0.1473
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
Ialc_2 | 1.263438 .2323757 5.437 0.000 .8079897 1.718886
Ialc_3 | 2.046657 .2611432 7.837 0.000 1.534825 2.558488
Ialc_4 | 3.296359 .3236825 10.184 0.000 2.661953 3.930765
_cons | -2.580739 .1925972 -13.400 0.000 -2.958223 -2.203256
------------------------------------------------------------------------------
Here we find that the estimate of b1 (ALC=2 versus ALC=1) is 1.26, the
estimate of b2 is 2.05 (ALC=3 versus ALC=1) and the estimate of b3 is
3.30 (ALC=4 versus ALC=1). These show a very steady increase in the log
odds as we move up the categories of ALC (approximately 1-unit increases!).
This suggests that the linear model in 2(a) is good.
"Is the model in (a) nested within this model?" -- Yes. The next model
fit helps us to see that.
2(c) This model takes the linear term ALC and then adds the dummy
variables ALC(3) and ALC(4). It has the same number of parameters as the
dummy variable model in 2(b). Let's compare the two models:
                  log odds
         dummy variables      linear + (2) dummy
ALC=1    b0                   b0 + b1*(1)
ALC=2    b0 + b1              b0 + b1*(2)
ALC=3    b0 + b2              b0 + b1*(3) + b2
ALC=4    b0 + b3              b0 + b1*(4) + b3
*** It is useful to try and graph these models. ***
Here is the fitted model: (log odds only)
Logit estimates Number of obs = 972
LR chi2(3) = 145.55
Prob > chi2 = 0.0000
Log likelihood = -421.27731 Pseudo R2 = 0.1473
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.263438 .2323757 5.437 0.000 .8079897 1.718886
alc3 | -.4802187 .3685306 -1.303 0.193 -1.202525 .2420881
alc4 | -.4939538 .6067902 -0.814 0.416 -1.683241 .695333
_cons | -3.844177 .4065459 -9.456 0.000 -4.640992 -3.047362
------------------------------------------------------------------------------
Notice that we have the following fitted log odds:
ALC=1 -3.844 + 1.263*(1) = -2.581
ALC=2 -3.844 + 1.263*(2) = -1.318 = -2.581 + 1.263
ALC=3 -3.844 + 1.263*(3) -0.480 = -0.535 = -2.581 + 2.046
ALC=4 -3.844 + 1.263*(4) -0.494 = 0.714 = -2.581 + 3.295
This gives exactly the same fitted values (log odds) as the dummy
variable model!
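This equivalence is easy to verify numerically; a sketch comparing the two
parameterizations:

```python
# Dummy-variable model: intercept plus one coefficient per category.
d0, d2, d3, d4 = -2.580739, 1.263438, 2.046657, 3.296359
dummy = [d0, d0 + d2, d0 + d3, d0 + d4]

# Linear-plus-two-dummies model from the output above.
c0, b_lin, a3, a4 = -3.844177, 1.263438, -0.4802187, -0.4939538
extras = [0.0, 0.0, a3, a4]
linear = [c0 + b_lin * alc + e for alc, e in zip([1, 2, 3, 4], extras)]

# The fitted log odds agree category by category.
for u, v in zip(dummy, linear):
    assert abs(u - v) < 1e-5
```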
Clearly the linear model is nested within this model. We obtain the linear
model as a special case where:
H0: b2=b3=0 (coefficients of ALC(3) and ALC(4) are zero)
H1: not both are zero.
2(d) A likelihood ratio test of the model in 2(c) versus the model in 2(a)
yields a likelihood ratio statistic of:
LR = 2*( -421.28 - -422.18 ) = 1.81.
This has df=2 (2 parameters set to zero) and comparing the LR statistic to
a chi2(df=2) yields:
P[ chi2(df=2) > 1.81 ] = 0.40
Therefore we would not reject the null hypothesis and conclude that the linear
model is adequate.
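For df=2 the chi-square upper tail has the closed form exp(-x/2), so this
p-value can be checked directly:

```python
import math

# LR test of the linear ALC model against the dummy-variable model.
logL_linear = -422.18354
logL_dummy = -421.27731
lr = 2 * (logL_dummy - logL_linear)
p = math.exp(-lr / 2)   # chi-square(2 df) upper tail

assert abs(lr - 1.81) < 0.005
assert abs(p - 0.40) < 0.005
```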
2(e) Age adjustment: Using the model with ALC as a linear term we can add
age group dummy variables. The fitted model here is:
Logit estimates Number of obs = 972
LR chi2(6) = 254.61
Prob > chi2 = 0.0000
Log likelihood = -366.74772 Pseudo R2 = 0.2577
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.093055 .1033092 10.580 0.000 .8905723 1.295537
Iage_2 | 1.580665 1.075439 1.470 0.142 -.5271578 3.688487
Iage_3 | 3.30387 1.031849 3.202 0.001 1.281483 5.326256
Iage_4 | 3.826048 1.027778 3.723 0.000 1.81164 5.840456
Iage_5 | 4.214519 1.03375 4.077 0.000 2.188407 6.240631
Iage_6 | 4.307893 1.083363 3.976 0.000 2.184542 6.431245
_cons | -6.982177 1.050306 -6.648 0.000 -9.040739 -4.923616
------------------------------------------------------------------------------
The resulting age adjusted odds ratio for ALC is given by:
exp( 1.093 ) = 2.983
This odds ratio compares the odds of disease for two groups that differ
by 1 unit in ALC category, where age is controlled (ie. held fixed).
That is, this is the odds ratio comparing ALC=2 to ALC=1, or comparing
ALC=3 to ALC=2, or comparing ALC=4 to ALC=3, for fixed values of age.
The estimate for this odds ratio that did not adjust for age was given in
2(a) as
exp( 1.044 ) = 2.841
Age adjustment results in only a minor change in the odds ratio estimate. This
suggests that AGE does not confound ALC.
2(f) Inspection of the coefficients for the AGE dummy variables can be used
to see if a linear model might be adequate for AGE. For these coefficients
we find that there is a large change up until AGE category 4, but then the
increase tails off. This hints that a linear model will not be adequate.
Formally, we can test the linear model versus a dummy variable model using
the likelihood ratio test. For this we use:
H0: linear model -- log odds = b0 + b1*ALC + b2*AGE
H1: dummy variable -- log odds = b0 + b1*ALC + b2*AGE(2)
+ b3*AGE(3)
+ b4*AGE(4)
+ b5*AGE(5)
+ b6*AGE(6)
Fitting the null model (not shown) yields a maximized log likelihood of
-375.60.
The likelihood ratio test is then:
LR = 2*( -366.748 - -375.60 ) = 17.70
Comparing this to a chi2(df=4) yields:
P[ chi2(df=4) > 17.70 ] = 0.001
Therefore we would reject H0, the linear model, in favor of the dummy variable
model.
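For df=4 the chi-square upper tail is exp(-x/2)*(1 + x/2); a check of the
calculation:

```python
import math

# LR test of linear AGE against the AGE dummy-variable model.
logL_linear = -375.60
logL_dummy = -366.74772
lr = 2 * (logL_dummy - logL_linear)
p = math.exp(-lr / 2) * (1 + lr / 2)   # chi-square(4 df) upper tail

assert abs(lr - 17.70) < 0.01
assert abs(p - 0.0014) < 0.0002
```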
2(g) Looking back at the grouping of alcohol use we find:
ALC = 1 for <40 g/day of alcohol use
ALC = 2 for 40-79 g/day
ALC = 3 for 80-119 g/day
ALC = 4 for 120+ g/day
Using this, we can say that adjacent categories roughly represent 40 g/day
changes in alcohol use.
Therefore, since b1 (in model with b1*ALC category) represents the increase
in the log odds for a 1 unit category change, we would expect a b1 increase
to be associated with a 40 g/day increase in the continuous alcohol
measurement.
So, I'd predict that if we used the actual g/day of alcohol use as a
predictor, then we'd find the new coefficient of alcohol to
be about b1/40.
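As rough arithmetic (a guess under the assumption that the category scale is
approximately linear in g/day):

```python
# One ALC category ~ 40 g/day, so the predicted per-gram coefficient
# is roughly the per-category coefficient divided by 40.
b1_category = 1.044229          # coefficient per category (from 2(a))
b1_per_gram = b1_category / 40  # predicted coefficient per g/day

assert abs(b1_per_gram - 0.0261) < 0.0005
```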