***************************
* Biostatistics 513 * Exercise Set 4, 2002
***************************
1(a) Using Y as the response and NEWIRON as a predictor leads to the logistic
regression model:
logit[ pi(X) ] = b0 + b1*NEWIRON
where pi(X) is the probability that Y=1 for a given covariate value, X.
1(b) For this model the coefficient b1 is a log odds ratio. In fact,
the odds of CHD among high iron (NEWIRON=1) subjects compared to the
odds of CHD among low iron (NEWIRON=0) subjects is given by: exp( b1 ).
Our crude analysis estimates this odds ratio as 1.475 (95% CI = 1.103,
1.973).
1(c) Fitting this model we obtain coefficient estimates, b0 and b1, and their
standard errors:
Logit estimates Number of obs = 908
LR chi2(1) = 6.83
Prob > chi2 = 0.0090
Log likelihood = -595.99383 Pseudo R2 = 0.0057
------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | 1.475 .2187347 2.62 0.009 1.102969 1.972517
------------------------------------------------------------------------------
. logit
Logit estimates Number of obs = 908
LR chi2(1) = 6.83
Prob > chi2 = 0.0090
Log likelihood = -595.99383 Pseudo R2 = 0.0057
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | .388658 .1482947 2.62 0.009 .0980057 .6793102
_cons | -.6418539 .0832934 -7.71 0.000 -.805106 -.4786017
------------------------------------------------------------------------------
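The two tables above are related by exponentiation: the odds ratio column is
just exp() of the coefficient column. A quick numerical check (a Python
sketch, not part of the Stata output):

```python
import math

# The odds-ratio table is the coefficient table exponentiated:
# OR = exp(b1), with CI endpoints exponentiated the same way.
b1, se = 0.388658, 0.1482947            # newiron coef. and std. err.
or_hat = math.exp(b1)
ci = (math.exp(0.0980057), math.exp(0.6793102))  # CI from the log scale

assert abs(or_hat - 1.475) < 0.001
assert abs(ci[0] - 1.102969) < 0.0005
assert abs(ci[1] - 1.972517) < 0.0005
```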
The easiest test of H0: b1=0 is obtained from the Wald statistic in the
second table above. This is (estimate/s.e.) = 0.389/0.148 = 2.62.
Since this test statistic is greater than the alpha=0.05 critical value for a
standard normal (ie. 1.96), we reject the null hypothesis.
This result is also clear from the confidence intervals: the confidence
interval for the odds ratio (first table) does not contain 1.0; and the
confidence interval for the log odds ratio (second table) does not
contain 0.0.
The likelihood ratio test is trickier here -- it is given by the LR chi2(1)
statistic. We find a p-value of 0.009 and therefore would reject H0 at
the 5% significance level.
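As a sketch of how the Wald p-value arises (Python, with the standard normal
tail written via erfc; the numbers are from the coefficient table above):

```python
import math

# Wald test of H0: b1 = 0 using the reported estimate and std. error.
b1, se = 0.388658, 0.1482947
z = b1 / se                             # Wald statistic
p = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value

assert abs(z - 2.62) < 0.005
assert abs(p - 0.009) < 0.001
```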
1(d) Using AGE dummy variables:
AGE(j) = 1 if AGE category=j, and 0 if not AGE category=j
we have the model:
logit[ pi(X) ] = b0 + b1*NEWIRON
+ b2*FEMALE
+ b3*AGE(2)
+ b4*AGE(3)
+ b5*AGE(4)
Here we model the log odds of disease allowing a different baseline
for each age category and each gender, but within age and gender groups
we have the same NEWIRON comparison (ie. there is a "common" NEWIRON odds ratio
across the age and gender groups; alternatively, we can say that this model
assumes no interaction between NEWIRON and either age or gender).
1(e) The coefficient of NEWIRON in this model has the interpretation:
exp( b1 ) is the odds ratio comparing the odds of CHD among the
subjects with high iron (NEWIRON=1) to the odds of a CHD among the
low iron subjects, where the subjects being compared *are of the same
gender and the same age category*. Alternatively, we can say that the
exponentiated NEWIRON coefficient estimate, exp(0.0898)=1.094, is an
estimate of comparison of high iron to low iron controlling for age
and gender.
1(f) The coefficient for the AGE(3) variable in this model is:
exp( b4 ) is the odds ratio comparing the odds of a CHD
among age category 3 subjects to the odds of a CHD among
age category 1 subjects, where the subjects being compared *are of the
same gender and have the same iron consumption (ie. NEWIRON)*. Since
we have NEWIRON and FEMALE in the model, the coefficient for
AGE(3) "controls" for these factors -- that is, compares age=3 to age=1
for fixed values of NEWIRON and FEMALE. This leads to b4 as a comparison
of the log odds of a CHD for:
(NEWIRON=0, FEMALE=0, AGE=3) versus (NEWIRON=0, FEMALE=0, AGE=1), *or*
(NEWIRON=1, FEMALE=0, AGE=3) versus (NEWIRON=1, FEMALE=0, AGE=1), *or*
(NEWIRON=0, FEMALE=1, AGE=3) versus (NEWIRON=0, FEMALE=1, AGE=1), *or*
(NEWIRON=1, FEMALE=1, AGE=3) versus (NEWIRON=1, FEMALE=1, AGE=1).
1(g) (I'll cut-and-paste 1f and substitute AGE(4) for AGE(3)):
exp( b5 ) is the odds ratio comparing the odds of a CHD
among age category 4 subjects to the odds of a CHD among
age category 1 subjects, where the subjects being compared *are of the
same gender and have the same iron consumption (ie. NEWIRON)*. Since
we have NEWIRON and FEMALE in the model, the coefficient for
AGE(4) "controls" for these factors -- that is, compares age=4 to age=1
for fixed values of NEWIRON and FEMALE. This leads to b5 as a comparison
of the log odds of a CHD for:
(NEWIRON=0, FEMALE=0, AGE=4) versus (NEWIRON=0, FEMALE=0, AGE=1), *or*
(NEWIRON=1, FEMALE=0, AGE=4) versus (NEWIRON=1, FEMALE=0, AGE=1), *or*
(NEWIRON=0, FEMALE=1, AGE=4) versus (NEWIRON=0, FEMALE=1, AGE=1), *or*
(NEWIRON=1, FEMALE=1, AGE=4) versus (NEWIRON=1, FEMALE=1, AGE=1).
1(h) We obtain odds ratios after calculating the difference in the log odds
and then exponentiating it:
log odds for group 1 - log odds for group 2 =
log ( odds for group 1 / odds for group 2 ) =
log ( odds ratio comparing group 1 to group 2).
For this example we can proceed as follows:
log odds for (NEWIRON=0,FEMALE=0,AGE=4) = b0 + b5
log odds for (NEWIRON=0,FEMALE=0,AGE=3) = b0 + b4
log odds for (NEWIRON=0,FEMALE=0,AGE=4)
- log odds for (NEWIRON=0,FEMALE=0,AGE=3) = b5 - b4
So that exp( b5 - b4 ) is the odds ratio comparing the odds of disease
among AGE=4 low iron, female subjects to the odds of disease among AGE=3
low iron, female subjects.
The following output is from Stata's "lincom" command, which produces
estimates and confidence intervals for such combinations of parameters:
.
. ***** let's get STATA to calculate the comparison in 1(h)
. lincom _Iage_4 - _Iage_3, or
( 1) - _Iage_3 + _Iage_4 = 0.0
------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .41756 .1144447 -3.19 0.001 .2440179 .7145226
------------------------------------------------------------------------------
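The lincom result can be reproduced by hand from the coefficient estimates of
the adjusted model (shown in 1(i) below); a sketch:

```python
import math

# Odds ratio for AGE=4 vs AGE=3 is exp(b5 - b4), using the
# coefficients from the model with newiron, female, and age dummies.
b4 = 0.8589874    # _Iage_3
b5 = -0.0143397   # _Iage_4
or_43 = math.exp(b5 - b4)

assert abs(or_43 - 0.41756) < 0.0005
```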
1(i) We have the following fitted logistic regression models (here just
showing the "beta" coefficients, not the odds ratios, ie. the exp( beta )'s):
Logit estimates Number of obs = 908
LR chi2(4) = 120.26
Prob > chi2 = 0.0000
Log likelihood = -539.27623 Pseudo R2 = 0.1003
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -1.632189 .1838605 -8.88 0.000 -1.992549 -1.271829
_Iage_2 | .6817876 .2027506 3.36 0.001 .2844036 1.079172
_Iage_3 | .8536424 .1990066 4.29 0.000 .4635965 1.243688
_Iage_4 | -.022819 .2911035 -0.08 0.938 -.5933713 .5477333
_cons | -.6085052 .1579415 -3.85 0.000 -.9180648 -.2989455
------------------------------------------------------------------------------
. lrtest, saving(1)
Logit estimates Number of obs = 908
LR chi2(5) = 120.58
Prob > chi2 = 0.0000
Log likelihood = -539.11767 Pseudo R2 = 0.1006
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | .0898209 .1593785 0.56 0.573 -.2225551 .402197
female | -1.612586 .1870517 -8.62 0.000 -1.979201 -1.245972
_Iage_2 | .6850382 .2028911 3.38 0.001 .2873789 1.082698
_Iage_3 | .8589874 .1993287 4.31 0.000 .4683103 1.249664
_Iage_4 | -.0143397 .2916048 -0.05 0.961 -.5858747 .5571953
_cons | -.6455109 .1712649 -3.77 0.000 -.9811841 -.3098378
------------------------------------------------------------------------------
. lrtest, saving(2)
The likelihood ratio test is obtained by comparing the maximized log
likelihoods:
LR = 2 (logL2 - logL1) = 2( -539.12 - -539.28 ) = 0.32
Since the models being compared only differ by one parameter (one covariate
is excluded -- NEWIRON) the degrees of freedom for this test is df=1.
The null hypothesis is that the coefficient for the variable that is not in the
richer model is 0:
H0: b1=0 (where b1 is the coefficient of NEWIRON)
H1: b1 is not equal to zero.
. lrtest, using(2) model(1)
Logistic: likelihood-ratio test chi2(1) = 0.32
Prob > chi2 = 0.5733
Comparing the LR statistic to a chi2(1) distribution we obtain the p-value,
P[ chi2(1) > 0.32 ] = 0.573. Therefore, we fail to reject H0 and conclude
that there is no significant iron effect, after adjusting for age and gender.
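The arithmetic of the LR test can be sketched as follows (for a chi-square
with 1 df the upper tail probability is erfc(sqrt(x/2))):

```python
import math

# Likelihood ratio test comparing the two nested fits.
logL1 = -539.27623   # model without newiron
logL2 = -539.11767   # model with newiron
lr = 2 * (logL2 - logL1)
p = math.erfc(math.sqrt(lr / 2))   # chi-square(1 df) upper tail

assert abs(lr - 0.32) < 0.005
assert abs(p - 0.5733) < 0.001
```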
1(j) The estimated common odds ratio using logistic regression is exp(.0898)
which is 1.094, 95% CI = ( 0.800, 1.495). Using Mantel-Haenszel we obtain a
similar point estimate and confidence interval: OR estimate = 1.104,
95% CI = (0.809, 1.507). These estimates have exactly the same
interpretation -- the comparison of the odds of disease
for high versus low iron controlling for age and gender.
1(k) A priori we may expect age to be related to disease and may or may not
expect age to be related to iron consumption. If age is not related to iron
then it would be a "precision" variable, while if age were related to both the
outcome (disease) and the exposure (newiron) then it would be a potential
confounder. We do find some evidence for an association between disease and
age (note that 48% of cases are older than 59 while only 42% of controls are
older than 59). Similarly we find some association between age and iron
(46% of low iron subjects are older than 59 while 40% of high iron subjects
are older than 59). However, both associations appear rather small.
To assess the confounding influence we can compare the estimated effect
of interest both adjusting and not adjusting for age:
newiron OR in model with female and age: 1.094
newiron OR in model with female and not age: 1.074
Thus, we see a small change in the estimated odds ratio when we fail to
control for age, indicating that adjusting for age has little qualitative
impact on our conclusions regarding iron.
1(l) Female is clearly important to include in the model. Not only is
gender strongly predictive of the outcome, but it is associated with the
exposure of interest. Furthermore, the estimated crude iron odds ratio
is 1.475 and statistically significant, however as shown in answer to
question 1(k), controlling for gender alone reduces the estimated odds
ratio to only 1.074, and the comparison is no longer statistically
significant. Therefore, since controlling for gender has a large impact
on the estimated iron odds ratio, it is important to include gender in the
logistic regression model.
1(m) In the following model we allow an interaction between newiron
and female:
logit[ pi(X) ] = b0 + b1*NEWIRON
+ b2*FEMALE
+ b3*AGE(2)
+ b4*AGE(3)
+ b5*AGE(4)
+ b6*NEWIRON*FEMALE
log odds (NEWIRON=1, FEMALE=0, AGE=j ) = b0 + b1 + b(j+1)
log odds (NEWIRON=0, FEMALE=0, AGE=j ) = b0 + b(j+1)
log odds (NEWIRON=1, FEMALE=0, AGE=j ) -
log odds (NEWIRON=0, FEMALE=0, AGE=j ) = b1
and
log odds (NEWIRON=1, FEMALE=1, AGE=j ) = b0 + b1 + b2 + b(j+1) + b6
log odds (NEWIRON=0, FEMALE=1, AGE=j ) = b0 + b2 + b(j+1)
log odds (NEWIRON=1, FEMALE=1, AGE=j ) -
log odds (NEWIRON=0, FEMALE=1, AGE=j ) = b1 + b6
Therefore, exp( b1 ) is the odds ratio comparing high iron to low
iron for males (FEMALE=0), controlling for age, and exp( b1+b6 ) is
the odds ratio comparing high iron to low iron for females (FEMALE=1),
controlling for age.
If b6=0 then the OR for males is exp( b1 ) and the OR for females is
then also exp( b1 + 0 ) = exp( b1 ). Therefore, the null hypothesis
of no interaction, or common odds ratio, or homogeneity of odds ratios
is given by:
H0: b6=0
H1: b6 not equal to 0
1(n) Using logistic regression that includes: newiron female i.age
and ironXgen (the product of newiron and female) leads to the
estimate of b6 = 0.451, with a Z statistic (also called Wald statistic)
of 0.451/0.455 = 0.99, p-value=0.321. Therefore, we fail to reject
the null hypothesis that b6=0 and conclude that the age-adjusted OR for
males is not significantly different from the age-adjusted OR for females.
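A sketch of that Wald calculation (the estimate and standard error are as
quoted above):

```python
import math

# Wald test for the interaction coefficient, H0: b6 = 0.
b6, se6 = 0.451, 0.455
z = b6 / se6
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

assert abs(z - 0.99) < 0.005
assert abs(p - 0.321) < 0.005
```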
2(a) Here we have the ordered categorical variable, ALC, that takes the
values 1, 2, 3, 4, representing levels of alcohol use.
The model:
logit[ pi(X) ] = b0 + b1*ALC
can be used to detect a trend in the odds of disease with increasing categories
of ALC.
Recall that earlier our trend test was based on the following:
Let p(j) = the probability that Y=1 (here disease) for category j of a
covariate (here ALC).
H0: p(1)=p(2)=p(3)=p(4)
H1: p(1)<=p(2)<=p(3)<=p(4) with one inequality strictly < , or
p(1)>=p(2)>=p(3)>=p(4) with one inequality strictly >
Notice that the alternative can be viewed in terms of odds ratios:
p(1) <= p(2) <= p(3) <= p(4) implies
p(1)/[1-p(1)] <= p(2)/[1-p(2)] <= p(3)/[1-p(3)] <= p(4)/[1-p(4)]
that is, increasing probabilities implies increasing odds. Then, this
becomes odds ratios if we divide each of these by p(1)/[1-p(1)]:
1 <= OR(2,1) <= OR(3,1) <= OR(4,1) where
OR(j,k) is the odds ratio comparing category j to category k.
Thus, the alternative hypothesis for the trend test is equivalent to
looking for an increasing odds ratio (or increasing log odds ratio).
We obtain the following model estimates: (only showing the "beta" estimates)
Logit estimates Number of obs = 972
LR chi2(1) = 143.74
Prob > chi2 = 0.0000
Log likelihood = -422.18354 Pseudo R2 = 0.1455
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.044229 .0935454 11.163 0.000 .8608836 1.227575
_cons | -3.522645 .228131 -15.441 0.000 -3.969774 -3.075517
------------------------------------------------------------------------------
A trend test in terms of the regression coefficient is:
H0: b1=0
H1: b1 not equal to zero.
We can use either the likelihood ratio test (given above as LR chi2(1),
with a test statistic of 143.74 on df=1, p-value<0.001) or the Wald
statistic (given above as 1.044/0.0935 = 11.163, p-value<0.001). Using
either test we would reject the null hypothesis.
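The Wald statistic for the trend is again just (estimate/s.e.); a quick
check:

```python
# Wald statistic for the linear trend in ALC.
b1, se = 1.044229, 0.0935454
z = b1 / se

assert abs(z - 11.163) < 0.005
```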
2(b) Next we fit a model that uses dummy variables. This can help us to
"see" the increase in the log odds across the categories of the predictor:
logit[ pi(X) ] = b0 + b1*ALC(2) + b2*ALC(3) + b3*ALC(4)
The output from this fit is:
Logit estimates Number of obs = 972
LR chi2(3) = 145.55
Prob > chi2 = 0.0000
Log likelihood = -421.27731 Pseudo R2 = 0.1473
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
Ialc_2 | 1.263438 .2323757 5.437 0.000 .8079897 1.718886
Ialc_3 | 2.046657 .2611432 7.837 0.000 1.534825 2.558488
Ialc_4 | 3.296359 .3236825 10.184 0.000 2.661953 3.930765
_cons | -2.580739 .1925972 -13.400 0.000 -2.958223 -2.203256
------------------------------------------------------------------------------
Here we find that the estimate of b1 (ALC=2 versus ALC=1) is 1.26, the
estimate of b2 is 2.05 (ALC=3 versus ALC=1) and the estimate of b3 is
3.30 (ALC=4 versus ALC=1). These show a very steady increase in the log
odds as we move up the categories of ALC (approximately 1-unit increases!).
This suggests that the linear model in 2(a) is good.
"Is the model in (a) nested within this model?" -- Yes. The next model
fit helps us to see that.
2(c) This model takes the linear term ALC and then adds the dummy
variables ALC(3) and ALC(4). It has the same number of parameters as the
dummy variable model in 2(b). Let's compare the two models:
                  log odds
         dummy variables      linear + (2) dummy
ALC=1    b0                   b0 + b1*(1)
ALC=2    b0 + b1              b0 + b1*(2)
ALC=3    b0 + b2              b0 + b1*(3) + b2
ALC=4    b0 + b3              b0 + b1*(4) + b3
*** It is useful to try and graph these models. ***
Here is the fitted model: (log odds only)
Logit estimates Number of obs = 972
LR chi2(3) = 145.55
Prob > chi2 = 0.0000
Log likelihood = -421.27731 Pseudo R2 = 0.1473
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.263438 .2323757 5.437 0.000 .8079897 1.718886
alc3 | -.4802187 .3685306 -1.303 0.193 -1.202525 .2420881
alc4 | -.4939538 .6067902 -0.814 0.416 -1.683241 .695333
_cons | -3.844177 .4065459 -9.456 0.000 -4.640992 -3.047362
------------------------------------------------------------------------------
Notice that we have the following fitted log odds:
ALC=1 -3.844 + 1.263*(1) = -2.581
ALC=2 -3.844 + 1.263*(2) = -1.318 = -2.581 + 1.263
ALC=3 -3.844 + 1.263*(3) -0.480 = -0.535 = -2.581 + 2.046
ALC=4 -3.844 + 1.263*(4) -0.494 = 0.714 = -2.581 + 3.295
This gives exactly the same fitted values (log odds) as the dummy
variable model!
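This equivalence is easy to verify numerically; a sketch comparing the two
parameterizations:

```python
# Dummy-variable model: intercept plus one coefficient per category.
d0, d2, d3, d4 = -2.580739, 1.263438, 2.046657, 3.296359
dummy = [d0, d0 + d2, d0 + d3, d0 + d4]

# Linear-plus-two-dummies model from the output above.
c0, b_lin, a3, a4 = -3.844177, 1.263438, -0.4802187, -0.4939538
extras = [0.0, 0.0, a3, a4]
linear = [c0 + b_lin * alc + e for alc, e in zip([1, 2, 3, 4], extras)]

# The fitted log odds agree category by category.
for u, v in zip(dummy, linear):
    assert abs(u - v) < 1e-5
```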
Clearly the linear model is nested within this model. We obtain the linear
model as a special case where:
H0: b2=b3=0 (coefficients of ALC(3) and ALC(4) are zero)
H1: not both are zero.
2(d) A likelihood ratio test of the model in 2(c) versus the model in 2(a)
yields a likelihood ratio statistic of:
LR = 2*( -421.28 - -422.18 ) = 1.81.
This has df=2 (2 parameters set to zero) and comparing the LR statistic to
a chi2(df=2) yields:
P[ chi2(df=2) > 1.81 ] = 0.40
Therefore we would not reject the null hypothesis and conclude that the linear
model is adequate.
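For df=2 the chi-square upper tail has the closed form exp(-x/2), so this
p-value can be checked directly:

```python
import math

# LR test of the linear ALC model against the dummy-variable model.
logL_linear = -422.18354
logL_dummy = -421.27731
lr = 2 * (logL_dummy - logL_linear)
p = math.exp(-lr / 2)   # chi-square(2 df) upper tail

assert abs(lr - 1.81) < 0.005
assert abs(p - 0.40) < 0.005
```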
2(e) Age adjustment: Using the model with ALC as a linear term we can add
age group dummy variables. The fitted model here is:
Logit estimates Number of obs = 972
LR chi2(6) = 254.61
Prob > chi2 = 0.0000
Log likelihood = -366.74772 Pseudo R2 = 0.2577
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.093055 .1033092 10.580 0.000 .8905723 1.295537
Iage_2 | 1.580665 1.075439 1.470 0.142 -.5271578 3.688487
Iage_3 | 3.30387 1.031849 3.202 0.001 1.281483 5.326256
Iage_4 | 3.826048 1.027778 3.723 0.000 1.81164 5.840456
Iage_5 | 4.214519 1.03375 4.077 0.000 2.188407 6.240631
Iage_6 | 4.307893 1.083363 3.976 0.000 2.184542 6.431245
_cons | -6.982177 1.050306 -6.648 0.000 -9.040739 -4.923616
------------------------------------------------------------------------------
The resulting age adjusted odds ratio for ALC is given by:
exp( 1.093 ) = 2.983
This odds ratio compares the odds of disease for two groups that differ
by 1 unit in ALC category, where age is controlled (ie. held fixed).
That is, this is the odds ratio comparing ALC=2 to ALC=1, or comparing
ALC=3 to ALC=2, or comparing ALC=4 to ALC=3, for fixed values of age.
The estimate for this odds ratio that did not adjust for age was given in
2(a) as
exp( 1.044 ) = 2.841
Age adjustment results in only a minor change in the odds ratio estimate. This
suggests that AGE does not confound ALC.
2(f) Inspection of the coefficients for the AGE dummy variables can be used
to see if a linear model might be adequate for AGE. For these coefficients
we find that there is a large change up until AGE category 4, but then the
increase tails off. This hints that a linear model will not be adequate.
Formally, we can test the linear model versus a dummy variable model using
the likelihood ratio test. For this we use:
H0: linear model -- log odds = b0 + b1*ALC + b2*AGE
H1: dummy variable -- log odds = b0 + b1*ALC + b2*AGE(2)
+ b3*AGE(3)
+ b4*AGE(4)
+ b5*AGE(5)
+ b6*AGE(6)
Fitting the null model (not shown) yields a maximized log likelihood of
-375.60.
The likelihood ratio test is then:
LR = 2*( -366.748 - -375.60 ) = 17.70
Comparing this to a chi2(df=4) yields:
P[ chi2(df=4) > 17.70 ] = 0.001
Therefore we would reject H0, the linear model, in favor of the dummy variable
model.
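For df=4 the chi-square upper tail is exp(-x/2)*(1 + x/2); a check of the
calculation:

```python
import math

# LR test of linear AGE against the AGE dummy-variable model.
logL_linear = -375.60
logL_dummy = -366.74772
lr = 2 * (logL_dummy - logL_linear)
p = math.exp(-lr / 2) * (1 + lr / 2)   # chi-square(4 df) upper tail

assert abs(lr - 17.70) < 0.01
assert abs(p - 0.0014) < 0.0002
```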
2(g) Looking back at the grouping of alcohol use we find:
ALC = 1 for <40 g/day of alcohol use
ALC = 2 for 40-79 g/day
ALC = 3 for 80-119 g/day
ALC = 4 for 120+ g/day
Using this, we can say that adjacent categories roughly represent 40 g/day
changes in alcohol use.
Therefore, since b1 (in model with b1*ALC category) represents the increase
in the log odds for a 1 unit category change, we would expect a b1 increase
to be associated with a 40 g/day increase in the continuous alcohol
measurement.
So, I'd predict that if we used the actual g/day of alcohol use as a
predictor, then we'd find the new coefficient of alcohol to
be about b1/40.
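As rough arithmetic (a guess under the assumption that the category scale is
approximately linear in g/day):

```python
# One ALC category ~ 40 g/day, so the predicted per-gram coefficient
# is roughly the per-category coefficient divided by 40.
b1_category = 1.044229          # coefficient per category (from 2(a))
b1_per_gram = b1_category / 40  # predicted coefficient per g/day

assert abs(b1_per_gram - 0.0261) < 0.0005
```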