***************************
* Biostatistics 513 * Exercise Set 4, 2002
***************************
1(a) Using Y as the response and NEWIRON as a predictor leads to the logistic
regression model:
logit[ pi(X) ] = b0 + b1*NEWIRON
where pi(X) is the probability that Y=1 for a given covariate value, X.
1(b) For this model the coefficient b1 is a log odds ratio. In fact,
the odds of CHD among high iron (NEWIRON=1) subjects compared to the
odds of CHD among low iron (NEWIRON=0) subjects is given by: exp( b1 ).
Our crude analysis estimates this odds ratio as 1.475 (95% CI = 1.103,
1.973).
1(c) Fitting this model we obtain coefficient estimates, b0 and b1, and their
standard errors:
Logit estimates Number of obs = 908
LR chi2(1) = 6.83
Prob > chi2 = 0.0090
Log likelihood = -595.99383 Pseudo R2 = 0.0057
------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | 1.475 .2187347 2.62 0.009 1.102969 1.972517
------------------------------------------------------------------------------
. logit
Logit estimates Number of obs = 908
LR chi2(1) = 6.83
Prob > chi2 = 0.0090
Log likelihood = -595.99383 Pseudo R2 = 0.0057
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | .388658 .1482947 2.62 0.009 .0980057 .6793102
_cons | -.6418539 .0832934 -7.71 0.000 -.805106 -.4786017
------------------------------------------------------------------------------
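The two tables above are related by exponentiation: the odds ratio column is
just exp() of the coefficient column. A quick numerical check (a Python
sketch, not part of the Stata output):

```python
import math

# The odds-ratio table is the coefficient table exponentiated:
# OR = exp(b1), with CI endpoints exponentiated the same way.
b1, se = 0.388658, 0.1482947            # newiron coef. and std. err.
or_hat = math.exp(b1)
ci = (math.exp(0.0980057), math.exp(0.6793102))  # CI from the log scale

assert abs(or_hat - 1.475) < 0.001
assert abs(ci[0] - 1.102969) < 0.0005
assert abs(ci[1] - 1.972517) < 0.0005
```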
The easiest test of H0: b1=0 is obtained from the Wald statistic in the
second table above. This is (estimate/s.e.) = 0.389/0.148 = 2.62.
Since this test statistic is greater than the alpha=0.05 critical value for a
standard normal (ie. 1.96), we reject the null hypothesis.
This result is also clear from the confidence intervals: the confidence
interval for the odds ratio (first table) does not contain 1.0; and the
confidence interval for the log odds ratio (second table) does not
contain 0.0.
The likelihood ratio test is trickier here -- it is given by the LR chi2(1)
statistic. We find a p-value of 0.009 and therefore would reject H0 at
the 5% significance level.
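As a sketch of how the Wald p-value arises (Python, with the standard normal
tail written via erfc; the numbers are from the coefficient table above):

```python
import math

# Wald test of H0: b1 = 0 using the reported estimate and std. error.
b1, se = 0.388658, 0.1482947
z = b1 / se                             # Wald statistic
p = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value

assert abs(z - 2.62) < 0.005
assert abs(p - 0.009) < 0.001
```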
1(d) Using AGE dummy variables:
AGE(j) = 1 if AGE category=j, and 0 if not AGE category=j
we have the model:
logit[ pi(X) ] = b0 + b1*NEWIRON
+ b2*FEMALE
+ b3*AGE(2)
+ b4*AGE(3)
+ b5*AGE(4)
Here we model the log odds of disease allowing a different baseline
for each age category and each gender, but within age and gender groups
we have the same NEWIRON comparison (ie. there is a "common" NEWIRON odds ratio
across the age and gender groups; alternatively, we can say that this model
assumes no interaction between NEWIRON and either age or gender).
1(e) The coefficient of NEWIRON in this model has the interpretation:
exp( b1 ) is the odds ratio comparing the odds of CHD among the
subjects with high iron (NEWIRON=1) to the odds of a CHD among the
low iron subjects, where the subjects being compared *are of the same
gender and the same age category*. Alternatively, we can say that the
exponentiated NEWIRON coefficient estimate, exp(0.0898)=1.094, is an
estimate of comparison of high iron to low iron controlling for age
and gender.
1(f) The coefficient for the AGE(3) variable in this model is:
exp( b4 ) is the odds ratio comparing the odds of a CHD
among age category 3 subjects to the odds of a CHD among
age category 1 subjects, where the subjects being compared *are of the
same gender and have the same iron consumption (ie. NEWIRON)*. Since
we have NEWIRON and FEMALE in the model, the coefficient for
AGE(3) "controls" for these factors -- that is, compares age=3 to age=1
for fixed values of NEWIRON and FEMALE. This leads to b4 as a comparison
of the log odds of a CHD for:
(NEWIRON=0, FEMALE=0, AGE=3) versus (NEWIRON=0, FEMALE=0, AGE=1), *or*
(NEWIRON=1, FEMALE=0, AGE=3) versus (NEWIRON=1, FEMALE=0, AGE=1), *or*
(NEWIRON=0, FEMALE=1, AGE=3) versus (NEWIRON=0, FEMALE=1, AGE=1), *or*
(NEWIRON=1, FEMALE=1, AGE=3) versus (NEWIRON=1, FEMALE=1, AGE=1).
1(g) (I'll cut-and-paste 1f and substitute AGE(4) for AGE(3)):
exp( b5 ) is the odds ratio comparing the odds of a CHD
among age category 4 subjects to the odds of a CHD among
age category 1 subjects, where the subjects being compared *are of the
same gender and have the same iron consumption (ie. NEWIRON)*. Since
we have NEWIRON and FEMALE in the model, the coefficient for
AGE(4) "controls" for these factors -- that is, compares age=4 to age=1
for fixed values of NEWIRON and FEMALE. This leads to b5 as a comparison
of the log odds of a CHD for:
(NEWIRON=0, FEMALE=0, AGE=4) versus (NEWIRON=0, FEMALE=0, AGE=1), *or*
(NEWIRON=1, FEMALE=0, AGE=4) versus (NEWIRON=1, FEMALE=0, AGE=1), *or*
(NEWIRON=0, FEMALE=1, AGE=4) versus (NEWIRON=0, FEMALE=1, AGE=1), *or*
(NEWIRON=1, FEMALE=1, AGE=4) versus (NEWIRON=1, FEMALE=1, AGE=1).
1(h) We obtain odds ratios after calculating the difference in the log odds
and then exponentiating it:
log odds for group 1 - log odds for group 2 =
log ( odds for group 1 / odds for group 2 ) =
log ( odds ratio comparing group 1 to group 2).
For this example we can proceed as follows:
log odds for (NEWIRON=0,FEMALE=0,AGE=4) = b0 + b5
log odds for (NEWIRON=0,FEMALE=0,AGE=3) = b0 + b4
log odds for (NEWIRON=0,FEMALE=0,AGE=4)
- log odds for (NEWIRON=0,FEMALE=0,AGE=3) = b5 - b4
So that exp( b5 - b4 ) is the odds ratio comparing the odds of disease
among AGE=4 low iron, female subjects to the odds of disease among AGE=3
low iron, female subjects.
The following output is from Stata's "lincom" command, which produces
estimates and confidence intervals for such combinations of parameters:
.
. ***** let's get STATA to calculate the comparison in 1(h)
. lincom _Iage_4 - _Iage_3, or
( 1) - _Iage_3 + _Iage_4 = 0.0
------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .41756 .1144447 -3.19 0.001 .2440179 .7145226
------------------------------------------------------------------------------
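The lincom result can be reproduced by hand from the coefficient estimates of
the adjusted model (shown in 1(i) below); a sketch:

```python
import math

# Odds ratio for AGE=4 vs AGE=3 is exp(b5 - b4), using the
# coefficients from the model with newiron, female, and age dummies.
b4 = 0.8589874    # _Iage_3
b5 = -0.0143397   # _Iage_4
or_43 = math.exp(b5 - b4)

assert abs(or_43 - 0.41756) < 0.0005
```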
1(i) We have the following fitted logistic regression models (here just
showing the "beta" coefficients, not the odds ratios, ie. the exp( beta )'s):
Logit estimates Number of obs = 908
LR chi2(4) = 120.26
Prob > chi2 = 0.0000
Log likelihood = -539.27623 Pseudo R2 = 0.1003
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -1.632189 .1838605 -8.88 0.000 -1.992549 -1.271829
_Iage_2 | .6817876 .2027506 3.36 0.001 .2844036 1.079172
_Iage_3 | .8536424 .1990066 4.29 0.000 .4635965 1.243688
_Iage_4 | -.022819 .2911035 -0.08 0.938 -.5933713 .5477333
_cons | -.6085052 .1579415 -3.85 0.000 -.9180648 -.2989455
------------------------------------------------------------------------------
. lrtest, saving(1)
Logit estimates Number of obs = 908
LR chi2(5) = 120.58
Prob > chi2 = 0.0000
Log likelihood = -539.11767 Pseudo R2 = 0.1006
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
newiron | .0898209 .1593785 0.56 0.573 -.2225551 .402197
female | -1.612586 .1870517 -8.62 0.000 -1.979201 -1.245972
_Iage_2 | .6850382 .2028911 3.38 0.001 .2873789 1.082698
_Iage_3 | .8589874 .1993287 4.31 0.000 .4683103 1.249664
_Iage_4 | -.0143397 .2916048 -0.05 0.961 -.5858747 .5571953
_cons | -.6455109 .1712649 -3.77 0.000 -.9811841 -.3098378
------------------------------------------------------------------------------
. lrtest, saving(2)
The likelihood ratio test is obtained by comparing the maximized log
likelihoods:
LR = 2 (logL2 - logL1) = 2( -539.12 - -539.28 ) = 0.32
Since the models being compared only differ by one parameter (one covariate
is excluded -- NEWIRON) the degrees of freedom for this test is df=1.
The null hypothesis is that the coefficient for the variable that is not in the
richer model is 0:
H0: b1=0 (where b1 is the coefficient of NEWIRON)
H1: b1 is not equal to zero.
. lrtest, using(2) model(1)
Logistic: likelihood-ratio test chi2(1) = 0.32
Prob > chi2 = 0.5733
Comparing the LR statistic to a chi2(1) distribution we obtain the p-value,
P[ chi2(1) > 0.32 ] = 0.573. Therefore, we fail to reject H0 and conclude
that there is no significant iron effect, after adjusting for age and gender.
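The arithmetic of the LR test can be sketched as follows (for a chi-square
with 1 df the upper tail probability is erfc(sqrt(x/2))):

```python
import math

# Likelihood ratio test comparing the two nested fits.
logL1 = -539.27623   # model without newiron
logL2 = -539.11767   # model with newiron
lr = 2 * (logL2 - logL1)
p = math.erfc(math.sqrt(lr / 2))   # chi-square(1 df) upper tail

assert abs(lr - 0.32) < 0.005
assert abs(p - 0.5733) < 0.001
```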
1(j) The estimated common odds ratio using logistic regression is exp(.0898)
which is 1.094, 95% CI = ( 0.800, 1.495). Using Mantel-Haenszel we obtain a
similar point estimate and confidence interval: OR estimate = 1.104,
95% CI = (0.809, 1.507). These estimates have exactly the same
interpretation -- the comparison of the odds of disease
for high versus low iron controlling for age and gender.
1(k) A priori we may expect age to be related to disease and may or may not
expect age to be related to iron consumption. If age is not related to iron
then it would be a "precision" variable, while if age were related to both the
outcome (disease) and the exposure (newiron) then it would be a potential
confounder. We do find some evidence for an association between disease and
age (note that 48% of cases are older than 59 while only 42% of controls are
older than 59). Similarly we find some association between age and iron
(46% of low iron subjects are older than 59 while 40% of high iron subjects
are older than 59). However, both associations appear rather small.
To assess the confounding influence we can compare the estimated effect
of interest both adjusting and not adjusting for age:
newiron OR in model with female and age: 1.094
newiron OR in model with female and not age: 1.074
Thus, we see a small change in the estimated odds ratio when we fail to
control for age, indicating that adjusting for age has little qualitative
impact on our conclusions regarding iron.
1(l) Female is clearly important to include in the model. Not only is
gender strongly predictive of the outcome, but it is associated with the
exposure of interest. Furthermore, the estimated crude iron odds ratio
is 1.475 and statistically significant, however as shown in answer to
question 1(k), controlling for gender alone reduces the estimated odds
ratio to only 1.074, and the comparison is no longer statistically
significant. Therefore, since controlling for gender has a large impact
on the estimated iron odds ratio, it is important to include gender in the
logistic regression model.
1(m) In the following model we allow an interaction between newiron
and female:
logit[ pi(X) ] = b0 + b1*NEWIRON
+ b2*FEMALE
+ b3*AGE(2)
+ b4*AGE(3)
+ b5*AGE(4)
+ b6*NEWIRON*FEMALE
log odds (NEWIRON=1, FEMALE=0, AGE=j ) = b0 + b1 + b(j+1)
log odds (NEWIRON=0, FEMALE=0, AGE=j ) = b0 + b(j+1)
log odds (NEWIRON=1, FEMALE=0, AGE=j ) -
log odds (NEWIRON=0, FEMALE=0, AGE=j ) = b1
and
log odds (NEWIRON=1, FEMALE=1, AGE=j ) = b0 + b1 + b2 + b(j+1) + b6
log odds (NEWIRON=0, FEMALE=1, AGE=j ) = b0 + b2 + b(j+1)
log odds (NEWIRON=1, FEMALE=1, AGE=j ) -
log odds (NEWIRON=0, FEMALE=1, AGE=j ) = b1 + b6
Therefore, exp( b1 ) is the odds ratio comparing high iron to low
iron for males (FEMALE=0), controlling for age, and exp( b1+b6 ) is
the odds ratio comparing high iron to low iron for females (FEMALE=1),
controlling for age.
If b6=0 then the OR for males is exp( b1 ) and the OR for females is
then also exp( b1 + 0 ) = exp( b1 ). Therefore, the null hypothesis
of no interaction, or common odds ratio, or homogeneity of odds ratios
is given by:
H0: b6=0
H1: b6 not equal to 0
1(n) Using logistic regression that includes: newiron female i.age
and ironXgen (the product of newiron and female) leads to the
estimate of b6 = 0.451, with a Z statistic (also called Wald statistic)
of 0.451/0.455 = 0.99, p-value=0.321. Therefore, we fail to reject
the null hypothesis that b6=0 and conclude that the age-adjusted OR for
males is not significantly different from the age-adjusted OR for females.
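A sketch of that Wald calculation (the estimate and standard error are as
quoted above):

```python
import math

# Wald test for the interaction coefficient, H0: b6 = 0.
b6, se6 = 0.451, 0.455
z = b6 / se6
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

assert abs(z - 0.99) < 0.005
assert abs(p - 0.321) < 0.005
```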
2(a) Here we have the ordered categorical variable, ALC, that takes the
values 1, 2, 3, 4, representing levels of alcohol use.
The model:
logit[ pi(X) ] = b0 + b1*ALC
can be used to detect a trend in the odds of disease with increasing categories
of ALC.
Recall that earlier our trend test was based on the following:
Let p(j) = the probability that Y=1 (here disease) for category j of a
covariate (here ALC).
H0: p(1)=p(2)=p(3)=p(4)
H1: p(1)<=p(2)<=p(3)<=p(4) with one inequality strictly < , or
p(1)>=p(2)>=p(3)>=p(4) with one inequality strictly >
Notice that the alternative can be viewed in terms of odds ratios:
p(1) <= p(2) <= p(3) <= p(4) implies
p(1)/[1-p(1)] <= p(2)/[1-p(2)] <= p(3)/[1-p(3)] <= p(4)/[1-p(4)]
that is, increasing probabilities implies increasing odds. Then, this
becomes odds ratios if we divide each of these by p(1)/[1-p(1)]:
1 <= OR(2,1) <= OR(3,1) <= OR(4,1) where
OR(j,k) is the odds ratio comparing category j to category k.
Thus, the alternative hypothesis for the trend test is equivalent to
looking for an increasing odds ratio (or increasing log odds ratio).
We obtain the following model estimates: (only showing the "beta" estimates)
Logit estimates Number of obs = 972
LR chi2(1) = 143.74
Prob > chi2 = 0.0000
Log likelihood = -422.18354 Pseudo R2 = 0.1455
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.044229 .0935454 11.163 0.000 .8608836 1.227575
_cons | -3.522645 .228131 -15.441 0.000 -3.969774 -3.075517
------------------------------------------------------------------------------
A trend test in terms of the regression coefficient is:
H0: b1=0
H1: b1 not equal to zero.
We can use either the likelihood ratio test (given above as LR chi2(1),
with a test statistic of 143.74 on df=1, p-value<0.001) or the Wald
statistic (given above as 1.044/0.0935 = 11.163, p-value<0.001). Using
either test we would reject the null hypothesis.
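The Wald statistic for the trend is again just (estimate/s.e.); a quick
check:

```python
# Wald statistic for the linear trend in ALC.
b1, se = 1.044229, 0.0935454
z = b1 / se

assert abs(z - 11.163) < 0.005
```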
2(b) Next we fit a model that uses dummy variables. This can help us to
"see" the increase in the log odds across the categories of the predictor:
logit[ pi(X) ] = b0 + b1*ALC(2) + b2*ALC(3) + b3*ALC(4)
The output from this fit is:
Logit estimates Number of obs = 972
LR chi2(3) = 145.55
Prob > chi2 = 0.0000
Log likelihood = -421.27731 Pseudo R2 = 0.1473
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
Ialc_2 | 1.263438 .2323757 5.437 0.000 .8079897 1.718886
Ialc_3 | 2.046657 .2611432 7.837 0.000 1.534825 2.558488
Ialc_4 | 3.296359 .3236825 10.184 0.000 2.661953 3.930765
_cons | -2.580739 .1925972 -13.400 0.000 -2.958223 -2.203256
------------------------------------------------------------------------------
Here we find that the estimate of b1 (ALC=2 versus ALC=1) is 1.26, the
estimate of b2 is 2.05 (ALC=3 versus ALC=1) and the estimate of b3 is
3.30 (ALC=4 versus ALC=1). These show a very steady increase in the log
odds as we move up the categories of ALC (approximately 1-unit increases!).
This suggests that the linear model in 2(a) is good.
"Is the model in (a) nested within this model?" -- Yes. The next model
fit helps us to see that.
2(c) This model takes the linear term ALC and then adds the dummy
variables ALC(3) and ALC(4). It has the same number of parameters as the
dummy variable model in 2(b). Let's compare the two models:
                  log odds
         dummy variables      linear + (2) dummy
ALC=1    b0                   b0 + b1*(1)
ALC=2    b0 + b1              b0 + b1*(2)
ALC=3    b0 + b2              b0 + b1*(3) + b2
ALC=4    b0 + b3              b0 + b1*(4) + b3
*** It is useful to try and graph these models. ***
Here is the fitted model: (log odds only)
Logit estimates Number of obs = 972
LR chi2(3) = 145.55
Prob > chi2 = 0.0000
Log likelihood = -421.27731 Pseudo R2 = 0.1473
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.263438 .2323757 5.437 0.000 .8079897 1.718886
alc3 | -.4802187 .3685306 -1.303 0.193 -1.202525 .2420881
alc4 | -.4939538 .6067902 -0.814 0.416 -1.683241 .695333
_cons | -3.844177 .4065459 -9.456 0.000 -4.640992 -3.047362
------------------------------------------------------------------------------
Notice that we have the following fitted log odds:
ALC=1 -3.844 + 1.263*(1) = -2.581
ALC=2 -3.844 + 1.263*(2) = -1.318 = -2.581 + 1.263
ALC=3 -3.844 + 1.263*(3) -0.480 = -0.535 = -2.581 + 2.046
ALC=4 -3.844 + 1.263*(4) -0.494 = 0.714 = -2.581 + 3.295
This gives exactly the same fitted values (log odds) as the dummy
variable model!
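This equivalence is easy to verify numerically; a sketch comparing the two
parameterizations:

```python
# Dummy-variable model: intercept plus one coefficient per category.
d0, d2, d3, d4 = -2.580739, 1.263438, 2.046657, 3.296359
dummy = [d0, d0 + d2, d0 + d3, d0 + d4]

# Linear-plus-two-dummies model from the output above.
c0, b_lin, a3, a4 = -3.844177, 1.263438, -0.4802187, -0.4939538
extras = [0.0, 0.0, a3, a4]
linear = [c0 + b_lin * alc + e for alc, e in zip([1, 2, 3, 4], extras)]

# The fitted log odds agree category by category.
for u, v in zip(dummy, linear):
    assert abs(u - v) < 1e-5
```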
Clearly the linear model is nested within this model. We obtain the linear
model as a special case where:
H0: b2=b3=0 (coefficients of ALC(3) and ALC(4) are zero)
H1: not both are zero.
2(d) A likelihood ratio test of the model in 2(c) versus the model in 2(a)
yields a likelihood ratio statistic of:
LR = 2*( -421.28 - -422.18 ) = 1.81.
This has df=2 (2 parameters set to zero) and comparing the LR statistic to
a chi2(df=2) yields:
P[ chi2(df=2) > 1.81 ] = 0.40
Therefore we would not reject the null hypothesis and conclude that the linear
model is adequate.
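For df=2 the chi-square upper tail has the closed form exp(-x/2), so this
p-value can be checked directly:

```python
import math

# LR test of the linear ALC model against the dummy-variable model.
logL_linear = -422.18354
logL_dummy = -421.27731
lr = 2 * (logL_dummy - logL_linear)
p = math.exp(-lr / 2)   # chi-square(2 df) upper tail

assert abs(lr - 1.81) < 0.005
assert abs(p - 0.40) < 0.005
```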
2(e) Age adjustment: Using the model with ALC as a linear term we can add
age group dummy variables. The fitted model here is:
Logit estimates Number of obs = 972
LR chi2(6) = 254.61
Prob > chi2 = 0.0000
Log likelihood = -366.74772 Pseudo R2 = 0.2577
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
alc | 1.093055 .1033092 10.580 0.000 .8905723 1.295537
Iage_2 | 1.580665 1.075439 1.470 0.142 -.5271578 3.688487
Iage_3 | 3.30387 1.031849 3.202 0.001 1.281483 5.326256
Iage_4 | 3.826048 1.027778 3.723 0.000 1.81164 5.840456
Iage_5 | 4.214519 1.03375 4.077 0.000 2.188407 6.240631
Iage_6 | 4.307893 1.083363 3.976 0.000 2.184542 6.431245
_cons | -6.982177 1.050306 -6.648 0.000 -9.040739 -4.923616
------------------------------------------------------------------------------
The resulting age adjusted odds ratio for ALC is given by:
exp( 1.093 ) = 2.983
This odds ratio compares the odds of disease for two groups that differ
by 1 unit in ALC category, where age is controlled (ie. held fixed).
That is, this is the odds ratio comparing ALC=2 to ALC=1, or comparing
ALC=3 to ALC=2, or comparing ALC=4 to ALC=3, for fixed values of age.
The estimate for this odds ratio that did not adjust for age was given in
2(a) as
exp( 1.044 ) = 2.841
Age adjustment results in only a minor change in the odds ratio estimate. This
suggests that AGE does not confound ALC.
2(f) Inspection of the coefficients for the AGE dummy variables can be used
to see if a linear model might be adequate for AGE. For these coefficients
we find that there is a large change up until AGE category 4, but then the
increase tails off. This hints that a linear model will not be adequate.
Formally, we can test the linear model versus a dummy variable model using
the likelihood ratio test. For this we use:
H0: linear model -- log odds = b0 + b1*ALC + b2*AGE
H1: dummy variable -- log odds = b0 + b1*ALC + b2*AGE(2)
+ b3*AGE(3)
+ b4*AGE(4)
+ b5*AGE(5)
+ b6*AGE(6)
Fitting the null model (not shown) yields a maximized log likelihood of
-375.60.
The likelihood ratio test is then:
LR = 2*( -366.748 - -375.60 ) = 17.70
Comparing this to a chi2(df=4) yields:
P[ chi2(df=4) > 17.70 ] = 0.001
Therefore we would reject H0, the linear model, in favor of the dummy variable
model.
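For df=4 the chi-square upper tail is exp(-x/2)*(1 + x/2); a check of the
calculation:

```python
import math

# LR test of linear AGE against the AGE dummy-variable model.
logL_linear = -375.60
logL_dummy = -366.74772
lr = 2 * (logL_dummy - logL_linear)
p = math.exp(-lr / 2) * (1 + lr / 2)   # chi-square(4 df) upper tail

assert abs(lr - 17.70) < 0.01
assert abs(p - 0.0014) < 0.0002
```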
2(g) Looking back at the grouping of alcohol use we find:
ALC = 1 for <40 g/day of alcohol use
ALC = 2 for 40-79 g/day
ALC = 3 for 80-119 g/day
ALC = 4 for 120+ g/day
Using this, we can say that adjacent categories roughly represent 40 g/day
changes in alcohol use.
Therefore, since b1 (in model with b1*ALC category) represents the increase
in the log odds for a 1 unit category change, we would expect a b1 increase
to be associated with a 40 g/day increase in the continuous alcohol
measurement.
So, I'd predict that if we used the actual g/day of alcohol use as a
predictor, then we'd find the new coefficient of alcohol to
be about b1/40.
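As rough arithmetic (a guess under the assumption that the category scale is
approximately linear in g/day):

```python
# One ALC category ~ 40 g/day, so the predicted per-gram coefficient
# is roughly the per-category coefficient divided by 40.
b1_category = 1.044229          # coefficient per category (from 2(a))
b1_per_gram = b1_category / 40  # predicted coefficient per g/day

assert abs(b1_per_gram - 0.0261) < 0.0005
```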