***************************
*  Biostatistics 513
*  Exercise Set 2, 2002
***************************

1(a): STATA can provide the following summaries:

  Age Group |      Freq.     Percent        Cum.
------------+-----------------------------------
      25-34 |        116       11.93       11.93
      35-44 |        196       20.16       32.10
      45-54 |        213       21.91       54.01
      55-64 |        242       24.90       78.91
      65-74 |        161       16.56       95.47
        75+ |         44        4.53      100.00
------------+-----------------------------------
      Total |        972      100.00

    Tobacco |      Freq.     Percent        Cum.
------------+-----------------------------------
   0-9g/day |        525       54.01       54.01
 10-19g/day |        233       23.97       77.98
 20-29g/day |        132       13.58       91.56
   30+g/day |         82        8.44      100.00
------------+-----------------------------------
      Total |        972      100.00

    Alcohol |      Freq.     Percent        Cum.
------------+-----------------------------------
   <40g/day |        412       42.39       42.39
 40-79g/day |        355       36.52       78.91
80-119g/day |        138       14.20       93.11
  120+g/day |         67        6.89      100.00
------------+-----------------------------------
      Total |        972      100.00

From this we can make the following statements:

AGE -- approximately 68% of the subjects are 45 or older. The median
       age in this sample is between 45 and 54.

TOB -- tobacco use appears quite high in this group. There is no
       unexposed reference group, although 54% of the subjects consume
       less than 10 g/day.
       NOTE: get more information on grams/cigarette.

ALC -- the range in alcohol consumption is quite large. A typical
       "drink" has approximately 14 grams of alcohol. In this sample
       42% have less than 40 g/day (less than 3 drinks/day) while 21%
       have 80 g/day or more (more than 5 drinks/day).
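The percentages and the grams-to-drinks conversion quoted in 1(a) can be checked with a few lines of arithmetic. The following is only a sketch in Python (not part of the original Stata session); the variable names are illustrative, and the counts are copied from the frequency tables above.

```python
# Frequencies copied from the 1(a) summaries.
age = {"25-34": 116, "35-44": 196, "45-54": 213,
       "55-64": 242, "65-74": 161, "75+": 44}
alc = {"<40": 412, "40-79": 355, "80-119": 138, "120+": 67}

GRAMS_PER_DRINK = 14  # the "typical drink" figure quoted in the text

n = sum(age.values())
pct_45_plus = 100 * sum(v for k, v in age.items()
                        if k not in ("25-34", "35-44")) / n
pct_alc_80_plus = 100 * (alc["80-119"] + alc["120+"]) / n

# Find the median age group: accumulate counts until half the sample is reached.
cum = 0
for group, count in age.items():
    cum += count
    if cum >= n / 2:
        median_group = group
        break

print("aged 45 or older (%):", round(pct_45_plus, 1))
print("alcohol 80+ g/day (%):", round(pct_alc_80_plus, 1))
print("median age group:", median_group)
print("40 g/day in drinks:", round(40 / GRAMS_PER_DRINK, 1))
print("80 g/day in drinks:", round(80 / GRAMS_PER_DRINK, 1))
```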

1(b) STATA output:

Case/Contr |                   Tobacco
 ol Status |  0-9g/day  10-19g/da  20-29g/da   30+g/day |     Total
-----------+--------------------------------------------+----------
   Control |       447        175         99         51 |       772
           |     57.90      22.67      12.82       6.61 |    100.00
-----------+--------------------------------------------+----------
      Case |        78         58         33         31 |       200
           |     39.00      29.00      16.50      15.50 |    100.00
-----------+--------------------------------------------+----------
     Total |       525        233        132         82 |       972
           |     54.01      23.97      13.58       8.44 |    100.00

          Pearson chi2(3) =  29.6382   Pr = 0.000

Case/Contr |                   Alcohol
 ol Status |  <40g/day  40-79g/da  80-119g/d  120+g/day |     Total
-----------+--------------------------------------------+----------
   Control |       383        280         87         22 |       772
           |     49.61      36.27      11.27       2.85 |    100.00
-----------+--------------------------------------------+----------
      Case |        29         75         51         45 |       200
           |     14.50      37.50      25.50      22.50 |    100.00
-----------+--------------------------------------------+----------
     Total |       412        355        138         67 |       972
           |     42.39      36.52      14.20       6.89 |    100.00

          Pearson chi2(3) = 157.9073   Pr = 0.000

For TOB we can phrase our hypotheses as follows:

H0: the distribution of tobacco consumption is the same for cases
    and controls.
H1: the distribution of tobacco consumption is not the same for cases
    and controls.

Here we have Pearson's chi-square statistic equal to 29.64 with
(2-1)*(4-1) = 3 degrees of freedom. A value larger than 29.64 under the
null hypothesis is very rare (p-value < 0.001) so we reject the null and
conclude that there is a difference in tobacco consumption comparing
cases and controls. We can see a clear pattern in these data -- more
controls have low tobacco consumption (58% in the lowest category) while
cases have higher tobacco consumption (only 39% in the lowest category).

For ALC we can phrase our hypotheses as follows:

H0: the distribution of alcohol consumption is the same for cases
    and controls.
H1: the distribution of alcohol consumption is not the same for cases
    and controls.

Here we have Pearson's chi-square statistic equal to 157.91 with
(2-1)*(4-1) = 3 degrees of freedom. A value larger than 157.91 under the
null hypothesis is very rare (p-value < 0.001) so we reject the null and
conclude that there is a difference in alcohol consumption comparing
cases and controls. We can see a clear pattern in these data -- more
controls have low alcohol consumption (50% in the lowest category) while
cases have high alcohol consumption (48% in the highest two categories).

Note: Pearson's chi-square does not take the category order into account.

1(c) TOB "dose" effect?

------------+-------------------------------------------------------------
        tob |  Odds ratio       chi2       P>chi2     [95% Conf. Interval]
------------+-------------------------------------------------------------
   0-9g/day |    1.000000          .            .             .           .
  10-19g/da |    1.899341      11.02       0.0009      1.292147    2.791862
  20-29g/da |    1.910256       7.72       0.0055      1.200295    3.040153
   30+g/day |    3.483409      25.31       0.0000      2.074288    5.849783
------------+-------------------------------------------------------------
Test of homogeneity (equal odds): chi2(3)  =    29.61
                                  Pr>chi2  =   0.0000

Score test for trend of odds:     chi2(1)  =    26.99
                                  Pr>chi2  =   0.0000

Here we see that the odds ratio for each of the two middle categories is
approximately 1.9 (with the lowest category as the reference). The
highest category has an OR of 3.48 compared to the lowest category. We
do see an increase in the risk of disease (measured by the odds ratios)
but it doesn't appear to be strictly monotone -- since 20-29 g/day does
not appear to have greater risk than 10-19 g/day.

1(d) ALC "dose" effect?

------------+-------------------------------------------------------------
        alc |  Odds ratio       chi2       P>chi2     [95% Conf. Interval]
------------+-------------------------------------------------------------
   <40g/day |    1.000000          .            .             .           .
  40-79g/da |    3.537562      32.25       0.0000      2.220617    5.635524
  80-119g/d |    7.741974      74.30       0.0000      4.462553   13.431361
  120+g/day |   27.014107     159.16       0.0000     12.413378   58.788344
------------+-------------------------------------------------------------
Test of homogeneity (equal odds): chi2(3)  =   157.74
                                  Pr>chi2  =   0.0000

Score test for trend of odds:     chi2(1)  =   151.90
                                  Pr>chi2  =   0.0000

Here we find an increase in the odds of disease for every category of
alcohol consumption. The odds ratios shown above compare the odds of
disease in the given category to the odds in the lowest category. The
pattern is consistent with a monotone dose effect.

1(e) Trend test for TOB.

Define the following:

P(1,j) = probability that Y=1 for TOB=j (category j of tobacco).

H0: P(1,1)=P(1,2)=P(1,3)=P(1,4)
H1: Either P(1,1)<=P(1,2)<=P(1,3)<=P(1,4) with at least one <
    or     P(1,1)>=P(1,2)>=P(1,3)>=P(1,4) with at least one >

Since these data are case-control data it would be more appropriate to
state these hypotheses in terms of odds ratios. This leads to:

Define:
OR(j,k) = (odds of disease in category j of TOB)/
          (odds of disease in category k of TOB)

Then,
H0: 1=OR(2,1)=OR(3,1)=OR(4,1)
H1: Either 1<=OR(2,1)<=OR(3,1)<=OR(4,1) with at least one <
    or     1>=OR(2,1)>=OR(3,1)>=OR(4,1) with at least one >

From STATA (the tob output in 1(c)) we see that the chi-square statistic
is 26.99 with df=1 and p-value < 0.001. Therefore we reject the null in
favor of a trend in the odds of disease.

1(f) Trend test for ALC.

Define:
OR(j,k) = (odds of disease in category j of ALC)/
          (odds of disease in category k of ALC)

Then,
H0: 1=OR(2,1)=OR(3,1)=OR(4,1)
H1: Either 1<=OR(2,1)<=OR(3,1)<=OR(4,1) with at least one <
    or     1>=OR(2,1)>=OR(3,1)>=OR(4,1) with at least one >

From STATA (the alc output in 1(d)) we see that the chi-square statistic
is 151.90 with df=1 and p-value < 0.001. Therefore we reject the null in
favor of a trend in the odds of disease.

1(g) Both ALC and TOB appear to be associated with disease.
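
The odds ratios and trend statistics reported in 1(c)-1(f) can be reproduced by hand from the 2x4 tables in 1(b). The Python sketch below is not the Stata code used for this exercise. It assumes category scores 1, 2, 3, 4 and a Cochran-Armitage-type score statistic with an N-1 variance divisor, a choice that happens to match the printed "Score test for trend of odds" values; the function names are illustrative.

```python
def odds_ratios(cases, controls):
    """Odds ratio of each category versus the first (reference) category."""
    ref_odds = cases[0] / controls[0]
    return [(d / h) / ref_odds for d, h in zip(cases, controls)]


def trend_chi2(cases, controls):
    """Score statistic (df=1) for a linear trend, category scores 1, 2, 3, ..."""
    totals = [d + h for d, h in zip(cases, controls)]
    scores = range(1, len(cases) + 1)
    N, D = sum(totals), sum(cases)
    # Observed minus expected score total among the cases.
    u = sum(x * (d - n * D / N) for x, d, n in zip(scores, cases, totals))
    # Corrected sum of squares of the scores over all subjects.
    sxx = (sum(n * x * x for x, n in zip(scores, totals))
           - sum(n * x for x, n in zip(scores, totals)) ** 2 / N)
    # Assumed variance form, with N*(N-1) in the denominator.
    var = D * (N - D) / (N * (N - 1)) * sxx
    return u * u / var


# Counts copied from the 1(b) tables (lowest to highest category).
tob_cases, tob_controls = [78, 58, 33, 31], [447, 175, 99, 51]
alc_cases, alc_controls = [29, 75, 51, 45], [383, 280, 87, 22]

tob_or = odds_ratios(tob_cases, tob_controls)
alc_or = odds_ratios(alc_cases, alc_controls)
tob_trend = trend_chi2(tob_cases, tob_controls)
alc_trend = trend_chi2(alc_cases, alc_controls)

print("TOB odds ratios:", [round(v, 3) for v in tob_or])
print("ALC odds ratios:", [round(v, 3) for v in alc_or])
print("TOB trend chi2:", round(tob_trend, 2))
print("ALC trend chi2:", round(alc_trend, 2))
```

Because the statistic depends only on the ordering scores, using 0..3 instead of 1..4 would give the same value.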
We may be concerned that AGE is a confounder if it is related to both
the exposure and the outcome. Let's check:

           | Case/Control Status
 Age Group |   Control       Case |     Total
-----------+----------------------+----------
     25-34 |       115          1 |       116
           |     14.90       0.50 |     11.93
-----------+----------------------+----------
     35-44 |       187          9 |       196
           |     24.22       4.50 |     20.16
-----------+----------------------+----------
     45-54 |       167         46 |       213
           |     21.63      23.00 |     21.91
-----------+----------------------+----------
     55-64 |       166         76 |       242
           |     21.50      38.00 |     24.90
-----------+----------------------+----------
     65-74 |       106         55 |       161
           |     13.73      27.50 |     16.56
-----------+----------------------+----------
       75+ |        31         13 |        44
           |      4.02       6.50 |      4.53
-----------+----------------------+----------
     Total |       772        200 |       972
           |    100.00     100.00 |    100.00

          Pearson chi2(5) =  96.0779   Pr = 0.000

. tabulate age alc [freq=count], chi2 row

           |                   Alcohol
 Age Group |  <40g/day  40-79g/da  80-119g/d  120+g/day |     Total
-----------+--------------------------------------------+----------
     25-34 |        61         45          5          5 |       116
           |     52.59      38.79       4.31       4.31 |    100.00
-----------+--------------------------------------------+----------
     35-44 |        86         80         20         10 |       196
           |     43.88      40.82      10.20       5.10 |    100.00
-----------+--------------------------------------------+----------
     45-54 |        78         81         39         15 |       213
           |     36.62      38.03      18.31       7.04 |    100.00
-----------+--------------------------------------------+----------
     55-64 |        89         84         43         26 |       242
           |     36.78      34.71      17.77      10.74 |    100.00
-----------+--------------------------------------------+----------
     65-74 |        71         53         29          8 |       161
           |     44.10      32.92      18.01       4.97 |    100.00
-----------+--------------------------------------------+----------
       75+ |        27         12          2          3 |        44
           |     61.36      27.27       4.55       6.82 |    100.00
-----------+--------------------------------------------+----------
     Total |       412        355        138         67 |       972
           |     42.39      36.52      14.20       6.89 |    100.00

          Pearson chi2(15) =  40.9231   Pr = 0.000

. tabulate age tob [freq=count], chi2 row

           |                   Tobacco
 Age Group |  0-9g/day  10-19g/da  20-29g/da   30+g/day |     Total
-----------+--------------------------------------------+----------
     25-34 |        70         19         11         16 |       116
           |     60.34      16.38       9.48      13.79 |    100.00
-----------+--------------------------------------------+----------
     35-44 |       109         43         27         17 |       196
           |     55.61      21.94      13.78       8.67 |    100.00
-----------+--------------------------------------------+----------
     45-54 |       104         57         33         19 |       213
           |     48.83      26.76      15.49       8.92 |    100.00
-----------+--------------------------------------------+----------
     55-64 |       117         65         38         22 |       242
           |     48.35      26.86      15.70       9.09 |    100.00
-----------+--------------------------------------------+----------
     65-74 |        99         38         20          4 |       161
           |     61.49      23.60      12.42       2.48 |    100.00
-----------+--------------------------------------------+----------
       75+ |        26         11          3          4 |        44
           |     59.09      25.00       6.82       9.09 |    100.00
-----------+--------------------------------------------+----------
     Total |       525        233        132         82 |       972
           |     54.01      23.97      13.58       8.44 |    100.00

          Pearson chi2(15) =  25.3990   Pr = 0.045

We see that the simple Pearson's chi-square statistics suggest the
following:

    AGE is associated with Case/Control status
    AGE is associated with TOB
    AGE is associated with ALC

Since AGE is related to both exposures and to the outcome, it is a
potential confounder.

2(a) Summarize TX and Y with 2x2 Table:

           |          yn
       txn |       (-)        (+) |     Total
-----------+----------------------+----------
   control |        96         47 |       143
           |     67.13      32.87 |    100.00
-----------+----------------------+----------
 treatment |        75         55 |       130
           |     57.69      42.31 |    100.00
-----------+----------------------+----------
     Total |       171        102 |       273
           |     62.64      37.36 |    100.00

          Pearson chi2(1) =   2.5932   Pr = 0.107

                 | txn                    |             Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+----------------------
           Cases |        55          47  |        102      0.5392
        Controls |        75          96  |        171      0.4386
-----------------+------------------------+----------------------
           Total |       130         143  |        273      0.4762
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
      Odds ratio |         1.497872       |   .916455   2.448146  (Cornfield)
 Attr. frac. ex. |         .3323864       | -.0911611   .5915276  (Cornfield)
 Attr. frac. pop |         .1792279       |
                 +-----------------------------------------------
                                chi2(1) =     2.59  Pr>chi2 = 0.1073

We see from the first table that the odds of a favorable response for
the treatment group are estimated as 0.423/(1-0.423) = 0.73, which we
can compare to the odds of a favorable response in the control group,
0.329/(1-0.329) = 0.49. This yields the odds ratio 1.50 -- comparing the
odds of a favorable response among the treated to the odds of a
favorable response among the control subjects.

The chi-square statistic tests for association between TX and Y, where
the null can be represented as H0: OR=1. The statistic is 2.59 which
when compared to a chi-square with df=1 gives a p-value of 0.11.
Therefore, we would not reject the null at the 5% significance level.

2(b) Summarize the response to treatment after stratifying on clinic.

Here's a summary table showing the success rate for the treatment and
control groups within each clinic.

----------+-----------
clinic    |
and txn   |   mean(yn)
----------+-----------
1         |
  control |       0.27
treatment |       0.31
          |
    Total |       0.29
----------+-----------
2         |
  control |       0.69
treatment |       0.80
          |
    Total |       0.73
----------+-----------
3         |
  control |       0.37
treatment |       0.74
          |
    Total |       0.55
----------+-----------
4         |
  control |       0.06
treatment |       0.12
          |
    Total |       0.09
----------+-----------
5         |
  control |       0.00
treatment |       0.35
          |
    Total |       0.21
----------+-----------
6         |
  control |       0.00
treatment |       0.09
          |
    Total |       0.05
----------+-----------
7         |
  control |       0.11
treatment |       0.20
          |
    Total |       0.14
----------+-----------
8         |
  control |       0.86
treatment |       0.67
          |
    Total |       0.77
----------+-----------

This table shows that the treatment group has a higher success rate for
7 of the 8 clinics.
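
As a check on the 2(a) arithmetic, the crude odds ratio, Pearson chi-square, and p-value can be recomputed directly from the four cell counts. This is a minimal Python sketch rather than the Stata commands used above; it uses the uncorrected Pearson formula for a 2x2 table and the identity that a df=1 chi-square tail probability equals erfc(sqrt(x/2)).

```python
import math

# Cell counts from the 2(a) table: treatment arm by response.
trt_pos, trt_neg = 55, 75   # treatment: favorable, unfavorable
ctl_pos, ctl_neg = 47, 96   # control:   favorable, unfavorable

odds_trt = trt_pos / trt_neg
odds_ctl = ctl_pos / ctl_neg
crude_or = odds_trt / odds_ctl

n = trt_pos + trt_neg + ctl_pos + ctl_neg
# Pearson chi-square for a 2x2 table, no continuity correction.
chi2 = n * (trt_pos * ctl_neg - trt_neg * ctl_pos) ** 2 / (
    (trt_pos + trt_neg) * (ctl_pos + ctl_neg)
    * (trt_pos + ctl_pos) * (trt_neg + ctl_neg))
# Upper tail of a chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print("odds (treatment, control):", round(odds_trt, 3), round(odds_ctl, 3))
print("crude OR:", round(crude_or, 4))
print("chi2:", round(chi2, 4), "p:", round(p_value, 4))
```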

2(c) The Mantel-Haenszel Test is used to test whether a common odds
ratio (pooling across the levels of the stratifying variable) is 1. The
null hypothesis is:

H0: common OR=1
H1: common OR not equal to 1

Using STATA we can obtain the following table with the Mantel-Haenszel
test also output:

. cc yn txn [freq=count], by(clinic)

          clinic |       OR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               1 |     1.188      .4378957   3.221515     3.424658  (Cornfield)
               2 |  1.818182      .5020541   6.468947     1.692308  (Cornfield)
               3 |       4.8      1.237863   18.56123     .9210526  (Cornfield)
               4 |  2.285714      .2626826          .     .4242424  (Cornfield)
               5 |         .      1.476771          .            0  (Cornfield)
               6 |         .             0          .            0  (Cornfield)
               7 |         2             0          .     .2857143  (Cornfield)
               8 |  .3333333             0   3.720182     .9230769  (Cornfield)
-----------------+-------------------------------------------------
           Crude |  1.497872       .916455   2.448146               (Cornfield)
    M-H combined |  2.134549       1.17759   3.869174
-----------------+-------------------------------------------------
Test of homogeneity (M-H)      chi2(5) =     4.46  Pr>chi2 = 0.4851

                   Test that combined OR = 1:
                                Mantel-Haenszel chi2(1) =      6.38
                                                Pr>chi2 =    0.0115

We find a chi-square statistic of 6.38 with df=1 that yields a p-value
of 0.01. Therefore, we would reject the null hypothesis.

2(d) Test of homogeneity of odds ratios:

Define:
OR(j) = odds ratio comparing TX=1 to TX=0 at clinic j.

H0: OR(1)=OR(2)=...=OR(8)
H1: at least one OR is different

Note that the homogeneity test typically has df=(strata - 1) but since
we have two strata with zero cells we cannot calculate an OR for those
strata. Thus, STATA wisely drops these from the homogeneity test shown
above and returns a test statistic of 4.46 based on 6 clinics, with
df=5, and p-value 0.49. Therefore we do not reject the null hypothesis;
the data are consistent with the homogeneity assumption.

NOTE: there are other homogeneity tests that STATA will perform such as
the Breslow and Day test (option bd).
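
The Mantel-Haenszel estimate and test in 2(c) can be traced by hand. The Python sketch below is an illustration, not the Stata computation itself. Note the assumption: the per-clinic cell counts are not printed anywhere in this document, so they were reconstructed from the clinic sizes in the 2(f) table and the rounded success rates in 2(b). They reproduce the printed stratum odds ratios and M-H weights, but should be treated as derived values.

```python
# (treated +, treated -, control +, control -) for clinics 1..8.
# ASSUMPTION: reconstructed from the 2(f) clinic sizes and the rounded
# 2(b) success rates; these counts are not printed in the source.
clinics = [
    (11, 25, 10, 27),
    (16, 4, 22, 10),
    (14, 5, 7, 12),
    (2, 14, 1, 16),
    (6, 11, 0, 12),
    (1, 10, 0, 10),
    (1, 4, 1, 8),
    (4, 2, 6, 1),
]

num = den = 0.0                  # pieces of the M-H odds ratio
observed = expected = var = 0.0  # pieces of the M-H chi-square
weights = []
for a, b, c, d in clinics:
    n = a + b + c + d
    num += a * d / n
    den += b * c / n
    weights.append(b * c / n)    # the "M-H Weight" column
    n1, n0 = a + b, c + d        # treated and control totals
    m1, m0 = a + c, b + d        # favorable and unfavorable totals
    observed += a
    expected += n1 * m1 / n
    var += n1 * n0 * m1 * m0 / (n * n * (n - 1))

or_mh = num / den
chi2_mh = (observed - expected) ** 2 / var  # no continuity correction

print("M-H weights:", [round(w, 4) for w in weights])
print("M-H combined OR:", round(or_mh, 4))
print("M-H chi2(1):", round(chi2_mh, 2))
```

Clinics 5 and 6 have a zero cell, so their weights are 0 and their stratum odds ratios are undefined, yet they still contribute to the numerator of the combined estimate and to the test statistic.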

2(e) The estimate of the common odds ratio and confidence interval
appears above or can be obtained from "mhodds".

Mantel-Haenszel estimate controlling for clinic
----------------------------------------------------------------
 Odds ratio    chi2(1)        P>chi2        [95% Conf. Interval]
----------------------------------------------------------------
   2.134549       6.38        0.0115         1.168685   3.898656
----------------------------------------------------------------

The CI suggests that the data support odds ratio values between 1.17
and 3.90.

There are two differences in the stratified summary: the point estimate
is larger (now 2.13, unstratified is 1.50); and the estimate now
achieves significance at the nominal 5% level.

2(f) Is clinic a confounder? It might appear to be based on the large
change in the estimated odds ratio obtained after controlling for
clinic. However, there was roughly a balance between the TX=0 and TX=1
groups at each clinic.

From 2(b) we can clearly see that clinic is associated with outcome (Y).

Is clinic associated with TX?

           |          txn
    clinic |   control  treatment |     Total
-----------+----------------------+----------
         1 |        37         36 |        73
           |     50.68      49.32 |    100.00
-----------+----------------------+----------
         2 |        32         20 |        52
           |     61.54      38.46 |    100.00
-----------+----------------------+----------
         3 |        19         19 |        38
           |     50.00      50.00 |    100.00
-----------+----------------------+----------
         4 |        17         16 |        33
           |     51.52      48.48 |    100.00
-----------+----------------------+----------
         5 |        12         17 |        29
           |     41.38      58.62 |    100.00
-----------+----------------------+----------
         6 |        10         11 |        21
           |     47.62      52.38 |    100.00
-----------+----------------------+----------
         7 |         9          5 |        14
           |     64.29      35.71 |    100.00
-----------+----------------------+----------
         8 |         7          6 |        13
           |     53.85      46.15 |    100.00
-----------+----------------------+----------
     Total |       143        130 |       273
           |     52.38      47.62 |    100.00

          Pearson chi2(7) =   4.3335   Pr = 0.741

We do see some imbalance in the treatment within clinics --
particularly clinics 2 and 7 -- but the chi-square test (p-value 0.74)
gives no evidence that clinic is associated with TX.
There is likely another explanation. The odds ratio that adjusts for
clinic has a different interpretation than the crude odds ratio, and
differences between crude and adjusted odds ratios can result either
from confounding or from adjustment for a variable that is strongly
predictive of outcome yet not associated with the exposure of interest.
The concept of a "precision" variable, that is a variable that is
predictive of outcome but not related to exposure, doesn't work as
cleanly for odds ratios. In linear regression a precision variable does
not change the estimated coefficient for the predictor of interest but
does reduce the amount of unexplained variation (reduced residual
variance) and thus provides more precision. In logistic regression a
precision variable can improve precision but will also change the
magnitude of estimated odds ratios. This issue is raised in the course
notes on pages 84-85.

2(g) We see that controlling for clinic clearly impacts our inference.
The odds ratio estimate is larger and the significance (p-value) is
affected. We can quote the odds ratio obtained after stratification,
2.13, as indicating that within any clinic we would estimate that the
odds of a favorable outcome among treated subjects are about twice the
odds among untreated (control) subjects. However, we cannot use this
estimate alone to quote a success rate since we see that the clinic is
an important factor here. We may use the simple 2x2 table to say that
over all clinics we see a 42% success rate on treatment and a 33%
success rate for the control subjects. If we knew the clinic that the
subject was attending then we might modify our estimate to include that
information. We find that the observed rate of successful treatment
varies greatly from clinic to clinic, ranging from a low of 9% to a
high of 80%.

3. Q: Is iron associated with CHD?

In order to assess the association between iron (dichotomized at
>350mg) and CHD we will estimate the odds ratio comparing the odds of
disease among subjects with high iron consumption (>350mg) to the odds
of disease among subjects with low iron consumption (<=350mg). We will
also compute the odds ratio after adjusting for gender and age, as
these may be potential confounders.

Univariate summaries:

The data contain 570 controls and 338 subjects with CHD. Overall,
270/908 = 30% of subjects have iron consumption greater than
350mg/month. Also, 216 (or 24%) of subjects are under 50 years of age,
while 291 (32%) are aged 50-59, 302 (33%) are aged 60-69, and 99
subjects (11%) are aged 70 or older. Females comprise only 32% of the
subjects (291 women and 617 men).

Bivariate summaries:

First we consider associations between disease status and covariates.
We find that 35% of the cases have high iron consumption while only 27%
of the controls have high iron consumption. Also, the cases tend to be
older with 48% aged 60 or older as compared to 41% of controls. A small
fraction of the cases are women (only 13%) while women comprise 43% of
the controls. Thus, we see a large difference between the cases and
controls in terms of gender, a modest difference in terms of iron
consumption, and a small difference in ages.

Next we consider associations between the predictor of interest, iron
consumption, and the covariates age and gender.
The age distributions of subjects that consume low and high levels of
iron are quite similar as shown in the following table:

               |                     age
       newiron |     <= 49      50-59      60-69      >= 70 |     Total
---------------+--------------------------------------------+----------
<=350 mg/month |       142        203        215         78 |       638
               |     22.26      31.82      33.70      12.23 |    100.00
---------------+--------------------------------------------+----------
> 350 mg/month |        74         88         87         21 |       270
               |     27.41      32.59      32.22       7.78 |    100.00
---------------+--------------------------------------------+----------
         Total |       216        291        302         99 |       908
               |     23.79      32.05      33.26      10.90 |    100.00

However, we find that only 16% of subjects with high iron are female as
compared to 39% of subjects with low iron. Thus, iron consumption
appears to vary greatly with gender, yet appears not to be associated
with age.

Confirmatory Analysis:

The crude odds ratio comparing the odds of CHD among subjects with high
iron consumption to the odds of CHD among subjects with low iron
consumption is estimated as 1.475 (95% confidence interval: 1.09,
1.99). However, after adjusting for gender we obtain an adjusted odds
ratio of 1.07 (95% confidence interval: 0.79, 1.45). If we adjust for
both gender and age we obtain an adjusted odds ratio estimate of 1.10
with 95% confidence interval (0.81, 1.51). Thus, although we find a
significant crude association between disease and high iron
consumption, this association is greatly diminished and is not
statistically significant after controlling for gender. As we found in
the bivariate analysis above, gender is strongly associated with both
the exposure of interest and with CHD, and thus an analysis such as the
crude odds ratio that does not control for gender would lead to a
biased assessment of the impact of high iron consumption (i.e. a crude
comparison of NewIron=1 to NewIron=0 is comparing one group with 16%
women (NewIron=1) to another group with 39% women (NewIron=0) and thus
blurs the effects of gender and iron).
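
The crude odds ratio quoted above can be recovered from the marginal totals and percentages given in the summaries. The Python sketch below rests on an assumption: the 2x2 cell counts are not printed in this document, so they were reconstructed from 338 cases, 570 controls, 270 high-iron subjects, and the statement that about 35% of cases have high iron. The interval shown is a Woolf (log-based) interval, used here only for illustration; it need not agree exactly with the interval quoted in the text, which may have been computed by a different method.

```python
import math

# ASSUMPTION: reconstructed cell counts (not printed in the source).
# 118/338 = 34.9% of cases and 152/570 = 26.7% of controls have high iron,
# and 118 + 152 = 270 high-iron subjects in total.
case_high, case_low = 118, 220
ctrl_high, ctrl_low = 152, 418

crude_or = (case_high * ctrl_low) / (case_low * ctrl_high)

# Woolf standard error of log(OR) and the resulting 95% interval.
se_log_or = math.sqrt(1 / case_high + 1 / case_low
                      + 1 / ctrl_high + 1 / ctrl_low)
lower = math.exp(math.log(crude_or) - 1.96 * se_log_or)
upper = math.exp(math.log(crude_or) + 1.96 * se_log_or)

print("crude OR:", round(crude_or, 3))
print("Woolf 95% CI:", round(lower, 2), round(upper, 2))
```

The gender-adjusted estimates cannot be checked this way because the gender-specific tables are not included here.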