*************************** * Biostatistics 513 * Exercise Set 5, 2002 *************************** 1. On the course web page there is a STATA "do" file that lists the commands for this analysis. 1(a) Here is the fitted model (log odds ratios, regression coefficients): ------------------------------------------------------------------------------ case | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- Iagegp_2 | 1.766668 1.099572 1.607 0.108 -.3884536 3.921789 Iagegp_3 | 3.552665 1.056528 3.363 0.001 1.481907 5.623422 Iagegp_4 | 4.082665 1.05282 3.878 0.000 2.019177 6.146154 Iagegp_5 | 4.619252 1.06237 4.348 0.000 2.537045 6.701458 Iagegp_6 | 4.654789 1.112717 4.183 0.000 2.473905 6.835674 alc | .0259659 .0025951 10.006 0.000 .0208797 .0310521 tob | .0404762 .0078778 5.138 0.000 .025036 .0559163 _cons | -7.160526 1.084426 -6.603 0.000 -9.285961 -5.035091 ------------------------------------------------------------------------------ Note: In the homework key for exercise set #4, question 2(g), we predicted that the alc coefficient would be b1/40 = 1.093/40 = 0.0273. Not a bad guess since the real estimate (above) is 0.0260! The interpretation of the alc coefficient is the increase in the log odds of disease for a 1g/day increase in alcohol consumptions where other factors (age, tobacco) remain constant. From this we see that a 10g/day increase (approx 1 drink/day) leads to an increase of 10*0.0260 = 0.260, corresponding to an odds ratio of exp( 0.260 ) = 1.30. Similarly the tob coefficient estimates the increase in the log odds of disease associated with a 1g/day increase in tobacco consumption. 1(b) The following table can help calculate the likelihood ratio tests: model # parameters -2*logL ------------------------------------------------------- (1) age + alc + tob 8 -347.74 (2) age + alc + tob + alc^2 9 -347.55 (3) age + alc + tob + tob^2 9 -346.97 A comparison of model 1 to model 2 can be used to test whether the coefficient for alc^2 is zero. We obtain a likelihood ratio test of 2*(-347.5 - -347.74) = 0.38 with a p-value P[ chi(1) > 0.38 ] = 0.5396 indicating that we would not reject the null hypotheses and conclude that alc^2 is not significant. Similarly we obtain a likelihood ratio test of 2*(-346.97 - -347.74) = 1.55 with a p-value of P[ chi(1) > 1.55 ] = 0.2138. We conclude that the tob^2 term is not significant. 1(optional) First we want to to some basic descriptive statistics on the component alcohol variables. I calculated means and medians for each drink, and calculated means and medians for the cases and controls: mean 10% 25% 50% 75% 90% Beer control 5.706 0 0 0 5 25 case 9.075 0 0 0 2 39.8 Wine control 15.491 0 0 2 27 47 case 30.76 0 0 15.5 53 89.4 Cider control 17.831 0 2 10 26 48 case 34.345 0 3.25 22 50.75 82 Aperitif control 1.091 0 0 1 1 3 case 1.090 0 0 0 1 2 Digestive control 4.294 0 0 0 4 14 case 9.754 0 0 3 14 27 From these descriptives we see that the cases report higher levels of consumption for each component, with the exception of aperitifs. We also see that the dose of consumption is lower for both digestifs and aperitifs. Finally, we note that the percentiles reveal that for each component there is a moderate to large fraction that consume 0g/day of this item. The following output shows the logistic regression model coefficients where the total alcohol (alc) is separated into the components that sum to make this total: ------------------------------------------------------------------------------ case | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- Iagegp_2 | 1.851368 1.118682 1.655 0.098 -.3412084 4.043945 Iagegp_3 | 3.638629 1.077053 3.378 0.001 1.527643 5.749615 Iagegp_4 | 4.181356 1.076306 3.885 0.000 2.071836 6.290876 Iagegp_5 | 4.759659 1.087517 4.377 0.000 2.628164 6.891154 Iagegp_6 | 4.802468 1.138712 4.217 0.000 2.570633 7.034302 tob | .0412979 .0082171 5.026 0.000 .0251928 .057403 beer | .0262809 .0055056 4.774 0.000 .0154902 .0370716 wine | .0288202 .0038235 7.538 0.000 .0213263 .0363141 cider | .0314869 .0039517 7.968 0.000 .0237417 .039232 aperitif | -.0684994 .044496 -1.539 0.124 -.1557099 .0187111 digest | .0132737 .0084572 1.570 0.117 -.0033022 .0298495 _cons | -7.296944 1.108967 -6.580 0.000 -9.470479 -5.123409 ------------------------------------------------------------------------------ These estimates are remarkably close to the aggregate alcohol estimate (0.260) suggesting that it is the alcohol content that leads to the association with disease. The estimate for digestive is smaller, although the common value of 0.260 is within the 95% confidence interval. The aperitif coefficient is the exception, but our earlier summary of this variable indicates that it is infrequently consumed in this population. Note: since the total alcohol variable is the sum of these 5 components, the model with alcohol total is actually nested within this richer model. For the total alcohol model we are assuming that the coefficient for each of these 5 components is equal to a common value. Thus, we reduce the 5 parameters to a single parameter (a change in degrees of freedom of df=4). We can then formally compare the two models: 2*( logL1 - logL2 ) = 2*( -343.35 - -347.55 ) = 8.790 The corresponding p-value is P[ chi(4) > 8.790 ] = 0.066. Therefore, we have moderate (but non-significant) evidence for a difference in the 5 coefficients. (optional) Here is the logistic regression model that drops the age dummy variables: ------------------------------------------------------------------------------ case | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- alc | .0249946 .0023573 10.603 0.000 .0203744 .0296148 tob | .0285065 .0067586 4.218 0.000 .0152598 .0417532 _cons | -3.288127 .2049569 -16.043 0.000 -3.689835 -2.886418 ------------------------------------------------------------------------------ We see that coefficient of alc has changed (0.0260 - 0.0250)/0.0260 = 3.8%. We see that coefficient of tob has changed (0.0405 - 0.0285)/0.0405 = 29.6%. Thus, the impact of age adjustment is large for the tobacco exposure. 2. The Framingham data. (a) For this initial analysis I created categories for each of the continuous covariates. I made some initial guesses for categorization based on the range of the measurements and my (minimal?) knowledge of these measurements. However, for several of the measurements I later modified the groupings to obtain a more even distribution of subjects across the categories. The goal for categorization is to obtain meaningful interval ranges based on the nature of the measurement (ie. what are normal ranges) and to obtain roughly equally sized groups. Below shows the groupings that I finally used. Given the grouped data I used the "tabodds" command to characterize the odds ratio pattern, and used logistic regression with a plot of the fitted log odds to characterize the potential form (linear, quadratic) for subsequent logistic regression. *** BMI *** . table Cbmi chd ----------+----------- | death from | CHD Cbmi | 0 1 ----------+----------- <22 | 124 23 22-26 | 302 62 26-30 | 240 67 30+ | 66 25 ----------+----------- . tabodds chd Cbmi, or ------------+------------------------------------------------------------- Cbmi | Odds ratio chi2 P>chi2 [95% Conf. Interval] ------------+------------------------------------------------------------- <22 | 1.000000 . . . . 22-26 | 1.106824 0.14 0.7034 0.656208 1.866877 26-30 | 1.505072 2.38 0.1228 0.892246 2.538811 30+ | 2.042161 4.86 0.0275 1.067982 3.904953 ------------+------------------------------------------------------------- Test of homogeneity (equal odds): chi2(3) = 7.54 Pr>chi2 = 0.0565 Score test for trend of odds: chi2(1) = 6.94 Pr>chi2 = 0.0084 logistic regression: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- ICbmi_2 | .1014947 .2664265 0.381 0.703 -.4206916 .623681 ICbmi_3 | .408841 .2657721 1.538 0.124 -.1120627 .9297448 ICbmi_4 | .7140084 .3266409 2.186 0.029 .073804 1.354213 _cons | -1.684787 .2270303 -7.421 0.000 -2.129759 -1.239816 ----------------------------------------------------------------------------- *** AGE *** . table Cage chd ----------+----------- | death from | CHD Cage | 0 1 ----------+----------- < 45 | 226 41 45-50 | 174 35 50-55 | 153 48 55 + | 179 53 ----------+----------- . tabodds chd Cage, or ------------+------------------------------------------------------------- Cage | Odds ratio chi2 P>chi2 [95% Conf. Interval] ------------+------------------------------------------------------------- < 45 | 1.000000 . . . . 45-50 | 1.108775 0.17 0.6813 0.677180 1.815443 50-55 | 1.729316 5.40 0.0201 1.083224 2.760771 55 + | 1.632102 4.54 0.0330 1.035590 2.572214 ------------+------------------------------------------------------------- Test of homogeneity (equal odds): chi2(3) = 8.04 Pr>chi2 = 0.0452 Score test for trend of odds: chi2(1) = 6.57 Pr>chi2 = 0.0104 logistic regression: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- ICage_2 | .1032557 .251264 0.411 0.681 -.3892126 .595724 ICage_3 | .547726 .2370323 2.311 0.021 .0831513 1.012301 ICage_4 | .489869 .2308019 2.122 0.034 .0375055 .9422325 _cons | -1.706963 .1697499 -10.056 0.000 -2.039667 -1.374259 ------------------------------------------------------------------------------ *** SMOKE *** . table Csmoke chd ----------+----------- | death from | CHD Csmoke | 0 1 ----------+----------- none | 258 66 1-15 | 127 33 16-30 | 230 42 30+ | 117 36 ----------+----------- . tabodds chd Csmoke, or ------------+------------------------------------------------------------- Csmoke | Odds ratio chi2 P>chi2 [95% Conf. Interval] ------------+------------------------------------------------------------- none | 1.000000 . . . . 1-15 | 1.015748 0.00 0.9480 0.635314 1.623990 16-30 | 0.713834 2.42 0.1200 0.465764 1.094027 30+ | 1.202797 0.62 0.4327 0.757861 1.908953 ------------+------------------------------------------------------------- Test of homogeneity (equal odds): chi2(3) = 4.72 Pr>chi2 = 0.1933 Score test for trend of odds: chi2(1) = 0.01 Pr>chi2 = 0.9131 logistic regression: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- ICsmok_2 | .0156253 .2391747 0.065 0.948 -.4531485 .4843991 ICsmok_3 | -.3371048 .2172207 -1.552 0.121 -.7628496 .0886399 ICsmok_4 | .1846498 .2352706 0.785 0.433 -.2764721 .6457718 _cons | -1.363305 .1379401 -9.883 0.000 -1.633663 -1.092947 ------------------------------------------------------------------------------ *** CHOL1 *** . table Cchol1 chd ----------+----------- | death from | CHD Cchol1 | 0 1 ----------+----------- <=200 | 261 47 201-250 | 313 76 251-300 | 129 38 300+ | 29 16 ----------+----------- . tabodds chd Cchol1, or ------------+------------------------------------------------------------- Cchol1 | Odds ratio chi2 P>chi2 [95% Conf. Interval] ------------+------------------------------------------------------------- <=200 | 1.000000 . . . . 201-250 | 1.348379 2.16 0.1415 0.903850 2.011534 251-300 | 1.635824 4.13 0.0421 1.012850 2.641970 300+ | 3.063830 11.00 0.0009 1.526351 6.149998 ------------+------------------------------------------------------------- Test of homogeneity (equal odds): chi2(3) = 12.04 Pr>chi2 = 0.0072 Score test for trend of odds: chi2(1) = 10.59 Pr>chi2 = 0.0011 logistic regression: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- ICchol_2 | .298903 .2036192 1.468 0.142 -.1001834 .6979893 ICchol_3 | .4921466 .2432606 2.023 0.043 .0153645 .9689287 ICchol_4 | 1.119666 .3494148 3.204 0.001 .4348253 1.804506 _cons | -1.714373 .1584551 -10.819 0.000 -2.024939 -1.403807 ------------------------------------------------------------------------------ *** DBP *** . table Cdbp chd ----------+----------- | death from | CHD Cdbp | 0 1 ----------+----------- 50-80 | 147 26 80-100 | 458 95 100-120 | 110 45 120+ | 17 11 ----------+----------- . tabodds chd Cdbp, or ------------+------------------------------------------------------------- Cdbp | Odds ratio chi2 P>chi2 [95% Conf. Interval] ------------+------------------------------------------------------------- 50-80 | 1.000000 . . . . 80-100 | 1.172741 0.44 0.5081 0.731220 1.880857 100-120 | 2.312937 9.42 0.0021 1.332845 4.013727 120+ | 3.658371 9.39 0.0022 1.504173 8.897701 ------------+------------------------------------------------------------- Test of homogeneity (equal odds): chi2(3) = 20.06 Pr>chi2 = 0.0002 Score test for trend of odds: chi2(1) = 16.41 Pr>chi2 = 0.0001 logistic regression: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- ICdbp_2 | .1593438 .2407779 0.662 0.508 -.3125722 .6312597 ICdbp_3 | .8385182 .2767262 3.030 0.002 .2961448 1.380892 ICdbp_4 | 1.297018 .4415844 2.937 0.003 .4315284 2.162508 _cons | -1.732336 .212754 -8.142 0.000 -2.149326 -1.315346 ------------------------------------------------------------------------------ *** SBP *** . table Csbp chd ----------+----------- | death from | CHD Csbp | 0 1 ----------+----------- <120 | 93 12 120-140 | 293 51 140-160 | 228 61 160-200 | 103 41 200+ | 15 12 ----------+----------- . tabodds chd Csbp, or ------------+------------------------------------------------------------- Csbp | Odds ratio chi2 P>chi2 [95% Conf. Interval] ------------+------------------------------------------------------------- <120 | 1.000000 . . . . 120-140 | 1.348976 0.77 0.3809 0.688875 2.641606 140-160 | 2.073465 4.77 0.0290 1.061877 4.048733 160-200 | 3.084951 10.49 0.0012 1.503874 6.328272 200+ | 6.200000 15.62 0.0001 2.200841 17.466053 ------------+------------------------------------------------------------- Test of homogeneity (equal odds): chi2(4) = 27.71 Pr>chi2 = 0.0000 Score test for trend of odds: chi2(1) = 25.57 Pr>chi2 = 0.0000 logistic regression: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- ICsbp_2 | .2993459 .3422078 0.875 0.382 -.3713691 .9700608 ICsbp_3 | .7292211 .3389171 2.152 0.031 .0649559 1.393486 ICsbp_4 | 1.126536 .3580284 3.146 0.002 .4248131 1.828259 ICsbp_5 | 1.824549 .49405 3.693 0.000 .8562292 2.792869 _cons | -2.047693 .3067334 -6.676 0.000 -2.648879 -1.446506 ------------------------------------------------------------------------------ SUMMARY: We find a monotone increasing risk for age, bmi, cholesterol, and blood pressure. We find a non-linear pattern with smoking where there is a decreased (but not significant) risk for intermediate exposure. We find significant trend tests for each covariate except smoking. (b) The following table summarizes the logistic regressions that used each covariate individually, and gives the likelihood ratio test associated with the addition of a quadratic or a square root term: variable coef. s.e. Z LR-quadratic LR-sqrt AGE 0.0358 (0.0133) 2.683 1.31 (p=0.25) 1.32 (p=0.25) BMI 0.0624 (0.0240) 2.598 0.10 (p=0.75) 0.14 (p=0.71) SMOKE 0.0001 (0.0061) 0.014 1.44 (p=0.23) 1.10 (p=0.30) CHOL1 0.0070 (0.0019) 3.700 0.18 (p=0.67) 0.33 (p=0.56) DBP 0.0320 (0.0062) 5.125 0.47 (p=0.49) 0.40 (p=0.53) SBP 0.0191 (0.0035) 5.524 0.28 (p=0.60) 0.28 (p=0.60) From these analyses we find that a linear logistic model appears adequate for each covariate. Again we find that smoking is the exception. The (main effect) smoking coefficient is not significant, but there is some weak evidence for a non-linear relationship. This is counter to my intuition since our graphical exploration shows that the quadratic model allows a lower disease risk for subjects with some smoking relative to subjects with no smoking. The likelihood ratio tests assess whether the addition of a quadratic term (or a square root term) yield a substantial improvement in the likelihood (fit to the data). We find no significant departures from linearity. (optional) See the STATA do file for analyses that used linear splines and fractional polynomial regression. Graphically, these models generally agree with the quality of the dose-response as characterized by the quadratic regression models. That is, the splines either confirmed that a linear logistic relationship was approximately correct, or confirmed evidence for some (usually not much) curvature as show using X^2. These additional more flexible models help reassure that using the covariates as continuous linear terms in our multivariate logistic regression model should be adequate. (c) Since I remained somewhat unsure of the smoking variable I fit the multivariate model with smoking and smoking^2. This attempts to see if there is evidence of a relationship *after* controlling for the other covariates. The fitted model yields: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- age | .032678 .0146261 2.234 0.025 .0040115 .0613446 bmi | .0163449 .0265879 0.615 0.539 -.0357664 .0684563 chol1 | .0055988 .0019446 2.879 0.004 .0017875 .0094102 dbp | .0148751 .0101179 1.470 0.142 -.0049556 .0347059 sbp | .0098526 .0055761 1.767 0.077 -.0010763 .0207814 smoke1c | -.0006565 .0096152 -0.068 0.946 -.0195019 .018189 smoke2c | .0003653 .0003833 0.953 0.340 -.0003859 .0011165 _cons | -7.58881 1.11913 -6.781 0.000 -9.782264 -5.395356 ------------------------------------------------------------------------------ Quadratic term is not significant based on Wald or LR test (not shown). Simplifying by removing the quadratic term yields: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- age | .0329705 .0146135 2.256 0.024 .0043286 .0616124 bmi | .0196026 .0263294 0.745 0.457 -.032002 .0712072 chol1 | .0055802 .0019441 2.870 0.004 .0017698 .0093906 dbp | .0147605 .0101078 1.460 0.144 -.0050505 .0345715 sbp | .0098348 .0055763 1.764 0.078 -.0010945 .020764 smoke1c | .0061376 .0064327 0.954 0.340 -.0064702 .0187454 _cons | -7.619579 1.117239 -6.820 0.000 -9.809327 -5.429831 ------------------------------------------------------------------------------ Linear term for smoking is not significant; further simplification yields: ------------------------------------------------------------------------------ chd | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- age | .0303175 .0143369 2.115 0.034 .0022177 .0584174 bmi | .0172298 .0262211 0.657 0.511 -.0341626 .0686222 chol1 | .0057144 .0019353 2.953 0.003 .0019214 .0095074 dbp | .0143368 .0101008 1.419 0.156 -.0054603 .0341339 sbp | .0097978 .0055709 1.759 0.079 -.001121 .0207165 _cons | -7.389625 1.089318 -6.784 0.000 -9.524648 -5.254601 ------------------------------------------------------------------------------ Since smoking is an easy to obtain measurement and since we have a relatively large data set. And since I have apriori evidence (bias?) I will leave smoking in the model as a linear term. I will also leave in BMI -- but we may wonder why BMI was associated with CHD in our earlier single-predictor analysis yet find it not to be significant in our multvariate analysis. Further analysis shows that BMI is correlated with SMOKE, SBP, DBP, and CHOL1. Finally, don't be deceived by the p-values for DBP and SBP -- these two variables are correlated with each other -- but the log likelihood changes by 10.7 when both are removed -- this (times 2) is a significant likelihood ratio test on 2 degrees of freedom (p<0.001). Thus blood pressure is important. (d) Here are a few subjects: (first 2) Observation 1 pfit .6881639 chd 0 age 59 bmi 28.35671 chol1 246 dbp 130 sbp 260 smoke 20 Observation 2 pfit .6434307 chd 0 age 56 bmi 31.85398 chol1 309 dbp 126 sbp 216 smoke 15 These subjects are older (50+), with high BMI (both clinically obese or nearly so), with high blood pressure, and high cholesterol. The model predicts that these subjects have a grerater than 60% risk of death due to CHD in the subsequent 30 years. However, neither died from CHD!. The first subject died of other causes and the second subject died of stroke.