***************************
* Biostatistics 513
* Exercise Set 1, 2002
***************************

1. For this question I used the "csi" command and added the "or" option so
that an odds ratio would also be calculated. This procedure returns all of
the statistics that are desired for (a) through (e):

. csi 61 27 75 312, or

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |        61          27  |         88
        Noncases |        75         312  |        387
-----------------+------------------------+----------
           Total |       136         339  |        475
                 |                        |
            Risk |  .4485294     .079646  |   .1852632
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |         .3688834       |   .2804678    .457299
      Risk ratio |         5.631536       |   3.748488   8.460531
 Attr. frac. ex. |         .8224286       |   .7332257   .8818041
 Attr. frac. pop |         .5700925       |
      Odds ratio |         9.398519       |   5.610924   15.73917  (Cornfield)
                 +-----------------------------------------------
                               chi2(1) =    87.50  Pr>chi2 = 0.0000

1(a) This sample is a cross-sectional sample and so we can either use a
chi-square test of independence, or a chi-square test of the homogeneity of
HIV risk across the IV drug use strata (groups). The one issue to recall
with this design is that since we have a cross-sectional sample we can talk
about the probability, or risk, of *having* disease (HIV+) rather than the
risk of *acquiring* disease. Stated equivalently, we can talk about disease
prevalence, not incidence.

Define:

    p1 = probability of HIV+ if IV drug user
    p2 = probability of HIV+ if not an IV drug user

    H0: p1 = p2
    H1: p1 not equal to p2

Pearson's chi-square test yields X2=87.5 with 1 degree of freedom. The
p-value, i.e. the probability of observing data as or more extreme (further
from the null) if the null hypothesis were true, is <0.001. Therefore, we
conclude that the risk of HIV+ does depend on whether women report IV drug
use or not.
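The csi point estimates and the chi-square statistic can also be recomputed
by hand from the four cell counts. The following is only a minimal sketch in
Python, not part of the Stata session: the helper name two_by_two is
hypothetical, and the interval formulas (a Wald interval for the risk
difference, a log-scale interval for the risk ratio) are assumptions about
what csi reports, although they appear to agree with the limits shown above
to about three decimals. The Cornfield interval for the odds ratio is not
reproduced.

```python
import math


def two_by_two(a, b, c, d, z=1.959964):
    """Summarize a 2x2 table given in csi order (hypothetical helper).

    a = exposed cases,    b = unexposed cases,
    c = exposed noncases, d = unexposed noncases.
    """
    n1, n0 = a + c, b + d          # exposed and unexposed totals
    p1, p0 = a / n1, b / n0        # risk (here: prevalence) in each group

    # Risk difference with a Wald interval (assumed to match csi).
    rd = p1 - p0
    se_rd = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)

    # Risk ratio with an interval built on the log scale (assumed).
    rr = p1 / p0
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)

    # Odds ratio as the cross-product ratio (point estimate only).
    odds_ratio = (a * d) / (b * c)

    # Pearson chi-square for a 2x2 table, 1 degree of freedom.
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * n1 * n0)
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi2(1)

    return {
        "risk_exposed": p1,
        "risk_unexposed": p0,
        "rd": rd,
        "rd_ci": (rd - z * se_rd, rd + z * se_rd),
        "rr": rr,
        "rr_ci": (rr * math.exp(-z * se_log_rr), rr * math.exp(z * se_log_rr)),
        "or": odds_ratio,
        "chi2": chi2,
        "p_value": p_value,
    }


# Same argument order as ". csi 61 27 75 312, or"
result = two_by_two(61, 27, 75, 312)
for name, value in result.items():
    print(name, value)
```

Writing the formulas out this way makes it easy to see that all of the
summaries are functions of the same two estimated prevalences, 61/136 and
27/339.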
1(b) The risk difference is the difference between the probability of HIV+
among the IVDU women and the probability of HIV+ among the non-IVDU women.
From the table we see that we estimate p1 as p1hat=61/136 and p2 as
p2hat=27/339. The risk difference is estimated as

    p1hat - p2hat = 0.449 - 0.080 = 0.369.

A 95% confidence interval for the risk difference is (0.280, 0.457),
capturing values of the risk difference, p1-p2, that are consistent with the
observed data. A summary sentence may be:

"Among women that report IV drug use we find 61/136 = 44.9% seropositive
while among women that do not report IV drug use we find 27/339 = 8.0%
seropositive. The increased risk associated with IV drug use can be
summarized by the risk difference 44.9% - 8.0% = 36.9% (95% confidence
interval 28.0%, 45.7%). These results suggest a relatively high prevalence
of HIV among incarcerated women that do not report IV drug use (8%), but an
additional 36.9% are seropositive among the women that do report IV drug
use."

1(c) The risk ratio is simply the probability of HIV+ among the IVDU women
relative to the probability of HIV+ among the non-IVDU women. This is
estimated using the estimates p1hat and p2hat by

    RRhat = p1hat/p2hat = (61/136)/(27/339) = 5.632

with a 95% confidence interval (3.748, 8.461). A summary sentence might look
quite similar to the previous:

"Among women that report IV drug use we find 61/136 = 44.9% seropositive
while among women that do not report IV drug use we find 27/339 = 8.0%
seropositive. The increased risk associated with IV drug use can be
summarized by the risk ratio (61/136) / (27/339) = 5.632 (95% confidence
interval 3.748, 8.461). These results suggest a relatively high prevalence
of HIV among incarcerated women that do not report IV drug use (8%), but
more than a 5-fold increase in the likelihood of being seropositive among
women that also report IV drug use."

1(d) The odds ratio is similar to the risk ratio with the difference being
that the odds (i.e. p/(1-p)) are compared for the two groups. Again this is
estimated as follows:

    estimated odds among IVDU    : p1hat/(1-p1hat) = 61/75  = 0.813
    estimated odds among non-IVDU: p2hat/(1-p2hat) = 27/312 = 0.0865

    estimated odds ratio comparing IVDU to non-IVDU:
                                   0.813/0.0865 = 9.399

A 95% confidence interval for the odds ratio is (5.611, 15.74). A summary
sentence might look like:

"Among women that report IV drug use we find 61/136 = 44.9% seropositive
while among women that do not report IV drug use we find 27/339 = 8.0%
seropositive. The increased risk associated with IV drug use can be
summarized by the odds ratio 9.399 (95% confidence interval 5.611, 15.74)
that compares the odds of seropositivity among IVDU women relative to the
odds among non-IVDU women. These results suggest a relatively high
prevalence of HIV among incarcerated women that do not report IV drug use
(8%), but more than a 9-fold increase in the odds of being seropositive
among women that also report IV drug use."

2. This question focuses on the three variables: ICgroup, nurse0, and
nurse6.

(a) Q: Did randomization work? To answer this question we compare the
baseline measurements of the ICgroup==0 and the ICgroup==1 groups.
Specifically, for the nurse0 item we can use "tabulate ICgroup nurse0, row
chi" and obtain:

  Informed | allocation knowledge
   Consent |        at t=0
     group |         0          1 |     Total
-----------+----------------------+----------
         0 |       192        308 |       500
           |     38.40      61.60 |    100.00
-----------+----------------------+----------
         1 |       191        309 |       500
           |     38.20      61.80 |    100.00
-----------+----------------------+----------
     Total |       383        617 |      1000
           |     38.30      61.70 |    100.00

          Pearson chi2(1) =   0.0042   Pr = 0.948

From this summary we find that 61.6% of the control group answered correctly
at baseline and 61.8% of the intervention group answered correctly. Groups
are comparable at baseline (as would be expected by randomization).

(b) Compare the 6 month response for the two groups.
For this we can use either "cs" or "cc" to obtain the inference that could
be used to formally compare the two groups at the 6 month follow-up visit.

. tabulate ICgroup nurse6, row chi

  Informed | allocation knowledge
   Consent |        at t=6
     group |         0          1 |     Total
-----------+----------------------+----------
         0 |       174        326 |       500
           |     34.80      65.20 |    100.00
-----------+----------------------+----------
         1 |        83        417 |       500
           |     16.60      83.40 |    100.00
-----------+----------------------+----------
     Total |       257        743 |      1000
           |     25.70      74.30 |    100.00

          Pearson chi2(1) =  43.3671   Pr = 0.000

And,

. cs nurse6 ICgroup

                 | Informed Consent group |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |       417         326  |        743
        Noncases |        83         174  |        257
-----------------+------------------------+----------
           Total |       500         500  |       1000
                 |                        |
            Risk |      .834        .652  |       .743
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |           .182         |     .12902     .23498
      Risk ratio |         1.279141       |   1.186676   1.378811
 Attr. frac. ex. |         .2182254       |     .15731   .2747375
 Attr. frac. pop |         .1224764       |
                 +-----------------------------------------------
                               chi2(1) =    43.37  Pr>chi2 = 0.0000

From these displays we see that among the ICgroup==1 subjects 83.4%
correctly answered the nurse item while only 65.2% of the control group
answered correctly. The 18.2% improvement attributable to the intervention
is statistically significant (95% confidence interval 12.9%, 23.5%).

(c) Another analysis uses the pre/post data for the intervention group. To
obtain McNemar's test and the associated odds ratio we use "mcc".

. tabulate nurse0 nurse6 if ICgroup==1

allocation | allocation knowledge
 knowledge |        at t=6
    at t=0 |         0          1 |     Total
-----------+----------------------+----------
         0 |        45        146 |       191
         1 |        38        271 |       309
-----------+----------------------+----------
     Total |        83        417 |       500

. mcc nurse6 nurse0 if ICgroup==1

                 | Controls               |
Cases            |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
         Exposed |       271         146  |        417
       Unexposed |        38          45  |         83
-----------------+------------------------+----------
           Total |       309         191  |        500

McNemar's chi2(1) =     63.39    Pr>chi2 = 0.0000
Exact McNemar significance probability   = 0.0000

Proportion with factor
        Cases        .834
        Controls     .618       [95% conf. interval]
                   ---------    --------------------
        difference   .216        .1643124   .2676876
        ratio        1.349515    1.253175    1.45326
        rel. diff.   .565445     .4736866   .6572035

        odds ratio   3.842105    2.672885   5.645144   (exact)

Note that the labels "cases" and "controls" aren't appropriate for our
analysis. Here the "cases" would be the observations taken at month 6, and
the "controls" would be the observations taken at baseline. Using McNemar's
test allows us to assess the null hypothesis: Among the ICgroup==1 subjects
the probability of answering correctly at month 6 is the same as the
probability of answering correctly at baseline. We obtain a chi-square of
63.4 and therefore reject the null with p<0.001.

(d) Let's focus on who understood the knowledge item at follow-up. Did
intervention "correct" the subjects that answered incorrectly at baseline
and/or "reinforce" those subjects that answered correctly at baseline?
First let's consider the subjects that answered incorrectly at baseline:

. cs nurse6 ICgroup if nurse0==0

                 | Informed Consent group |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |       146          91  |        237
        Noncases |        45         101  |        146
-----------------+------------------------+----------
           Total |       191         192  |        383
                 |                        |
            Risk |  .7643979    .4739583  |    .618799
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |         .2904396       |   .1976471    .383232
      Risk ratio |         1.612796       |   1.362649   1.908863
 Attr. frac. ex. |         .3799586       |   .2661352   .4761279
 Attr. frac. pop |         .2340673       |
                 +-----------------------------------------------
                               chi2(1) =    34.24  Pr>chi2 = 0.0000

From this summary we find that of the 191 subjects that answered incorrectly
at baseline in the intervention group, 146/191 = 76.4% answered correctly at
follow-up. In the control group only 91/192 = 47.4% of the subjects that
answered incorrectly at baseline answered correctly at follow-up. Therefore,
intervention improved the "correction" of incorrect understanding by a
statistically significant 29.0%.

(e) Among subjects that answered correctly at baseline:

. cs nurse6 ICgroup if nurse0==1

                 | Informed Consent group |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |       271         235  |        506
        Noncases |        38          73  |        111
-----------------+------------------------+----------
           Total |       309         308  |        617
                 |                        |
            Risk |  .8770227     .762987  |   .8200972
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |         .1140356       |   .0540666   .1740047
      Risk ratio |         1.149459       |   1.066456   1.238923
 Attr. frac. ex. |         .1300259       |   .0623151   .1928473
 Attr. frac. pop |         .0696384       |
                 +-----------------------------------------------
                               chi2(1) =    13.60  Pr>chi2 = 0.0002

Here we also find an effect of the intervention. Among the 309 intervention
subjects that answered correctly at baseline, 271/309 = 87.7% answered
correctly at follow-up compared to only 235/308 = 76.3% of the control
subjects. A confidence interval for the risk difference is (5.4%, 17.4%)
indicating a significant impact of intervention even among subjects that
answered correctly at baseline.

3. Regression thinking.

(a) In question 2(e) we can use the regression model:

    P( nurse6==1 ) = E[ nurse6 ] = beta0 + beta1*ICgroup

where we keep in mind that we are restricting to those subjects that
answered incorrectly at baseline (i.e. nurse0==0).
In this model we have:

    beta0 = the proportion answering correctly at month 6 in the control
            group (among subjects that answered incorrectly at baseline)

    beta1 = the proportion answering correctly at month 6 in the
            intervention group MINUS the proportion answering correctly at
            month 6 in the control group (among subjects that answered
            incorrectly at baseline)

Thus beta1 is the "risk difference" among nurse0==0 subjects.

(b) In question 2(f) we can use the regression model:

    P( nurse6==1 ) = E[ nurse6 ] = beta0 + beta1*ICgroup

where we keep in mind that we are restricting to those subjects that
answered correctly at baseline (i.e. nurse0==1). In this model we have:

    beta0 = the proportion answering correctly at month 6 in the control
            group (among subjects that answered correctly at baseline)

    beta1 = the proportion answering correctly at month 6 in the
            intervention group MINUS the proportion answering correctly at
            month 6 in the control group (among subjects that answered
            correctly at baseline)

Thus beta1 is the "risk difference" among nurse0==1 subjects.

(c) We can combine these as:

    P( nurse6==1 ) = beta0 + beta1*ICgroup + beta2*nurse0
                           + beta3*nurse0*ICgroup

In this model we have:

    beta0 = the proportion answering correctly at month 6 in the control
            group (among subjects that answered incorrectly at baseline)

    beta1 = the proportion answering correctly at month 6 in the
            intervention group MINUS the proportion answering correctly at
            month 6 in the control group (among subjects that answered
            incorrectly at baseline)

    beta0 + beta2 = the proportion answering correctly at month 6 in the
            control group among subjects that answered correctly at
            baseline

    beta1 + beta3 = the proportion answering correctly at month 6 in the
            intervention group MINUS the proportion answering correctly at
            month 6 in the control group among subjects that answered
            correctly at baseline

So we find beta1 to be the "treatment effect" among subjects with nurse0==0,
and beta1+beta3 to be the treatment effect among subjects with nurse0==1.
Here "treatment effect" refers to the average response among intervention
subjects minus the average response among control subjects. Alternatively,

    beta2 = for the control group, the difference in the probability of
            correctly answering nurse6 comparing subjects that answered
            correctly at baseline, nurse0==1, to subjects that answered
            incorrectly at baseline, nurse0==0

    beta3 = the difference in the effect of treatment comparing subjects
            that answered correctly at baseline, nurse0==1, to subjects
            that answered incorrectly at baseline, nurse0==0

Again, the "effect of treatment" specifically means the difference between
the proportion answering correctly for the intervention subjects
(ICgroup==1) and the proportion answering correctly for the control subjects
(ICgroup==0).

(d) The standard assumptions are:

    Linearity -- not a concern since using 0/1 predictor variables.

    Independence -- data are independent since using a single outcome
    (nurse6) for each subject.

    Normality -- clearly the data are not normal. However, this assumption
    is not necessary for application to large data sets (more than 100
    observations).

    Equal variance -- the errors, y-(X beta), do not have equal variances
    with binary data.

Therefore, we clearly violate two of these assumptions -- and will develop
logistic regression methods to allow regression analysis with binary
response data.
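The interpretation of the coefficients in 3(c) can be checked against the
counts from the two stratified "cs" tables in question 2. The model has four
coefficients for four ICgroup-by-nurse0 cells, so each coefficient is a
simple function of the four observed proportions. The following is a minimal
sketch in Python under that reasoning, not a fitted regression; the variable
names (cells, p, var) are hypothetical. It also prints p*(1-p) for each
cell, the variance of a binary response with success probability p, to
illustrate the unequal-variance point in 3(d).

```python
# (number correct at month 6, group size), keyed by (ICgroup, nurse0).
# Counts are taken from the two stratified cs tables in question 2.
cells = {
    (0, 0): (91, 192),    # control,      incorrect at baseline
    (1, 0): (146, 191),   # intervention, incorrect at baseline
    (0, 1): (235, 308),   # control,      correct at baseline
    (1, 1): (271, 309),   # intervention, correct at baseline
}

# Observed proportion answering correctly at month 6 in each cell.
p = {key: correct / total for key, (correct, total) in cells.items()}

# Coefficients implied by the four cell proportions.
beta0 = p[(0, 0)]
beta1 = p[(1, 0)] - p[(0, 0)]                           # effect if nurse0==0
beta2 = p[(0, 1)] - p[(0, 0)]                           # baseline gap, controls
beta3 = (p[(1, 1)] - p[(0, 1)]) - (p[(1, 0)] - p[(0, 0)])  # change in effect

print("beta0         =", round(beta0, 4))
print("beta1         =", round(beta1, 4))
print("beta2         =", round(beta2, 4))
print("beta3         =", round(beta3, 4))
print("beta1 + beta3 =", round(beta1 + beta3, 4))       # effect if nurse0==1

# Binary-outcome variance p*(1-p) differs from cell to cell.
var = {key: prob * (1 - prob) for key, prob in p.items()}
for key in sorted(var):
    print("ICgroup, nurse0 =", key, " variance =", round(var[key], 4))
```

Here beta1 and beta1+beta3 match the two risk differences reported by "cs"
(.2904396 and .1140356), and the cell variances differ noticeably across the
four groups.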