***************************
* Biostatistics 513 * Exercise Set 1, 2002
***************************
1. For this question I used the "csi" command and added the "or" option so
that an odds ratio would also be calculated. This procedure returns all of the
statistics that are desired for (a) through (e):
. csi 61 27 75 312, or
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |        61          27  |         88
        Noncases |        75         312  |        387
-----------------+------------------------+----------
           Total |       136         339  |        475
                 |                        |
            Risk |  .4485294     .079646  |   .1852632
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |         .3688834       |  .2804678     .457299
      Risk ratio |         5.631536       |  3.748488    8.460531
 Attr. frac. ex. |         .8224286       |  .7332257    .8818041
 Attr. frac. pop |         .5700925       |
      Odds ratio |         9.398519       |  5.610924    15.73917 (Cornfield)
                 +-----------------------------------------------
                               chi2(1) =    87.50  Pr>chi2 = 0.0000
1(a) This sample is a cross-sectional sample and so we can either use
a chi-square test of independence, or a chi-square test of the homogeneity
of HIV risk across the IV drug use strata (groups). The one issue to recall
with this design is that since we have a cross-sectional sample we can talk
about the probability, or risk, of *having* disease (HIV+) rather than the
risk of *acquiring* disease. Stated equivalently, we can talk about disease
prevalence not incidence.
Define: p1 = probability of HIV+ if IV drug user
        p2 = probability of HIV+ if not an IV drug user
        H0: p1 = p2
        H1: p1 not equal to p2
Pearson's chi-square test yields X2=87.5 with 1 degree of freedom. The
probability of observing data as or more extreme (further from the null)
if the null hypothesis were true is <0.001 (the p-value). Therefore,
we conclude that the risk of HIV+ does depend on whether women report
IV drug use or not.
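As a check on the reported test statistic, here is a minimal sketch in Python (my choice of language for illustration; the analysis itself was done in Stata) that recomputes Pearson's chi-square from the four cell counts. It assumes the uncorrected 2x2 shortcut formula, and the variable names are only illustrative.

```python
import math

# Cell counts from the csi table: rows are HIV status, columns are IV drug use.
a, b = 61, 27    # HIV+ : IVDU, non-IVDU
c, d = 75, 312   # HIV- : IVDU, non-IVDU
n = a + b + c + d

# Pearson chi-square for a 2x2 table (shortcut form, no continuity correction).
x2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Upper tail of a chi-square with 1 df: P(Z^2 > x2) = erfc(sqrt(x2 / 2)).
p_value = math.erfc(math.sqrt(x2 / 2))

print(f"X2 = {x2:.2f}, p-value = {p_value:.3g}")
```

The statistic should agree with the chi2(1) = 87.50 line in the Stata output, and the p-value is far below 0.001.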
1(b) The risk difference is the difference between the probability of
HIV+ among the IVDU women and the probability of HIV+ among the non-IVDU
women. From the table we see that we estimate p1 as p1hat=61/136 and p2
as p2hat=27/339. The risk difference is estimated as p1hat - p2hat =
0.449 - 0.080 = 0.369. A 95% confidence interval for the risk difference
is (0.280, 0.457), capturing values of the risk difference, p1-p2, that are
consistent with the observed data. A summary sentence may be:
"Among women that report IV drug use we find 61/136 = 44.9% seropositive while
among women that do not report IV drug use we find 27/339 = 8.0% seropositive.
The increased risk associated with IV drug use can be summarized by the
risk difference 44.9% - 8.0% = 36.9% (95% confidence interval 28.0%, 45.7%).
These results suggest a relatively high prevalence of HIV among incarcerated
women that do not report IV drug use (8%), but an additional 36.9% are
seropositive among the women that do report IV drug use."
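The interval for the risk difference can be reproduced by hand. The sketch below assumes the usual large-sample (Wald) interval with unpooled standard error, which agrees with the Stata output to the digits shown; the function name is hypothetical.

```python
import math

def risk_difference_ci(x1, n1, x2, n2, z=1.959964):
    """Risk difference p1 - p2 with a large-sample (Wald) confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error of the difference of two independent proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rd = p1 - p2
    return rd, rd - z * se, rd + z * se

# 61 of 136 IVDU women and 27 of 339 non-IVDU women are HIV+.
rd, lower, upper = risk_difference_ci(61, 136, 27, 339)
print(f"RD = {rd:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```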
1(c) The risk ratio is simply the probability of HIV+ among the IVDU women
relative to the probability of HIV+ among the non-IVDU women. This is
estimated using the estimates p1hat and p2hat by RRhat = p1hat/p2hat =
(61/136)/(27/339) = 5.632 with a 95% confidence interval (3.748, 8.461). A summary
sentence might look quite similar to the previous:
"Among women that report IV drug use we find 61/136 = 44.9% seropositive while
among women that do not report IV drug use we find 27/339 = 8.0% seropositive.
The increased risk associated with IV drug use can be summarized by the
risk ratio 5.632 (95% confidence interval 3.748, 8.461).
These results suggest a relatively high prevalence of HIV among incarcerated
women that do not report IV drug use (8%), but more than a 5-fold increase
in the likelihood of being seropositive among women that also report IV drug
use."
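The risk ratio interval is built on the log scale. The following sketch assumes the standard large-sample formula for the standard error of log(RR), which matches the Stata interval to the digits shown; the function name is hypothetical.

```python
import math

def risk_ratio_ci(x1, n1, x2, n2, z=1.959964):
    """Risk ratio p1/p2 with a confidence interval computed on the log scale."""
    rr = (x1 / n1) / (x2 / n2)
    # Large-sample standard error of log(RR).
    se_log = math.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

rr, lower, upper = risk_ratio_ci(61, 136, 27, 339)
print(f"RR = {rr:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

Because the interval is symmetric on the log scale, it is not symmetric around the point estimate on the original scale.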
1(d) The odds ratio is similar to the risk ratio with the difference being
that the odds (i.e., p/(1-p)) are compared for the two groups. Again this is
estimated as follows:
estimated odds among IVDU : p1hat/(1-p1hat) = 61/75 = 0.813
estimated odds among non-IVDU: p2hat/(1-p2hat) = 27/312 = 0.0865
estimated odds ratio comparing IVDU to non-IVDU: 0.813/0.0865 = 9.399
A 95% confidence interval for the odds ratio is (5.611, 15.74). A summary
sentence might look like:
"Among women that report IV drug use we find 61/136 = 44.9% seropositive while
among women that do not report IV drug use we find 27/339 = 8.0% seropositive.
The increased risk associated with IV drug use can be summarized by the
odds ratio 9.399 (95% confidence interval 5.611, 15.74) that compares the odds
of seropositivity among IVDU women relative to the odds among non-IVDU women.
These results suggest a relatively high prevalence of HIV among incarcerated
women that do not report IV drug use (8%), but more than a 9-fold increase
in the odds of being seropositive among women that also report IV drug
use."
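The odds ratio point estimate is just the cross-product ratio of the table. The sketch below also computes a Woolf (log-scale) interval. Note that this is a different method from the Cornfield interval Stata reports, so I would expect the limits to be close to (5.611, 15.74) but not identical.

```python
import math

# Cell counts: a, b = HIV+ among IVDU / non-IVDU; c, d = HIV- among IVDU / non-IVDU.
a, b, c, d = 61, 27, 75, 312

# Cross-product ratio: (a/c) / (b/d) = ad / bc.
odds_ratio = (a * d) / (b * c)

# Woolf interval (an assumption here; Stata's output above uses Cornfield).
z = 1.959964
se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
woolf_lower = math.exp(math.log(odds_ratio) - z * se_log)
woolf_upper = math.exp(math.log(odds_ratio) + z * se_log)

print(f"OR = {odds_ratio:.3f}, Woolf 95% CI ({woolf_lower:.2f}, {woolf_upper:.2f})")
```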
2. This question focuses on the three variables: ICgroup, nurse0, and nurse6.
(a) Q: Did randomization work?
To answer this question we compare the baseline measurements of the ICgroup==0
and the ICgroup==1 groups. Specifically, for the nurse0 item we can use
"tabulate ICgroup nurse0, row chi" and obtain:
  Informed |  allocation knowledge
   Consent |        at t=0
     group |         0          1 |     Total
-----------+----------------------+----------
         0 |       192        308 |       500
           |     38.40      61.60 |    100.00
-----------+----------------------+----------
         1 |       191        309 |       500
           |     38.20      61.80 |    100.00
-----------+----------------------+----------
     Total |       383        617 |      1000
           |     38.30      61.70 |    100.00
          Pearson chi2(1) =   0.0042   Pr = 0.948
From this summary we find that 61.6% of the control group answered
correctly at baseline and 61.8% of the intervention group answered
correctly. Groups are comparable at baseline (as would be expected
by randomization).
(b) Compare the 6 month response for the two groups.
For this we can use either "cs" or "cc" to obtain the inference that
could be used to formally compare the two groups at the 6 month follow-up
visit.
. tabulate ICgroup nurse6, row chi
  Informed |  allocation knowledge
   Consent |        at t=6
     group |         0          1 |     Total
-----------+----------------------+----------
         0 |       174        326 |       500
           |     34.80      65.20 |    100.00
-----------+----------------------+----------
         1 |        83        417 |       500
           |     16.60      83.40 |    100.00
-----------+----------------------+----------
     Total |       257        743 |      1000
           |     25.70      74.30 |    100.00
          Pearson chi2(1) =  43.3671   Pr = 0.000
And,
. cs nurse6 ICgroup
                 | Informed Consent group |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |       417         326  |        743
        Noncases |        83         174  |        257
-----------------+------------------------+----------
           Total |       500         500  |       1000
                 |                        |
            Risk |      .834        .652  |       .743
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |             .182       |    .12902      .23498
      Risk ratio |         1.279141       |  1.186676    1.378811
 Attr. frac. ex. |         .2182254       |    .15731    .2747375
 Attr. frac. pop |         .1224764       |
                 +-----------------------------------------------
                               chi2(1) =    43.37  Pr>chi2 = 0.0000
From these displays we see that among the ICgroup==1 subjects 83.4% correctly
answered the nurse item while only 65.2% of the control group answered
correctly. The 18.2% improvement attributable to the intervention is
statistically significant (95% confidence interval 12.9%, 23.5%).
(c) Another analysis uses the pre/post data for the intervention group.
To obtain McNemar's test and the associated odds ratio we use "mcc".
. tabulate nurse0 nurse6 if ICgroup==1
allocation |  allocation knowledge
 knowledge |        at t=6
    at t=0 |         0          1 |     Total
-----------+----------------------+----------
         0 |        45        146 |       191
         1 |        38        271 |       309
-----------+----------------------+----------
     Total |        83        417 |       500
. mcc nurse6 nurse0 if ICgroup==1
                 | Controls               |
Cases            |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
         Exposed |       271         146  |        417
       Unexposed |        38          45  |         83
-----------------+------------------------+----------
           Total |       309         191  |        500
McNemar's chi2(1) =     63.39    Pr>chi2 = 0.0000
Exact McNemar significance probability       = 0.0000
Proportion with factor
        Cases       .834
        Controls    .618     [95% conf. interval]
                 ---------   --------------------
        difference  .216     .1643124   .2676876
        ratio       1.349515 1.253175    1.45326
        rel. diff.  .565445  .4736866   .6572035
        odds ratio  3.842105 2.672885   5.645144   (exact)
Note that the labels "cases" and "controls" aren't appropriate
for our analysis. Here the "cases" would be the observations taken
at month 6, and the controls would be the observations taken at
baseline.
Using McNemar's test allows us to assess the null hypothesis: Among
the ICgroup==1 subjects the probability of answering correctly at
month 6 is the same as the probability of answering correctly at
baseline. We obtain a chi-square of 63.4 and therefore reject the
null with p<0.001.
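McNemar's statistic depends only on the discordant pairs, that is, subjects whose answer changed between baseline and month 6. A minimal sketch of the hand calculation, using the counts from the pre/post table (variable names are illustrative):

```python
# Discordant pairs among the ICgroup==1 subjects.
gained = 146   # incorrect at baseline, correct at month 6
lost = 38      # correct at baseline, incorrect at month 6
n_pairs = 500  # total subjects in the intervention group

# McNemar's chi-square (1 df, no continuity correction).
mcnemar_chi2 = (gained - lost) ** 2 / (gained + lost)

# Matched-pairs odds ratio: ratio of the two discordant counts.
matched_or = gained / lost

# Change in the proportion correct, month 6 minus baseline.
diff = (gained - lost) / n_pairs

print(f"chi2 = {mcnemar_chi2:.2f}, OR = {matched_or:.3f}, difference = {diff:.3f}")
```

The 271 subjects who were correct at both times and the 45 who were incorrect at both times do not enter the test statistic or the odds ratio.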
(d) Let's focus on who understood the knowledge item at follow-up. Did
intervention "correct" the subjects that answered incorrectly at baseline
and/or "reinforce" those subjects that answered correctly at baseline?
First let's consider the subjects that answered incorrectly at baseline:
. cs nurse6 ICgroup if nurse0==0
                 | Informed Consent group |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |       146          91  |        237
        Noncases |        45         101  |        146
-----------------+------------------------+----------
           Total |       191         192  |        383
                 |                        |
            Risk |  .7643979    .4739583  |    .618799
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |         .2904396       |  .1976471     .383232
      Risk ratio |         1.612796       |  1.362649    1.908863
 Attr. frac. ex. |         .3799586       |  .2661352    .4761279
 Attr. frac. pop |         .2340673       |
                 +-----------------------------------------------
                               chi2(1) =    34.24  Pr>chi2 = 0.0000
From this summary we find that of the 191 subjects that answered incorrectly
at baseline in the intervention group, 146/191 = 76.4% answered correctly
at follow-up. In the control group only 91/192 = 47.4% of the subjects that answered
incorrectly at baseline answered correctly at follow-up. Therefore,
intervention improved the "correction" of incorrect understanding by
a statistically significant 29.0%.
(e) Among subjects that answered correctly at baseline:
. cs nurse6 ICgroup if nurse0==1
                 | Informed Consent group |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |       271         235  |        506
        Noncases |        38          73  |        111
-----------------+------------------------+----------
           Total |       309         308  |        617
                 |                        |
            Risk |  .8770227     .762987  |   .8200972
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |         .1140356       |  .0540666    .1740047
      Risk ratio |         1.149459       |  1.066456    1.238923
 Attr. frac. ex. |         .1300259       |  .0623151    .1928473
 Attr. frac. pop |         .0696384       |
                 +-----------------------------------------------
                               chi2(1) =    13.60  Pr>chi2 = 0.0002
Here we also find an effect of the intervention. Among the 309 intervention
subjects that answered correctly at baseline, 271/309 = 87.7% answered
correctly at follow-up compared to only 235/308 = 76.3% of the control
subjects. A confidence interval for the risk difference is (5.4%, 17.4%)
indicating a significant impact of intervention even among subjects that
answered correctly at baseline.
3. Regression thinking.
(a) In question 2(d) we can use the regression model:
P( nurse6==1 ) = E[ nurse6 ] = beta0 + beta1*ICgroup
where we keep in mind that we are restricting to those subjects that answered
incorrectly at baseline (i.e., nurse0==0).
In this model we have:
beta0 = the percent answering correctly at month 6 in the control group
(among subjects that answered incorrectly at baseline)
beta1 = the percent answering correctly at month 6 in the intervention group
MINUS
the percent answering correctly at month 6 in the control group
(among subjects that answered incorrectly at baseline)
Thus beta1 is the "risk difference" among nurse0==0 subjects.
(b) In question 2(e) we can use the regression model:
P( nurse6==1 ) = E[ nurse6 ] = beta0 + beta1*ICgroup
where we keep in mind that we are restricting to those subjects that answered
correctly at baseline (i.e., nurse0==1).
In this model we have:
beta0 = the percent answering correctly at month 6 in the control group
(among subjects that answered correctly at baseline)
beta1 = the percent answering correctly at month 6 in the intervention group
MINUS
the percent answering correctly at month 6 in the control group
(among subjects that answered correctly at baseline)
Thus beta1 is the "risk difference" among nurse0==1 subjects.
(c) We can combine these as:
P( nurse6==1 ) = beta0 + beta1*ICgroup + beta2*nurse0 + beta3*nurse0*ICgroup
In this model we have:
beta0 = the percent answering correctly at month 6 in the control group
(among subjects that answered incorrectly at baseline)
beta1 = the percent answering correctly at month 6 in the intervention group
MINUS
the percent answering correctly at month 6 in the control group
(among subjects that answered incorrectly at baseline)
beta0 + beta2 = the percent answering correctly at month 6 in the control
group among subjects that answered correctly at baseline.
beta1 + beta3 = the percent answering correctly at month 6 in the
intervention group MINUS the percent answering correctly
at month 6 in the control group among subjects that
answered correctly at baseline
So we find beta1 to be the "treatment effect" among subjects with nurse0==0,
and beta1+beta3 to be the treatment effect among subjects with nurse0==1.
Here "treatment effect" refers to the average response among intervention
subjects minus the average response among control subjects.
Alternatively,
beta2 = for the control group, the difference between the probability of
correctly answering nurse6 comparing subjects that answered correctly
at baseline, nurse0==1, to subjects that answered incorrectly at
baseline, nurse0==0.
beta3 = the difference in the effect of treatment comparing subjects that
answered correctly at baseline, nurse0==1, to the effect of
treatment among subjects that answered incorrectly at baseline.
Again, the "effect of treatment" specifically means the difference between
the percent answering correctly for the intervention subjects (ICgroup==1)
and the percent answering correctly for the control subjects (ICgroup==0).
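Because the model with the interaction term has four parameters for four ICgroup-by-nurse0 cells, it is saturated, so the fitted coefficients are simple contrasts of the observed cell proportions. The sketch below works these out from the counts in the two stratified "cs" tables of question 2. No regression was actually fit here; the variable names are illustrative.

```python
# Proportion answering correctly at month 6 in each (ICgroup, nurse0) cell,
# taken from the stratified cs tables in question 2.
p_control_incorrect = 91 / 192     # ICgroup==0, nurse0==0
p_interv_incorrect = 146 / 191     # ICgroup==1, nurse0==0
p_control_correct = 235 / 308      # ICgroup==0, nurse0==1
p_interv_correct = 271 / 309       # ICgroup==1, nurse0==1

beta0 = p_control_incorrect
beta1 = p_interv_incorrect - p_control_incorrect   # treatment effect, nurse0==0
beta2 = p_control_correct - p_control_incorrect    # baseline effect in controls
# Interaction: treatment effect at nurse0==1 minus treatment effect at nurse0==0.
beta3 = (p_interv_correct - p_control_correct) - beta1

for name, value in [("beta0", beta0), ("beta1", beta1),
                    ("beta2", beta2), ("beta3", beta3)]:
    print(f"{name} = {value:.4f}")
```

Here beta1 and beta1 + beta3 reproduce the two stratum-specific risk differences from the "cs" output, and the negative beta3 reflects the smaller treatment effect among subjects who already answered correctly at baseline.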
(d) The standard assumptions are:
Linearity -- not a concern since using 0/1 predictor variables.
Independence -- data are independent since using a single outcome (nurse6)
for each subject.
Normality -- clearly the data are not normal. However, this assumption
is not necessary for application to large data sets (more
than 100 observations).
Equal variance -- the errors, y-(X beta), do not have equal variances with
binary data.
Therefore, we clearly violate two of these assumptions (normality and equal variance) -- and will develop
logistic regression methods to allow regression analysis with binary response
data.
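To see the unequal variance concretely: a 0/1 outcome with success probability p has variance p(1-p), so the error variance changes whenever the mean changes. The sketch below plugs in the observed cell proportions from question 2 as estimates of p, which is an assumption made only for illustration.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 outcome with success probability p."""
    return p * (1 - p)

# Observed proportions correct at month 6 in each (group, baseline answer) cell.
cell_probs = {
    ("control", "incorrect at baseline"): 91 / 192,
    ("intervention", "incorrect at baseline"): 146 / 191,
    ("control", "correct at baseline"): 235 / 308,
    ("intervention", "correct at baseline"): 271 / 309,
}

variances = {cell: bernoulli_variance(p) for cell, p in cell_probs.items()}

for cell, var in variances.items():
    print(f"{cell[0]:>12}, {cell[1]:<21}: p = {cell_probs[cell]:.3f}, variance = {var:.3f}")
```

The variance is largest when p is near 0.5 and shrinks as p moves toward 0 or 1, so the four cells cannot share a common error variance unless their means are equal.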