QSCI 482 HOMEWORK 2, DUE FRIDAY, OCTOBER 11, 2002
1. This problem is about the relationship between the level of
significance (alpha) and the "P-value" ["P" denotes probability].
For the example from Topic 1, "The Language of Singles Bars", the computed
value of the test statistic was found to be 6.98. The "P-value" of
.0305 in the notes was actually obtained via computer. Let's see how to get
"close to it" using ONLY Table B.1, and interpret the P-value.
a. Using ONLY Table B.1, find the shortest possible range for the probability
exceeded by the value 6.98 for a chi-square distribution with 2 degrees of
freedom. Your answer should look like, "L < P-value < U", where "L" is
the lower bound for this probability and "U" is the upper bound.
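a. From Table B.1 with 2 degrees of freedom, χ²(0.05; 2) = 5.991 and
χ²(0.025; 2) = 7.378. Since 5.991 < 6.98 < 7.378,
0.025 < P(χ² ≥ 6.98) < 0.05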
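As a cross-check on the table lookup, the exact tail area can be computed directly. The following is a minimal sketch, assuming the closed form P(χ² ≥ x) = exp(-x/2), which holds only for 2 degrees of freedom. The function name chi2_sf_df2 is made up for illustration, and the two bracketing values are the standard table entries for 2 df.

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability P(chi-square >= x) for 2 degrees of freedom.

    With df = 2 the chi-square distribution is exponential with mean 2,
    so the tail area has the closed form exp(-x / 2).
    """
    return math.exp(-x / 2.0)

x_obs = 6.98
p_value = chi2_sf_df2(x_obs)

# Table critical values for 2 df at upper-tail areas 0.05 and 0.025.
crit_05, crit_025 = 5.991, 7.378

print(f"P-value = {p_value:.4f}")
print("statistic bracketed by table values:", crit_05 < x_obs < crit_025)
print("P-value bracketed by table areas:   ", 0.025 < p_value < 0.05)
```

The computed value agrees with the .0305 quoted in the notes and falls inside the range found from the table.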
b. Now let's interpret. If the null hypothesis were true, the
probability of seeing a test statistic as large as the one that we actually
observed (6.98) is equal to what?--that's your answer to part [1a]. Now,
compare your part [1a] answer to the specified level of significance for
this test (.05), and explain how your answer to part [1a] lends evidence
either for, or against, the null hypothesis of uniformity among the 3
categories.
b. Assuming the null hypothesis were true, the probability of
seeing a test statistic at least as large as 6.98 is between 0.025 and 0.05, exclusive
(the exact probability is 0.0305, as stated in the introduction to this
problem). Under the P-value approach, we reject the null hypothesis if P ≤ α and fail to
reject it if P > α. Here the entire range for the P-value (0.025 to 0.05,
exclusive) lies below the level of significance (α = 0.05), so we
reject the null and conclude that the 3 categories are not uniform.
P-values are probabilities and are represented by areas under the
Chi-square curve. The same is true for the level of significance (remember that
"rejection regions" for these Chi-square tests are graphically depicted as the
area equal to α in the right tail of the curve, here 0.05). Larger Chi-square
test statistics correspond to smaller P-values. As soon as
the Chi-square test statistic exceeds the critical value that defines the edge of
the "rejection region" (hence falling in the rejection region), its P-value drops
below α: the test statistic is "too large" (the related P-value is
"too small") and we reject the null hypothesis.
Though not really clarified up to this point in the class, it is
important to note that this amounts to two different ways to do a
hypothesis test, and they always lead to the same decision. Why? With the
P-value approach, we never need to determine a critical value; we simply
compare the P-value (derived from our test statistic) to the stated level of
significance.
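The agreement between the two approaches can be checked numerically. This is a sketch only, again assuming the 2-degrees-of-freedom closed form exp(-x/2) for the tail area and its inverse -2 ln(α) for the critical value; both helper names are hypothetical.

```python
import math

def chi2_sf_df2(x):
    """Tail area P(chi-square >= x) for df = 2 (closed form)."""
    return math.exp(-x / 2.0)

def chi2_crit_df2(alpha):
    """Critical value for df = 2: solve exp(-x / 2) = alpha for x."""
    return -2.0 * math.log(alpha)

alpha = 0.05
x_obs = 6.98

# Approach 1: compare the test statistic to the critical value.
reject_by_critical_value = x_obs >= chi2_crit_df2(alpha)

# Approach 2: compare the P-value to the level of significance.
reject_by_p_value = chi2_sf_df2(x_obs) <= alpha

print("critical value:", round(chi2_crit_df2(alpha), 3))
print("reject (critical-value approach):", reject_by_critical_value)
print("reject (P-value approach):       ", reject_by_p_value)

# The two rules agree for other statistic values as well.
for x in (1.0, 4.0, 5.5, 6.5, 10.0):
    assert (x >= chi2_crit_df2(alpha)) == (chi2_sf_df2(x) <= alpha)
```

Because the tail area shrinks as the statistic grows, "statistic beyond the critical value" and "P-value at or below α" describe the same event.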
2. A biological oceanographer is studying the distribution of
zooplankton in the water column. She wants to know if the zooplankton
are uniformly distributed in the mixed layer (an oceanography term we don't
need to worry about). She has divided that layer into 5 sub-layers
and counted the number of zooplankton occurring in each after releasing 60
zooplankton into a large mesocosm (large enough that the zooplankton may be
considered independent and don't eat each other!).
The zooplankton distributed themselves as follows:
Sub-layer      No. Zoopl.
Surface         6
Sub-surface     8
Mid            13
Lower Mid      15
Deep Mixed     18
a. If she analyzes this as a goodness-of-fit problem (treating the layers
simply as named categories and not considering their order), what are her
results? Include the p-value associated with your test statistic.
STEP 1:
H0 Biologically:
H0: The zooplankton are uniformly distributed in the mixed layer.
STEP 2:
Ha: The zooplankton are not uniformly distributed in the mixed layer.
STEP 3:
H0 Mathematically:
H0: P(surface) = P(sub-surface) = P(mid) = P(lower mid) = P(deep mixed) = 1/5.
Ha: At least one P ≠ 1/5.
STEP 4:
Choose test statistic:
Appropriate test is a Chi-square Goodness-of-Fit test with 4
degrees of freedom.
STEP 5:
Assumptions:
-Data are categorical.
-Sample observations are a random sample of the population and
are independent.
STEP 6:
Significance level and critical value:
α = 0.05. So, our critical value is χ²(0.05; 4) = 9.488.
STEP 7:
Compute test statistic:
Under H0 each expected frequency is fihat = n/k = 60/5 = 12.
Sub-layer      fi   fihat   (fi - fihat)²/fihat
Surface         6    12     3.0000
Sub-surface     8    12     1.3333
Mid            13    12     0.0833
Lower Mid      15    12     0.7500
Deep Mixed     18    12     3.0000
Total          60    60     8.1667
STEP 8:
Since χ²OBS < χ²CRIT (8.167 < 9.488), at α = 0.05 (0.05 < P(χ² ≥ 8.167) < 0.10) we fail to reject the null hypothesis; the data do not
provide sufficient evidence that the distribution of zooplankton across the mixed
layer is non-uniform.
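The Step 7 arithmetic and the P-value range can be reproduced with a short script. This is a sketch under one stated assumption: for 4 degrees of freedom the tail area has the closed form exp(-x/2)(1 + x/2), so no statistics library is needed. The helper name chi2_sf_df4 is invented here.

```python
import math

def chi2_sf_df4(x):
    """Tail area P(chi-square >= x) for df = 4 (closed form for even df)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

observed = [6, 8, 13, 15, 18]      # Surface ... Deep Mixed
n = sum(observed)                  # 60 zooplankton
k = len(observed)                  # 5 sub-layers
expected = n / k                   # 12 per sub-layer under uniformity

# Chi-square goodness-of-fit statistic: sum of (f - fhat)^2 / fhat.
chi2_obs = sum((f - expected) ** 2 / expected for f in observed)
p_value = chi2_sf_df4(chi2_obs)

chi2_crit = 9.488                  # table value for alpha = 0.05, df = 4
print(f"chi-square = {chi2_obs:.4f}, P-value = {p_value:.4f}")
print("reject H0:", chi2_obs > chi2_crit)
```

The exact P-value lands inside the 0.05 to 0.10 range read from the table, so both approaches give the same decision.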
b. If she now recognizes that she has ordered categories and
uses a test appropriate to ordered categories, what are her results?
Include the p-value here, too.
STEP 1:
H0 Biologically:
H0: The zooplankton are uniformly distributed in the mixed layer.
STEP 2:
Ha: The zooplankton are not uniformly distributed in the mixed layer.
STEP 3:
H0 Mathematically:
H0: P(surface) = P(sub-surface) = P(mid) = P(lower mid) = P(deep mixed) = 1/5.
Ha: At least one P ≠ 1/5.
STEP 4:
Choose test statistic:
Since data can be logically ordered, the Kolmogorov-Smirnov test
(k=5) is appropriate.
STEP 5:
Assumptions:
-Categories are ordered
-n=multiple of k
-Data are categorical.
-Sample observations are a random sample of the population and
are independent.
STEP 6:
Significance level and critical value:
α = 0.05. (For critical value see Step 7).
STEP 7:
Compute test statistic:
Sub-layer      fi   fihat   Fi   Fihat   |di|
Surface         6    12      6    12      6
Sub-surface     8    12     14    24     10
Mid            13    12     27    36      9
Lower Mid      15    12     42    48      6
Deep Mixed     18    12     60    60      0
(Fi and Fihat are the cumulative observed and expected frequencies; |di| = |Fi - Fihat|.)
dmax = 10
Critical value:
(dmax)(0.05; 5; 60) = 9
STEP 8:
Since dOBS > dCRIT (10 > 9), at α = 0.05 (0.02 < P(dmax ≥ 10) < 0.05) we
reject the null hypothesis and hence conclude that the distribution of
zooplankton is not uniform across the mixed layer.
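The cumulative columns and dmax from Step 7 can be sketched as follows. The critical value 9 is simply copied from the solution's table lookup and is not computed here; the variable names are illustrative.

```python
from itertools import accumulate

observed = [6, 8, 13, 15, 18]        # ordered: Surface ... Deep Mixed
n, k = sum(observed), len(observed)
expected = [n // k] * k              # 12 each; relies on n being a multiple of k

# Cumulative observed (Fi) and cumulative expected (Fihat) frequencies.
cum_obs = list(accumulate(observed))
cum_exp = list(accumulate(expected))

# |di| = |Fi - Fihat| for each sub-layer; dmax is the largest of them.
abs_diffs = [abs(F - Fhat) for F, Fhat in zip(cum_obs, cum_exp)]
d_max = max(abs_diffs)

d_crit = 9                           # table value quoted in the solution
print("Fi    :", cum_obs)
print("Fihat :", cum_exp)
print("|di|  :", abs_diffs)
print("dmax =", d_max, "reject H0:", d_max > d_crit)
```

Because the observed counts lag behind the expected counts in every upper layer, the cumulative differences build up, which is exactly the ordered pattern this test is designed to pick up.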
c. Which was the "better" test to use and why?
Since the K-S test utilizes "extra" information (i.e., the ordered
nature of the data), it is the better test to use in this case. By including this
extra information, the K-S test is more likely to detect differences from
uniformity than is the chi-square test.
3. The following data are frequencies of ferrets in two geographic
areas, with and without a particular disease. Test the null
hypothesis (use the .10 level of significance) that the prevalence of the disease is
the same in both areas. Do this in three ways: compute X², X² (Yates),
and X² (Cochran-Haber). Compare the three values of the test statistic and comment on their
relationship to each other.
AREA      WITH DISEASE   WITHOUT DISEASE
Area 1    20             39
Area 2    16             51
STEP 1:
H0 Biologically:
H0: The prevalence of disease for ferrets is the same in the two geographic areas.
STEP 2:
Ha: The prevalence of disease for ferrets is not the same in the two geographic areas.
STEP 3:
H0 Mathematically:
H0: pij = (pi.)(p.j) for i = 1, ..., r and j = 1, ..., c (here r = c = 2)
Ha: At least one pij ≠ (pi.)(p.j) for i = 1, ..., r and j = 1, ..., c
STEP 4:
Choose test statistic:
We will use a Chi-square test statistic. In addition to the
uncorrected Chi-square, the Yates-corrected and Cochran-Haber-corrected test
statistics will be calculated, since df = 1.
STEP 5:
Assumptions:
-Sample observations are independent.
-The data are collected as a random sample.
STEP 6:
Significance level and critical value:
α = 0.10. Critical value: χ²(0.10; 1) = 2.706
STEP 7:
Compute test statistic:
Observed frequencies, with expected frequencies in parentheses:
Area      w/disease       w/o disease
Area 1    20 (16.8571)    39 (42.1429)
Area 2    16 (19.1429)    51 (47.8571)
uncorrected: χ² = [(20-16.8571)²/16.8571] + [(39-42.1429)²/42.1429] + [(16-19.1429)²/19.1429] + [(51-47.8571)²/47.8571] = 0.5860 + 0.2344 + 0.5160 + 0.2064 = 1.5428
Yates corrected: |20(51) - 16(39)| = 396
396 > 126/2 = 63, so the Yates correction is appropriate.
χ²Y = [(396 - 126/2)²](126) / [(36)(90)(59)(67)] = 1.0909
Cochran-Haber corrected: fhatmin = 16.8571
Corresponding fij = 20 = f
20 < 2(16.8571), so use Cochran-Haber with D = 3 (the largest multiple of 0.5 below |20 - 16.8571| = 3.1429)
χ²C = (126³)(3²) / [(36)(90)(59)(67)] = 1.4057
STEP 8:
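Since all χ²OBS < χ²CRIT (1.5428, 1.0909 and 1.4057 are each below 2.706), we fail to reject the null for
all three tests at α = 0.10; the data give no evidence that the prevalence of
disease among ferrets differs between the two geographic locations. Both
corrections lower the test statistic: Yates gives the smallest value,
Cochran-Haber an intermediate one, and the uncorrected statistic the largest
(1.0909 < 1.4057 < 1.5428).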
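The three statistics can be recomputed from the 2x2 table in a few lines. This is a sketch rather than a reference implementation: it uses the shortcut form n(ad - bc)²/(R1 R2 C1 C2) for the uncorrected statistic, and the Cochran-Haber rule for D is inferred from the worked numbers above (largest multiple of 0.5 strictly below |f - fhat| when f ≤ 2 fhat). The other branch of that rule is not exercised by this problem and is deliberately left out.

```python
import math

# Observed 2x2 table: rows are areas, columns are with / without disease.
a, b = 20, 39   # Area 1
c, d = 16, 51   # Area 2

row1, row2 = a + b, c + d            # 59, 67
col1, col2 = a + c, b + d            # 36, 90
n = row1 + row2                      # 126
denom = row1 * row2 * col1 * col2
cross = abs(a * d - b * c)           # |20(51) - 16(39)| = 396

chi2_plain = n * cross ** 2 / denom
chi2_yates = n * (cross - n / 2) ** 2 / denom

# Cochran-Haber: find the smallest expected frequency and its observed cell.
cells = [(a, row1 * col1 / n), (b, row1 * col2 / n),
         (c, row2 * col1 / n), (d, row2 * col2 / n)]
f, fhat = min(cells, key=lambda cell: cell[1])
if f > 2 * fhat:
    raise ValueError("f > 2*fhat branch is not covered by this sketch")
# Assumed rule: D is the largest multiple of 0.5 strictly below |f - fhat|.
D = (math.ceil(abs(f - fhat) / 0.5) - 1) * 0.5
chi2_haber = n ** 3 * D ** 2 / denom

for name, value in [("uncorrected", chi2_plain), ("Yates", chi2_yates),
                    ("Cochran-Haber", chi2_haber)]:
    print(f"{name:14s} {value:.4f}  reject at 0.10: {value > 2.706}")
```

Computed without intermediate rounding, the uncorrected statistic comes out a hair below the hand-summed 1.5428; the difference is rounding in the four terms and does not affect the decision.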
b. For the ferrets of Area 1, estimate the probability of disease
from the data.
p=20/59=0.3390
c. For the ferrets of Area 2, estimate the probability of disease
from the data.
p=16/67=0.2388
d. For the ferrets of the two Areas combined, estimate the
probability of disease from the data.
p=36/126=0.2857
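These estimates are just sample proportions, which a few lines can confirm. A small sketch with illustrative variable names, using only the counts from the data table:

```python
# Counts from the data table: diseased ferrets and total ferrets per area.
diseased = {"Area 1": 20, "Area 2": 16}
totals = {"Area 1": 20 + 39, "Area 2": 16 + 51}

# Per-area estimate: diseased / total within that area.
p_area = {area: diseased[area] / totals[area] for area in diseased}

# Combined estimate: pool the counts first, then divide.
p_pooled = sum(diseased.values()) / sum(totals.values())

for area, p in p_area.items():
    print(f"{area}: {p:.4f}")
print(f"Combined: {p_pooled:.4f}")
```

The combined estimate is a sample-size-weighted average of the two area estimates, so it necessarily falls between them.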