QSCI 482    HOMEWORK 2, DUE FRIDAY, OCTOBER 11, 2002

 

1.  This problem is about the relationship between the level of significance (alpha) and the "P-value" ["P" denotes probability]. For the example from Topic 1, "The Language of Singles Bars", the computed value of the test statistic was found to be 6.98.  The "P-value" of .0305 in the notes was actually obtained via computer. Let's see how to get "close to it" using ONLY Table B.1, and interpret the P-value.

 

a. Using ONLY Table B.1, find the shortest possible range for the probability exceeded by the value 6.98 for a chi-square distribution with 2 degrees of freedom. Your answer should look like, "L < P-value < U", where "L" is the lower bound for this probability and "U" is the upper bound.

 

a. 0.025 < P(χ² ≥ 6.98) < 0.05

 

 

b. Now let's interpret. If the null hypothesis were true, the probability of seeing a test statistic as large as the one that we actually observed (6.98) is equal to what?--that's your answer to part [1a]. Now, compare your part [1a] answer to the specified level of significance for this test (.05), and explain how your answer to part [1a] lends evidence either for, or against, the null hypothesis of uniformity among the 3 categories.

 

b. Assuming the null hypothesis were true, the probability of seeing a test statistic as large as 6.98 is between 0.025 and 0.05, exclusive (the exact probability is 0.0305, as stated in the introduction to this problem). When assessing the P-value, we reject the null hypothesis if P ≤ α, or we fail to reject the null if P > α. Here, the entire range for the P-value (0.025 to 0.05, exclusive) lies below the level of significance (α = 0.05), so we reject the null and conclude that the 3 categories are not uniform.

 

P-values are probabilities and are represented by areas under the Chi-square curve. The same is true for the level of significance (remember that "rejection regions" for these Chi-square tests are graphically depicted as the area equal to α in the right tail of the curve, here 0.05). Larger Chi-square test statistics are related to smaller P-values. As a matter of fact, as soon as the Chi-square test statistic gets large enough that it passes the point that defines the edge of the "rejection region" (hence falling in the rejection region), the test statistic is "too large" (the related P-value is "too small") and we would reject the null hypothesis.

 

Though not really clarified up to this point in the class, it is important to note that this really amounts to two different ways to do a hypothesis test. Why? Well, if using only the P-value approach, you never need to determine a critical value as we simply compare the P-value (derived from our test statistic) to the stated level of significance.
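
The two approaches can be checked numerically. The sketch below relies on a special property of the chi-square distribution with 2 degrees of freedom, namely that its upper-tail probability is exp(-x/2); this shortcut is an assumption that holds only for df = 2, and the function name is made up for illustration. It is one way to reproduce the computer-derived P-value, not necessarily how the notes obtained it.

```python
import math

def chi2_upper_tail_df2(x):
    """P(chi-square >= x) for 2 degrees of freedom (closed form, df = 2 only)."""
    return math.exp(-x / 2.0)

alpha = 0.05
observed = 6.98  # test statistic from the Topic 1 example

# P-value approach: compare the tail area beyond the statistic to alpha.
p_value = chi2_upper_tail_df2(observed)

# Critical-value approach: find the x whose upper-tail area equals alpha.
# Solving exp(-x/2) = alpha gives x = -2 ln(alpha).
critical_value = -2.0 * math.log(alpha)

reject_by_p = p_value <= alpha
reject_by_critical = observed >= critical_value

print(f"P-value        = {p_value:.4f}")
print(f"critical value = {critical_value:.3f}")
print(f"reject H0 (P-value approach):        {reject_by_p}")
print(f"reject H0 (critical-value approach): {reject_by_critical}")
```

Both comparisons lead to the same decision because they describe the same right-tail area from two directions.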

 

 

2. A biological oceanographer is studying the distribution of zooplankton in the water column.  She wants to know if the zooplankton are uniformly distributed in the mixed layer (an oceanography term we don't need to worry about).  She has divided that layer into 5 sub-layers and counted the number of zooplankton occurring in each after releasing 60 zooplankton into a large mesocosm (large enough that the zooplankton may be considered independent and don't eat each other!).

 

The zooplankton distributed themselves as follows:

 

Sub-layer   No. Zoopl.

Surface           06   

Sub-surface       08   

Mid               13

Lower Mid         15

Deep Mixed        18

 

a.  If she analyzes this as a goodness-of-fit problem (treating the layers simply as named categories and not considering their order), what are her results?  Include the p-value associated with your test statistic.

 

STEP 1:

H0 Biologically:

H0: The zooplankton are uniformly distributed in the mixed layer.

 

STEP 2:

Ha: The zooplankton are not uniformly distributed in the mixed layer.

 

STEP 3:

H0 Mathematically:

H0: P(surface) = P(sub-surface) = P(mid) = P(lower mid) = P(deep mixed) = 1/5.

Ha: At least one P ≠ 1/5.

 

STEP 4:

Choose test statistic:

Appropriate test is a Chi-square Goodness-of-Fit test with 4 degrees of freedom.

 

STEP 5:

Assumptions:

-Data are categorical.

-Sample observations are a random sample of the population and are independent.

 

STEP 6:

Significance level and critical value:

α = 0.05. So, our critical value is χ²_{0.05;4} = 9.488.

 

STEP 7:

Compute test statistic:

Sub-layer         f_i   f̂_i    (f_i - f̂_i)²/f̂_i

Surface           06    12          3.0000     

Sub-surface       08    12          1.3333

Mid               13    12          0.0833

Lower Mid         15    12          0.7500

Deep Mixed        18    12          3.0000

Total                               8.1667

 

STEP 8:

Since χ²_OBS < χ²_CRIT (8.167 < 9.488), at α = 0.05 we fail to reject the null hypothesis (0.05 < P(χ² ≥ 8.167) < 0.10). The data do not provide sufficient evidence that the distribution of zooplankton departs from uniformity across the mixed layer.
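
The arithmetic in Steps 7 and 8 can be reproduced with a short sketch. The P-value here uses the closed-form upper tail for 4 degrees of freedom, exp(-x/2)(1 + x/2), which is an assumption valid only for df = 4; the function name is hypothetical.

```python
import math

observed = [6, 8, 13, 15, 18]   # surface ... deep mixed
n = sum(observed)               # 60 zooplankton
k = len(observed)               # 5 sub-layers
expected = n / k                # 12 per sub-layer under uniformity

# Chi-square goodness-of-fit statistic: sum of (f_i - fhat_i)^2 / fhat_i.
contributions = [(f - expected) ** 2 / expected for f in observed]
chi_sq = sum(contributions)

def chi2_upper_tail_df4(x):
    """P(chi-square >= x) for 4 degrees of freedom (closed form, df = 4 only)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

p_value = chi2_upper_tail_df4(chi_sq)

print("contributions:", [round(c, 4) for c in contributions])
print(f"chi-square = {chi_sq:.4f}, df = {k - 1}")
print(f"P-value    = {p_value:.4f}")
```

The exact P-value should land inside the bracket read from the table, 0.05 < P < 0.10.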

 

 

b.  If she now recognizes that she has ordered categories and uses a test appropriate to ordered categories, what are her results?  Include the p-value here, too.

 

STEP 1:

H0 Biologically:

H0: The zooplankton are uniformly distributed in the mixed layer.

 

STEP 2:

Ha: The zooplankton are not uniformly distributed in the mixed layer.

 

STEP 3:

H0 Mathematically:

H0: P(surface) = P(sub-surface) = P(mid) = P(lower mid) = P(deep mixed) = 1/5.

Ha: At least one P ≠ 1/5.

 

STEP 4:

Choose test statistic:

Since data can be logically ordered, the Kolmogorov-Smirnov test (k=5) is appropriate.

 

STEP 5:

Assumptions:

-Categories are ordered

-n=multiple of k

-Data are categorical.

-Sample observations are a random sample of the population and are independent.

 

STEP 6:

Significance level and critical value:

α = 0.05. (For critical value see Step 7).

 

STEP 7:

Compute test statistic:

Sub-layer         f_i   f̂_i    F_i  F̂_i     |d_i|

Surface           06    12     6    12       6   

Sub-surface       08    12     14   24       10

Mid               13    12     27   36       9

Lower Mid         15    12     42   48       6

Deep Mixed        18    12     60   60       0

 

d_max = 10

 

Critical value:

(d_max)_{0.05;5;60} = 9

                                

STEP 8:

Since d_OBS > d_CRIT (10 > 9), at α = 0.05 we reject the null hypothesis (0.02 < P(d_max ≥ 10) < 0.05) and hence conclude that the distribution of zooplankton is not uniform across the mixed layer.
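
The cumulative columns and d_max from Step 7 can be reproduced with the sketch below. The critical value of 9 is simply copied from the table lookup above rather than computed, and the variable names are illustrative.

```python
from itertools import accumulate

observed = [6, 8, 13, 15, 18]    # surface ... deep mixed, in depth order
n = sum(observed)                # 60
k = len(observed)                # 5
expected = [n // k] * k          # 12 each; n is a multiple of k

# Cumulative observed and expected frequencies (F_i and Fhat_i).
cum_observed = list(accumulate(observed))
cum_expected = list(accumulate(expected))

# |d_i| = |F_i - Fhat_i|; the test statistic is the largest of these.
abs_d = [abs(fo - fe) for fo, fe in zip(cum_observed, cum_expected)]
d_max = max(abs_d)

critical_value = 9               # (d_max) for alpha = 0.05, k = 5, n = 60, from the table
reject = d_max >= critical_value

print("F_i    :", cum_observed)
print("Fhat_i :", cum_expected)
print("|d_i|  :", abs_d)
print("d_max =", d_max, "reject H0:", reject)
```

Because the statistic is built from running totals, a steady drift with depth accumulates into a large d_max even when no single layer is far from its expected count.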

 

c.  Which was the "better" test to use and why?

 

Since the K-S test utilizes "extra" information (i.e., the ordered nature of the data), it is a better test to use in this case. By including this extra information, the K-S test is more likely to detect differences from uniformity than is the chi-square test.

 

3. The following data are frequencies of ferrets in two geographic areas, with and without a particular disease.  Test the null hypothesis (use the .10 level of significance) that the prevalence of the disease is the same in both areas. Do this in three ways: compute X², X² (Yates), and X² (Cochran-Haber). Compare the three values of the test statistic and comment on their relationship to each other.

 

 

            AREA  WITH DISEASE      WITHOUT DISEASE

 

            Area 1       20               39

 

            Area 2       16               51

 

STEP 1:

H0 Biologically:

H0: The prevalence of disease for ferrets is the same in the two geographic areas.

 

STEP 2:

Ha: The prevalence of disease for ferrets is not the same in the two geographic areas.

 

STEP 3:

H0 Mathematically:

H0: p_ij = (p_i.)(p_.j) for i = 1, …, r and j = 1, …, c

Ha: At least one p_ij ≠ (p_i.)(p_.j) for i = 1, …, r and j = 1, …, c

 

STEP 4:

Choose test statistic:

We will use a Chi-square test statistic. Because df = 1, the Yates-corrected and Cochran-Haber-corrected test statistics will be calculated in addition to the uncorrected Chi-square.

 

STEP 5:

Assumptions:

-Sample observations are independent.

-The data are collected as a random sample.

 

STEP 6:

Significance level and critical value:

α = 0.10. Critical value: χ²_{0.10;1} = 2.706

 

STEP 7:

Compute test statistic:

Area     w/disease     w/o disease

Area 1        20            39

           (16.8571)     (42.1429)

 

Area 2        16            51

           (19.1429)     (47.8571)

 

uncorrected: χ² = [(20 - 16.8571)²/16.8571] + [(39 - 42.1429)²/42.1429] + [(16 - 19.1429)²/19.1429] + [(51 - 47.8571)²/47.8571] = 0.5860 + 0.2344 + 0.5160 + 0.2064 = 1.5428

 

Yates corrected: |20(51)-16(39)| = 396

396 > 126/2 = 63, so the Yates correction is appropriate.

 

χ²_Y = 126(396 - 126/2)² / [(36)(90)(59)(67)] = 1.0909

 

Cochran-Haber corrected: f̂_min = 16.8571

Corresponding f_ij = 20 = f

 

20 < 2(16.8571) = 33.7142, so use Cochran-Haber with D = 3 (taking D as the largest multiple of 0.5 that is less than |f - f̂_min| = 3.1429)

 

χ²_C = (126)³(3)² / [(36)(90)(59)(67)] = 1.4057

 

STEP 8:

Since all χ²_OBS < χ²_CRIT (2.706), we fail to reject the null for all three tests at α = 0.10. The data do not provide sufficient evidence that the prevalence of disease among ferrets differs between the two geographic areas.

 

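Note that Yates < Cochran-Haber < uncorrected (1.0909 < 1.4057 < 1.5428). Both corrections shrink the test statistic, and the Yates correction shrinks it the most.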
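
The three statistics can be recomputed together as a check. This is a sketch under stated assumptions: the Cochran-Haber rule coded here (D is the largest multiple of 0.5 below |f - f̂_min| when f ≤ 2 f̂_min, otherwise |f - f̂_min| - 0.5) is the rule as understood from the calculation above, and the df = 1 P-value uses the identity P(χ² ≥ x) = erfc(√(x/2)).

```python
import math

table = [[20, 39],    # Area 1: with disease, without disease
         [16, 51]]    # Area 2

row_totals = [sum(row) for row in table]          # 59, 67
col_totals = [sum(col) for col in zip(*table)]    # 36, 90
n = sum(row_totals)                               # 126
denom = row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1]

# Expected frequencies under independence: (row total)(column total)/n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Uncorrected chi-square: sum over the four cells.
chi_sq = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))

# Yates correction: subtract n/2 from |ad - bc| before squaring.
cross = abs(table[0][0] * table[1][1] - table[0][1] * table[1][0])   # 396
chi_sq_yates = n * (cross - n / 2) ** 2 / denom

# Cochran-Haber correction (rule assumed as described in the lead-in).
f_hat_min, f = min((expected[i][j], table[i][j])
                   for i in range(2) for j in range(2))
d = abs(f - f_hat_min)
if f <= 2 * f_hat_min:
    D = (math.ceil(2 * d) - 1) / 2    # largest multiple of 0.5 strictly below d
else:
    D = d - 0.5
chi_sq_ch = n ** 3 * D ** 2 / denom

# Upper-tail P-value of the uncorrected statistic for 1 degree of freedom.
p_uncorrected = math.erfc(math.sqrt(chi_sq / 2.0))

print(f"uncorrected   = {chi_sq:.4f}")
print(f"Yates         = {chi_sq_yates:.4f}")
print(f"Cochran-Haber = {chi_sq_ch:.4f} (D = {D})")
print(f"P-value (uncorrected) = {p_uncorrected:.3f}")
```

The uncorrected value may differ from the hand calculation in the fourth decimal place because the hand calculation sums rounded terms.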

 

b. For the ferrets of Area 1, estimate the probability of disease from the data.

 

p=20/59=0.3390

 

c. For the ferrets of Area 2, estimate the probability of disease from the data.

 

p=16/67=0.2388

 

d. For the ferrets of the two Areas combined, estimate the probability of disease from the data.

 

p=36/126=0.2857
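
As a quick check on parts b through d, the sketch below recomputes the three estimates from the table counts. It also illustrates, as an added observation, that the pooled estimate is what generates the expected frequencies used in the chi-square test; variable names are illustrative.

```python
# (diseased, total) counts taken from the data table
area1_diseased, area1_total = 20, 59
area2_diseased, area2_total = 16, 67

p_area1 = area1_diseased / area1_total
p_area2 = area2_diseased / area2_total
p_pooled = (area1_diseased + area2_diseased) / (area1_total + area2_total)

# Under H0 both areas share the pooled prevalence, so the expected number
# of diseased ferrets in Area 1 is the pooled estimate times Area 1's total.
expected_area1_diseased = p_pooled * area1_total

print(f"Area 1: {p_area1:.4f}")
print(f"Area 2: {p_area2:.4f}")
print(f"Pooled: {p_pooled:.4f}")
print(f"Expected diseased in Area 1 under H0: {expected_area1_diseased:.4f}")
```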