QSCI 482 HOMEWORK 2, DUE FRIDAY, OCTOBER 11, 2002

1. This problem is about the relationship between the level of significance (alpha) and the "P-value" ["P" denotes probability]. For the example from Topic 1, "The Language of Singles Bars", the computed value of the test statistic was found to be 6.98. The "P-value" of .0305 in the notes was actually obtained via computer. Let's see how to get "close to it" using ONLY Table B.1, and interpret the P-value.

a. Using ONLY Table B.1, find the narrowest possible range for the probability that a chi-square variable with 2 degrees of freedom is at least as large as 6.98. Your answer should look like, "L < P-value < U", where "L" is the lower bound for this probability and "U" is the upper bound.

a. 0.025 < P(χ² ≥ 6.98) < 0.05

b. Now let's interpret. If the null hypothesis were true, the probability of seeing a test statistic as large as the one that we actually observed (6.98) is equal to what?--that's your answer to part [1a]. Now, compare your part [1a] answer to the specified level of significance for this test (.05), and explain how your answer to part [1a] lends evidence either for, or against, the null hypothesis of uniformity among the 3 categories.

b. Assuming the null hypothesis is true, the probability of seeing a test statistic as large as 6.98 is between 0.025 and 0.05, exclusive (the exact probability is 0.0305, as stated in the introduction to this problem). When assessing the P-value, we reject the null hypothesis if P ≤ α, and we fail to reject it if P > α. Here, since the entire range for the P-value (0.025 to 0.05, exclusive) lies below the level of significance (α = 0.05), we reject the null hypothesis and conclude that the 3 categories are not uniform.

P-values are probabilities and are represented by areas under the Chi-square curve. The same is true for the level of significance (remember that “rejection regions” for these Chi-square tests are graphically depicted as the area equal to α in the right tail of the curve, here 0.05). Larger Chi-square test statistics correspond to smaller P-values. In fact, as soon as the Chi-square test statistic gets large enough to pass the critical value that defines the edge of the “rejection region” (hence falling in the rejection region), the test statistic is “too large” (equivalently, the related P-value is “too small”, i.e. no larger than α) and we reject the null hypothesis.

Though not really clarified up to this point in the class, it is important to note that this really amounts to two different ways to do a hypothesis test. Why? Well, if using only the P-value approach, you never need to determine a critical value as we simply compare the P-value (derived from our test statistic) to the stated level of significance.
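To see that the two approaches lead to the same decision, here is a minimal Python sketch for the Topic 1 example. It relies on the fact that, for an even number of degrees of freedom, the chi-square upper-tail probability has a simple closed form (for df = 2 it is just exp(-x/2)). The function name chi2_sf_even_df is made up for illustration, and the critical value 5.991 is the usual table entry for α = 0.05 with 2 degrees of freedom; neither appears in the problem statement.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi-square >= x) for an even number of df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

alpha = 0.05
statistic = 6.98        # test statistic from the Topic 1 example
critical_value = 5.991  # table value for alpha = 0.05, df = 2 (assumed here)

p_value = chi2_sf_even_df(statistic, df=2)

# Approach 1: compare the P-value to alpha (no critical value needed).
reject_by_p_value = p_value <= alpha
# Approach 2: compare the test statistic to the critical value.
reject_by_critical_value = statistic >= critical_value

print(f"P-value = {p_value:.4f}")
print("Reject H0 (P-value approach):       ", reject_by_p_value)
print("Reject H0 (critical-value approach):", reject_by_critical_value)
```

The computed P-value should land inside the 0.025 to 0.05 bracket found from the table in part [1a], and both comparisons should agree on rejecting the null hypothesis.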

2. A biological oceanographer is studying the distribution of zooplankton in the water column. She wants to know if the zooplankton are uniformly distributed in the mixed layer (an oceanography term we don't need to worry about). She has divided that layer into 5 sub-layers and counted the number of zooplankton occurring in each after releasing 60 zooplankton into a large mesocosm (large enough that the zooplankton may be considered independent and don't eat each other!).

The zooplankton distributed themselves as follows:

Sub-layer      No. Zoopl.
Surface             6
Sub-surface         8
Mid                13
Lower Mid          15
Deep Mixed         18

a. If she analyzes this as a goodness-of-fit problem (treating the layers simply as named categories and not considering their order), what are her results? Include the p-value associated with your test statistic.

STEP 1:

H0 Biologically:

H0: The zooplankton are uniformly distributed in the mixed layer.

STEP 2:

Ha: The zooplankton are not uniformly distributed in the mixed layer.

STEP 3:

H0 Mathematically:

H0: P=(1/5)=P(surface)=P(sub-surface)=P(mid)=P(lower mid)=P(deep mixed).

Ha: At least one P ≠ 1/5.

STEP 4:

Choose test statistic:

Appropriate test is a Chi-square Goodness-of-Fit test with 4 degrees of freedom.

STEP 5:

Assumptions:

-Data are categorical.

-Sample observations are a random sample of the population and are independent.

STEP 6:

Significance level and critical value:

α = 0.05. So, our critical value is χ²_0.05;4 = 9.488.

STEP 7:

Compute test statistic:

Sub-layer      f_i   fhat_i   (f_i - fhat_i)²/fhat_i
Surface          6     12      3.0000
Sub-surface      8     12      1.3333
Mid             13     12      0.0833
Lower Mid       15     12      0.7500
Deep Mixed      18     12      3.0000
Sum                            8.1666

STEP 8:

Since χ²_OBS < χ²_CRIT (8.166 < 9.488), at α = 0.05 (0.05 < P(χ² ≥ 8.166) < 0.10) we fail to reject the null hypothesis; the data do not provide sufficient evidence that the distribution of zooplankton departs from uniformity across the mixed layer.
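The goodness-of-fit arithmetic for part (a) can be checked with a few lines of Python. This is only a sketch using the standard library; the P-value line uses the closed form exp(-x/2)(1 + x/2), which applies specifically to 4 degrees of freedom, and the variable names are chosen here for illustration. Working with unrounded terms gives a statistic of about 8.1667, a hair above the 8.1666 obtained by summing the rounded table entries.

```python
import math

observed = [6, 8, 13, 15, 18]   # counts per sub-layer, surface to deep mixed
n = sum(observed)               # 60 zooplankton released
k = len(observed)               # 5 sub-layers
expected = [n / k] * k          # 12 per sub-layer under uniformity

# Chi-square goodness-of-fit statistic: sum of (f - fhat)^2 / fhat.
chi2_stat = sum((f - fhat) ** 2 / fhat for f, fhat in zip(observed, expected))
df = k - 1                      # 4 degrees of freedom

# Upper-tail probability for df = 4 only: exp(-x/2) * (1 + x/2).
half = chi2_stat / 2
p_value = math.exp(-half) * (1 + half)

alpha = 0.05
print(f"chi-square = {chi2_stat:.4f}, df = {df}, P-value = {p_value:.4f}")
print("Reject H0:", p_value <= alpha)
```

The P-value computed this way should fall between 0.05 and 0.10, matching the bracket read from the table in Step 8.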

b. If she now recognizes that she has ordered categories and uses a test appropriate to ordered categories, what are her results? Include the p-value here, too.

STEP 1:

H0 Biologically:

H0: The zooplankton are uniformly distributed in the mixed layer.

STEP 2:

Ha: The zooplankton are not uniformly distributed in the mixed layer.

STEP 3:

H0 Mathematically:

H0: P=(1/5)=P(surface)=P(sub-surface)=P(mid)=P(lower mid)=P(deep mixed).

Ha: At least one P ≠ 1/5.

STEP 4:

Choose test statistic:

Since data can be logically ordered, the Kolmogorov-Smirnov test (k=5) is appropriate.

STEP 5:

Assumptions:

-Categories are ordered

-n=multiple of k

-Data are categorical.

-Sample observations are a random sample of the population and are independent.

STEP 6:

Significance level and critical value:

α = 0.05. (For critical value see Step 7).

STEP 7:

Compute test statistic:

Sub-layer      f_i   fhat_i   F_i   Fhat_i   |d_i|
Surface          6     12       6     12        6
Sub-surface      8     12      14     24       10
Mid             13     12      27     36        9
Lower Mid       15     12      42     48        6
Deep Mixed      18     12      60     60        0

dmax = 10

Critical value:

(dmax)_0.05;5;60=9

STEP 8:

Since d_OBS > d_CRIT (10 > 9), at α = 0.05 (0.02 < P(dmax ≥ 10) < 0.05) we reject the null hypothesis and hence conclude that the distribution of zooplankton is not uniform across the mixed layer.
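The cumulative-frequency bookkeeping behind dmax is easy to reproduce. The sketch below rebuilds the F_i, Fhat_i and |d_i| columns from the raw counts; the critical value 9 is simply copied from the table lookup in Step 7, and whether the boundary case counts as a rejection (>= versus >) is an assumption that does not matter here.

```python
from itertools import accumulate

observed = [6, 8, 13, 15, 18]      # counts per sub-layer, in depth order
n = sum(observed)                  # 60
k = len(observed)                  # 5 ordered categories
expected = [n // k] * k            # 12 each; exact because n is a multiple of k

cum_observed = list(accumulate(observed))   # F_i
cum_expected = list(accumulate(expected))   # Fhat_i

# |d_i| = |F_i - Fhat_i| for each category; dmax is the largest of these.
abs_diffs = [abs(F - Fhat) for F, Fhat in zip(cum_observed, cum_expected)]
d_max = max(abs_diffs)

d_crit = 9                         # (dmax)_0.05;5;60 as read from the table
reject = d_max >= d_crit           # boundary convention assumed; here 10 > 9 anyway

print("F_i    =", cum_observed)
print("Fhat_i =", cum_expected)
print("|d_i|  =", abs_diffs)
print("dmax =", d_max, "-> reject H0:", reject)
```

Because the cumulative sums depend on the order of the categories, shuffling the sub-layers would change dmax, which is exactly the ordering information the chi-square statistic ignores.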

c. Which was the "better" test to use and why?

Since the K-S test utilizes “extra” information (i.e. the ordered nature of the data), it is the better test to use in this case. By including this extra information, the K-S test is more likely to detect departures from uniformity than is the chi-square test.

3. The following data are frequencies of ferrets in two geographic areas, with and without a particular disease. Test the null hypothesis (use the .10 level of significance) that the prevalence of the disease is the same in both areas. Do this in three ways: compute X², X² (Yates) and X² (Cochran-Haber). Compare the three values of the test statistic and comment on their relationship to each other.

AREA      WITH DISEASE   WITHOUT DISEASE
Area 1         20               39
Area 2         16               51

STEP 1:

H0 Biologically:

H0: The prevalence of disease for ferrets is the same in the two geographic areas.

STEP 2:

Ha: The prevalence of disease for ferrets is not the same in the two geographic areas.

STEP 3:

H0 Mathematically:

H0: p_ij = (p_i.)(p_.j) for i = 1…r, j = 1…c

Ha: At least one p_ij ≠ (p_i.)(p_.j) for i = 1…r, j = 1…c

STEP 4:

Choose test statistic:

We will use a Chi-square test statistic; however, since df=1, the Yates-corrected and Cochran-Haber-corrected test statistics will be calculated in addition to the uncorrected Chi-square.

STEP 5:

Assumptions:

-Sample observations are independent.

-The data are collected as a random sample.

STEP 6:

Significance level and critical value:

α = 0.10. Critical value: χ²_0.10;1 = 2.706

STEP 7:

Compute test statistic:

Area      w/disease        w/o disease
Area 1    20 (16.8571)     39 (42.1429)
Area 2    16 (19.1429)     51 (47.8571)

(expected frequencies in parentheses)

uncorrected: χ² = [(20-16.8571)²/16.8571] + [(39-42.1429)²/42.1429] + [(16-19.1429)²/19.1429] + [(51-47.8571)²/47.8571] = 0.5860 + 0.2344 + 0.5160 + 0.2064 = 1.5428

Yates corrected: |20(51)-16(39)| = 396

396 > 126/2, so the Yates correction is appropriate.

χ²_Y = 126(396 - 126/2)²/[(36)(90)(59)(67)] = 1.0909

Cochran-Haber corrected: fhat_min=16.8571

Corresponding f_ij=20=f

20 < 2(16.8571), so use Cochran-Haber with D = 3.0 (the largest multiple of 0.5 that is less than |20 - 16.8571| = 3.1429)

χ²_C = (126³)(3²)/[(36)(90)(59)(67)] = 1.4057

STEP 8:

Since every χ²_OBS < χ²_CRIT (1.5428, 1.0909 and 1.4057 are all less than 2.706), we fail to reject the null hypothesis with all three tests at α = 0.10; the data do not provide sufficient evidence that the prevalence of disease among ferrets differs between the two geographic areas.

Note that Yates < Cochran-Haber < uncorrected
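For anyone who wants to verify the three statistics, the sketch below recomputes them from the 2x2 table using only the Python standard library. The rule used for the Cochran-Haber D (largest multiple of 0.5 below |f - fhat| when f ≤ 2·fhat, otherwise |f - fhat| - 0.5, taken at the cell with the smallest expected frequency) is an assumption based on the usual textbook definition; only the first case is exercised by these data. Variable names are illustrative.

```python
import math

table = [[20, 39],   # Area 1: with disease, without disease
         [16, 51]]   # Area 2: with disease, without disease

row_totals = [sum(row) for row in table]          # 59, 67
col_totals = [sum(col) for col in zip(*table)]    # 36, 90
n = sum(row_totals)                               # 126
margin_product = row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1]

# Expected frequencies under independence: (row total)(column total)/n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Uncorrected chi-square: sum over the four cells.
chi2_uncorrected = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)

# Yates correction: n(|ad - bc| - n/2)^2 / (product of the four margins).
cross = abs(table[0][0] * table[1][1] - table[0][1] * table[1][0])
chi2_yates = n * (cross - n / 2) ** 2 / margin_product

# Cochran-Haber correction, based on the cell with the smallest expected value.
fhat_min, f = min((expected[i][j], table[i][j]) for i in range(2) for j in range(2))
d = abs(f - fhat_min)
if f <= 2 * fhat_min:
    D = math.ceil(2 * d - 1) / 2   # largest multiple of 0.5 strictly below d
else:
    D = d - 0.5
chi2_cochran_haber = n ** 3 * D ** 2 / margin_product

critical_value = 2.706             # chi-square, alpha = 0.10, df = 1
for name, value in [("uncorrected", chi2_uncorrected),
                    ("Yates", chi2_yates),
                    ("Cochran-Haber", chi2_cochran_haber)]:
    print(f"{name:>13}: {value:.4f}  reject H0: {value >= critical_value}")
```

The printed values should agree with the hand calculation up to rounding in the last digit, and the ordering Yates < Cochran-Haber < uncorrected should be visible directly.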

b. For the ferrets of Area 1, estimate the probability of disease from the data.

p=20/59=0.3390

c. For the ferrets of Area 2, estimate the probability of disease from the data.

p=16/67=0.2388

d. For the ferrets of the two Areas combined, estimate the probability of disease from the data.

p=36/126=0.2857
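As a quick arithmetic check on parts (b) through (d), the estimated probabilities can be computed straight from the counts in the data table (20 diseased of 59 ferrets in Area 1, 16 diseased of 67 in Area 2). This is just a sketch; the dictionary layout is an illustrative choice.

```python
# Diseased counts and total counts per area, taken from the data table.
diseased = {"Area 1": 20, "Area 2": 16}
totals = {"Area 1": 20 + 39, "Area 2": 16 + 51}

# Per-area estimate: diseased ferrets divided by all ferrets in that area.
p_hat = {area: diseased[area] / totals[area] for area in diseased}

# Combined estimate: pool the counts first, then divide.
p_combined = sum(diseased.values()) / sum(totals.values())

for area, p in p_hat.items():
    print(f"{area}: {p:.4f}")
print(f"Combined: {p_combined:.4f}")
```

Note that the combined estimate is a weighted average of the two area estimates (weighted by sample size), so it must lie between them.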