Exercise 6: Using EXCEL to plot and calculate statistics.

In this exercise we will use the capabilitiesof EXCEL to plot and calculate the statistics of a set of data.


Example 1. Sediment size data.


Grain size data are typically obtained by passing the (sediment) sample through a series of standard sieves in such a way that the fraction of the sample between sucessive sieve sizes can be determined by weighing. This process is in itself a mechanical way of measuring the statistical distribution of the size of the particles making up the sediment sample. The raw data are then size bins versus the fraction in the bin, eg

bin size(mm)----fraction ------ description

2-1----------------.1------------Very course sand

1-1/2--------------.2 ---------------course sand

1/2-1/4----------- .3 ------------- medium sand

1/4-1/8----------- .3 --------------- fine sand

1/8-1/16-----------.1--------------Very fine sand

1/16-1/32---------- 0 -------------- Course silt


Now it has been found empirically that most sediments are sorted in a log normal way. This has to do with the physics of the weathering and sorting process. This implies that if the data are plotted as fraction versus the log of the size range, then the data appear to be distributed in a Gaussian (normal) manner.

For this reason the data are converted to a phi scale by converting the size in mm to phi by the equation

phi = -log (size) (log is the base 2 log). ie size = 1/2^phi.

This is why the scales go in increments of 2 and not linearly. We also see that a size of 1 mm = 0 phi and that as phi increases the size decreases. Sizes greater than 1 mm have negative phi.

Now clearly the degree of sorting and the size ranges are going to characterize certain sediment types. So, for example, fine clays have a phi of 10-11, fine silts have a phi of 6-7, fine sands a phi of 2-3, and pebbles have a phi of -6 to -5. Within a range the degree of sorting is measured by the standard deviation and other statistical measures such a the cumulative percent.

Sample data

3 sets of data are listed below together with other relavant information.

size sample1 sample2 sampl3

mm weight percent

8-16 2 0 0

4-8 4 0 0

2-4 16 0 0

1-2 9 5 0

1/2-1 6 9 0

1/4-1/2 13 18 1

1/8-1/4 17 23 3

1/16-1/8 19 25 5

1/32-1/16 10 15 10

1/64-1/32 3 4 15

1/128-1/64 1 1 28

1/256-1/128 0 0 20

1/512-1/256 0 0 10

1/1024-1/512 0 0 6

1/2048-1/1024 0 0 2

1/4096-1/2048 0 0 0

Sample 1 is from the upper forshore, in surf zone.

Sample 2 is from a bench at about 30 M depth.

Sample 3 is from a deep water site at 1000 M depth.


What we want from the data

First thing is to convert from mm to phi size.

Then we need to make plots of histograms and cumulative frequency for each sample as a function of phi. .

From these data and plots we will be able to characterize and compare the samples and comment on the physical processes giving rise to the sample.

All of the plotting can be done with the EXCEL graphing option.



Statistics for Gaussian (or Normal) distributions.

Math background;

Let us measure some property, such as the number of cars passing a particular intersection from 0800 to 0815 each working day. We assume that the probability of having a particular number on any day is given by the Gaussian equation

f(x)= (1/sigma*sqr(2*pi))* exp(-0.5*(x-m)^2/sigma^2

where sigma is called the standard deviation

x would be the number of cars on any day

m is the mean number for all days = total number of cars for all days/ number of days(n)

f(x) is the probability of having the value x

and ^ means to square.

If exactly the same number of cars come by each day the standard deviation (sigma) will be very small. Conversly if the number of cars per day varied widely the standard deviation would be large.

You can see why this is from the definition sigma=sqr (sum(x-m)^2/n), where sum is over all days (n) .

Note that the VARIANCE is = sigma^2

It can be shown that 68.3 % of the distribution lies between the mean m+sigma and m-sigma and that 95 % lies between m-sigma*2 and m+sigma*2

Another statistic is called the skewness. It is defined by

skewness = (sum(x-m)^3/sigma^3)/n

The skewness is a measure of how different (or skewed ) the data are from a Gaussian distribution.

Kurtosis is defined by


and is a measure of how well the tails of the data fit a Gaussian distribution.

Other definitions are;

Mode: the data value with the most frequent occurrence

Median: That value of x for which half of the population is greater than and half less than x.

Example 2. Generating random data and getting statistics

Step 1. Use the random number generator to generate 128 random numbers between -1 and 1. That is 2*(RAND()-0.5)

Note: The random number generator RAND() generates volatile values. That is the values change very time a calculation is performed in the workbook. To make them non-volatile I suggest generating the set of numbers in a temporary worksheet, copying and pasting them to a word processor like NOTEPAD, saving the NOTEPAD data and then copying an pasting the data back to the EXCEL worksheet where yu want them.

Step 2 Use the Tools/data analysis/descriptive statistics to calculate the statistics for these data. You will see that it gives you the mean, mode, median, skewness, kurtosis and more for the data.

Step 3 Use the Tools/data analysis/histogram to generate a histogram of the data. You will see that all data values between 0 and 1 are equally probable (as one would hope). Also use the cumulative frequency option since this is useful information.

EXAMPLE 3. Generating gaussian data and getting the statistics

Step 1 To produce data having a gaussian distribution we can simply smooth the random data in example 1 by averaging over a few points and generate a new data series.

For example if the random data are a1:a128 then the gaussian data will be b1=sum(a1:a3)/3 and so forth for b2:b1126

Repeat the steps .2 and .3 in the example 1 to calculate the statistics and the histogram You should now see that the histogram and the statistical info now look much more like a gaussian distribition.

Keep trying this example with different averaging lengths and you will see how the standard deviation gets smaller and smaller as the averaging gets longer. Why is this ???