Introduction to fitting multiple modes to distributions |
Here are two approaches for fitting multiple modes to univariate data: using histogram data and probability density functions.
The method of Macdonald & Green (1988) analyses histogram data which is the most frequent representation for plant population data. It finds the best fit for component distributions using chi-squared approximation to the likelihood ratio test using grouped data (Macdonald & Pitcher 1979; Macdonald 1987). Grouping data simplifies calculation of maximum likelihood estimates but Macdonald & Green warn against histogram groups with zero data or small numbers. The mixdist package, an implementation in R of the MIX package (Macdonald & Green 1988) can fit other component distribution types than just the normal including lognormal and gamma. mixdist calculates parameter values and their standard errors for each distribution in the mixture and chi-square for the fit between empirical histogram and the histogram calculated from the sum of the mixture distributions. mixdist can be used in exploratory analysis of numbers and types of component distributions that give effective fits. |
|
Using the R programming language you can investigate fitting a bimodal distribution to histogram data using mixdist. You must first install the package and then load it. The data used as an example is of plant height and plant weight of 250 dwarf marigold plants 42 days after sowing closely together. A histogram of the height data (shown at right) can be obtained by the R command: > hist(Ht.data) The scale of the x axis is centimeters and frequency is number of plants. The function hist calculates the number of bins into which values are sorted. This can be regulated using the breaks parameter of that function. |
In order to use mixdist the data for a histogram must be presented for the bins, sequentially and in increasing ordinate value, with the upper division value of a binfollowed by the number of values in that bin. The last bin must have with an upper division of infinity. On the left is a function that divides data into a specified number of bins and outputs the data in the form required by mixdist using the as.mixdata function. |
function(ddata, bins, resolution){ |
Th following code takes the output from the above function, mix.dist.hist, and provides a summary of the fit obtained. The output from plot(fitdummy1) is shown on the right. > fitdummy1 <-mix(mix.dist.hist, mixparam(mu=c(10, 15), sigma=c(1,1)),
"norm") Standard Errors: Analysis of Variance Table Df Chisq Pr(>Chisq) > plot(fitdummy1) |
The individual normal distributions are represented by dashed lines and their sum by the solid line. Estimates of the mean values are indicated by arrow head marks on the x axis |
The goodness-of-fit is the chi-squared approximation to the likelihood ratio statistic (see Users Guide for MIX). It is possible to fit different numbers of mixtures. An obvious case to try is that of a single distribution using the instruction: > fitdummy1 <-mix(mix.dist.hist, mixparam(mu=c(10, 15), sigma=c(1,1)),
"norm") Standard Errors: Analysis of Variance Table Df Chisq Pr(>Chisq) > plot(fitdummy1) |
Although the significance of the fit is greater than that for two modes the residuals between fitted line and histogram have a marked pattern with four central bins all being less than the fitted line while, on either side, there are bins with greater values than the fitted line. |
An alternative approach to fitting mixture distributions is based on probability density functions rather than histograms. Maximum likelihood estimates of parameters are found using an expectation-maximization (EM) algorithm (Leisch 2004) as implemented in the R package flexmix (Leisch and Gruen 2010). Models with different numbers of mixture components can be compared using the Akaike and Bayesian Information criteria. |
|
To use flexmix the data must be in the form of a data frame. The instruction: > fl <- flexmix(Ht.data~1, data = ht.data, k=2) With this method of calculation the mixture of two distributions is a better fit (having lower AIC and BIC values) than a single distribution. Everitt and Hothorn (2006) give R code for plotting the estimated distributions relative to a histogram of data. |
Return to "Methods and Software for Analysis of Population Structure"
References
Everitt, B.S. & Hothorn, T. (2006) A Handbook of Statistical Analyses Using R. Chapman & Hall, Boca Rotan. |
|
Leisch, F. (2004) A general framework for finite mixture modesl and latent class regression in R. Journal of Statistical Software, 11. http://www.jstatsoft.org/v11/i08/ |
|
Leisch, F., Gruen, B. (2010) flexmix: Flexible mixture modelling. The Comprehensive R Archive Network, http://cran.r-project.org/index.html. |
|
Macdonald, P.D.M. (1987) Analysis of length-frequency distributions. Age and Growth of Fish (eds R.C. Summerfelt & G.E. Hall), pp. 371-384. Iowa State University Press, Ames Iowa. |
|
Macdonald, P.D.M. & Green, P.E.J. (1988) User’s Guide to Program MIX: An interactive Program for Fitting Mixtures of Distributions. Ichthus Data Systems, Hamilton, Ontario. |
|
Macdonald, P.D.M. & Pitcher, T.J. (1979) Age-groups from size-frequency data: a versatile and efficient method of analysing distribution mixtures. Journal of the Fisheries Research Board of Canada, 36, 987-1001. |