Comments from Daniel Weise on determining Pr(K)

The first unanswered question was "why isn't the Pr(K) that they showed in Figure x a posterior distribution?" The second unanswered question was "If Pr(K) isn't a posterior distribution, then what is it?"

Pr(K) is not a posterior distribution because it isn't computed from X, the observed genomes/loci. Equation (1), which gives the definition of Pr(Z, P | X), defines posterior distributions only for Z and P. Pr(K) ain't anywhere in there, so it can't be a posterior distribution. Instead, the distributions defined by Equation (1) are estimated separately for K = 1, K = 2, and so on.

So what is Pr(K), the metric for declaring the distribution of Z and P for a given K more likely than for a different K? Pr(K) is an ad hoc metric applied on top of the real posterior distributions for Z and P, looking for distributions with particular shapes. Simply put, distributions with clear peaks are deemed more probable than distributions without clear peaks. For example, if the distribution for Z and P is flat for K = 1, indicating no clear single population, but shows clear humps for K = 2, then the calculation of Pr(K) will declare K = 2 much more likely than K = 1. Conversely, if the distribution for K = 1 shows a clear peak and that for K = 2 shows none, the calculation of Pr(K) will declare K = 1 much more likely than K = 2.

Here's the mathematics behind the above paragraph. The first thing to remember is that the purpose of the random walk (the MCMC algorithm) is not to find the top of the mountain but to find the shape of the mountain: we are trying to approximate the probability of every point in the space, not simply to find the most likely point. The shape of the mountain is approximated by the number of points sampled in a given area. The more samples in an area, the more likely the hypotheses in that area, and the higher the mountain there.
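A minimal sketch of this "shape of the mountain" idea, using a random-walk Metropolis sampler on a toy one-dimensional target (this generic illustration is not the STRUCTURE algorithm itself, which samples Z and P; the point is only that sample counts per region approximate the distribution):

```python
import math
import random

def metropolis_samples(log_density, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis: the returned samples approximate the
    target distribution, so counting samples per region traces the
    whole shape of the mountain rather than just locating its peak."""
    rng = random.Random(seed)
    x = x0
    log_p = log_density(x)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p_new / p_old).
        if math.log(rng.random() + 1e-300) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples

# Toy target: a standard normal density (up to a constant).
samples = metropolis_samples(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)

# More samples in a region means higher estimated probability there.
near_mode = sum(1 for s in samples if abs(s) < 0.5)
in_tail = sum(1 for s in samples if abs(s) > 2.0)
print(near_mode > in_tail)  # → True: the mode region collects far more samples
```

The stationarity requirement mentioned below is exactly what licenses this counting step: once the chain has a stationary distribution, the fraction of samples landing in a region converges to that region's probability.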
This is the whole reason for ensuring that the Markov chain created by the random walk has a "stationary" distribution: this requirement ensures that the distribution is approximated by the elements of the chain. The elements of the chain approximate/estimate the distribution by counting up the elements that fall in a given area.

Equations (12), (13), and (14), which define Pr(X|K), look at the shape of the mountain by computing the average (mean) and variance of how well each sampled point in the distribution explains the data X (the set of genomes/loci). Equation (12) computes the average of the negative log of how well each sampled hypothesis explains the data. This means that, given a distribution D1 with a higher average than a distribution D2, D1 explains the data less well than D2, and the equations for Pr(K) will declare D1 less likely than D2. (Note that the average is not over the points in the distribution, i.e., the abscissa, but over how well each point explains the data, that is, the ordinate.)

Furthermore (now we get to Equation (14)), given two distributions with the same average (that is, distributions that explain the data equally well), one wants to prefer the distribution with sharper peaks, that is, the one with less variance. So (12) includes a term for the variance (computed by (14)): the greater the variance of a distribution, the lower the probability (12) assigns to the distribution.
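The mean-and-variance calculation can be sketched as follows. This is a hedged reconstruction, not a verified transcription of equations (12)-(14): it assumes each MCMC sample contributes D_i = -2 log Pr(X | Z_i, P_i) and that the evidence is estimated as log Pr(X|K) ≈ -mean/2 - variance/8, with Pr(K) obtained by normalizing over the candidate values of K under a uniform prior. The toy D values are invented for illustration.

```python
import math

def log_evidence_estimate(neg2_loglik_samples):
    """Estimate log Pr(X|K) from MCMC output, where each sample is
    D_i = -2 * log Pr(X | Z_i, P_i).  Both a higher mean (worse fit)
    and a higher variance (flatter, less peaked distribution) lower
    the estimate, matching the behavior described in the text."""
    n = len(neg2_loglik_samples)
    mu = sum(neg2_loglik_samples) / n
    var = sum((d - mu) ** 2 for d in neg2_loglik_samples) / n
    return -mu / 2.0 - var / 8.0

def pr_k(log_evidences):
    """Pr(K) with a uniform prior over K: normalize the evidences
    (subtracting the max first for numerical stability)."""
    m = max(log_evidences)
    weights = [math.exp(le - m) for le in log_evidences]
    total = sum(weights)
    return [w / total for w in weights]

# Toy illustration: for K=2 the sampled hypotheses explain the data
# better (lower D) and more consistently (lower variance) than for K=1,
# so the ad hoc metric declares K=2 far more likely.
d_k1 = [210.0, 230.0, 190.0, 250.0]   # flat distribution, poor fit
d_k2 = [120.0, 122.0, 118.0, 121.0]   # peaked distribution, good fit
probs = pr_k([log_evidence_estimate(d_k1), log_evidence_estimate(d_k2)])
print(probs[1] > probs[0])  # → True
```

Note how the variance term does the peak-detection work: even if the two runs had identical means, the spread of the D values alone would push Pr(K) toward the sharper distribution.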