Arnold, B., R.B. Corbett-Detig, D. Hartl, and K. Bomblies. 2013. RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Molecular Ecology 22:3179-3190.
Researchers are collecting RADseq datasets to answer an array of difficult population genetic questions in non-model organisms, including assessing population structure, estimating allele frequencies, genetic mapping, and detecting selection. Arnold et al. investigate ascertainment bias in RADseq data, a known issue that has received little explicit attention. Because RADseq data are generated from whole genomic DNA that is cut with restriction enzymes at specific recognition sites, the method is prone to missing data. For instance, more distantly related species or populations may carry substitutions in a restriction site, which eliminates that RAD locus for the affected species or population. Arnold et al. investigate this bias by simulating RADseq data under different one-population demographic scenarios. Because the data are simulated, the authors can generate two datasets, “true” and “estimated”, and compare the actual values of population genetic summary statistics to those that would be estimated in a RADseq experiment. The authors also take advantage of fully sequenced Drosophila melanogaster genomes to verify their simulation results with empirical genomic data. Arnold et al. argue that, because more distantly related lineages are more likely to have acquired substitutions at restriction sites, RADseq data will underestimate population genetic diversity (i.e. π, θ, Tajima’s D) as well as the time to the most recent common ancestor (TMRCA). Unfortunately, restricting the dataset to loci with “complete sampling” is not a viable solution, because loci with complete coverage (and no substitutions in the restriction site) are likely to represent regions of the genome with fewer substitutions and will therefore also underrepresent genetic diversity. Arnold et al. show that this bias is less of an issue for demographic histories in which genetic diversity is reduced (population bottleneck, population expansion).
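The dropout mechanism described above can be illustrated with a small simulation. The sketch below is not the authors' simulation pipeline; it is a minimal toy model under assumptions of our choosing: a standard neutral coalescent for one constant-size population, no recombination within a locus, a restriction site that is intact in the common ancestor, and any substitution in the site causing every descendant haplotype to drop out. The function names and parameter values (`simulate_locus`, `compare_diversity`, `site_theta`) are hypothetical. Time is measured in units of 2N generations, so the expected per-nucleotide π is θ times the mean pairwise coalescence time, and comparing mean pairwise times is equivalent to comparing expected π.

```python
import itertools
import math
import random


def simulate_locus(n, site_theta, rng):
    """Simulate one coalescent genealogy for n haplotypes (time in units of 2N).

    site_theta is 4*N*mu summed over all nucleotides of the restriction site
    (an assumed, illustrative parameterisation). Returns the pairwise
    coalescence times and the set of haplotypes lost to site substitutions.
    """
    lineages = [(frozenset([i]), 0.0) for i in range(n)]  # (leaves, start time)
    now = 0.0
    pair_time = {}
    dropped = set()
    while len(lineages) > 1:
        k = len(lineages)
        now += rng.expovariate(k * (k - 1) / 2)  # waiting time to next coalescence
        i, j = rng.sample(range(k), 2)           # the two lineages that merge
        a, b = lineages[i], lineages[j]
        for leaves, born in (a, b):
            # Probability of at least one substitution in the site on this branch.
            p_hit = 1.0 - math.exp(-(site_theta / 2) * (now - born))
            if rng.random() < p_hit:
                dropped |= leaves                # all descendants lose the locus
        for x in a[0]:
            for y in b[0]:
                pair_time[(min(x, y), max(x, y))] = now
        lineages = [lin for idx, lin in enumerate(lineages) if idx not in (i, j)]
        lineages.append((a[0] | b[0], now))
    return pair_time, dropped


def mean_pairwise_time(pair_time, haplotypes):
    """Mean coalescence time over all pairs of the given haplotypes."""
    pairs = list(itertools.combinations(sorted(haplotypes), 2))
    return sum(pair_time[p] for p in pairs) / len(pairs)


def compare_diversity(n_loci=2000, n=10, site_theta=0.4, seed=1):
    """Return (true, RAD-observed) mean pairwise coalescence time across loci."""
    rng = random.Random(seed)
    true_vals, obs_vals = [], []
    for _ in range(n_loci):
        pair_time, dropped = simulate_locus(n, site_theta, rng)
        true_vals.append(mean_pairwise_time(pair_time, range(n)))
        retained = [h for h in range(n) if h not in dropped]
        if len(retained) >= 2:                   # locus must still be scorable
            obs_vals.append(mean_pairwise_time(pair_time, retained))
    return sum(true_vals) / len(true_vals), sum(obs_vals) / len(obs_vals)


if __name__ == "__main__":
    true_mean, obs_mean = compare_diversity()
    print(f"true: {true_mean:.3f}  RAD-observed: {obs_mean:.3f}")
```

Under these assumptions the "true" value should sit near 1 (the expected pairwise coalescence time), while the value computed from retained haplotypes should be lower, because haplotypes that survive the restriction-site filter tend to share more recent ancestry. Setting `site_theta=0` removes dropout and makes the two values identical.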
The authors constructed an allele frequency spectrum (AFS) for loci with different levels of missing data in order to show that there are systematic biases in the genealogical histories inferred from RAD data with varying levels of missing data. The authors’ logic in conducting this analysis was lost on our reading group, and we felt the authors would have done better to use a method that actually estimates genealogies. We also felt that the recommendations for dealing with RADseq ascertainment bias would not be very tractable. Arnold et al. rightly urge caution when selecting restriction enzymes, as certain restriction site motifs are more likely to sample particular genomic regions such as introns or exons, but it is unclear how feasible it is to use Approximate Bayesian Computation to correct for RADseq ascertainment bias. Following our discussion of the paper, Prof. Joe Felsenstein provided a simple and elegant mathematical correction for a theoretical π value estimated from RADseq data that considered the number of nucleotides in the restriction site and the probability that a substitution would occur there. Perhaps theoretical population genetic models will provide simple corrections for RADseq ascertainment bias. Arnold et al. also compare “true” and “estimated” values of Fst, and find that “estimated” Fst is elevated relative to the “true” value for loci with missing data. This result is especially problematic for genome scan studies that assume that loci with outlier Fst values are undergoing positive or divergent selection. To correct for this, researchers may choose to remove loci with missing data, but this will drastically reduce the dataset, making it less likely that they will find genomic regions undergoing selection. Arnold et al. provide a timely critique of the RADseq protocol and rightly urge researchers to consider the limitations of the method when designing their experiments.
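To give a sense of what a simple theoretical correction for π might look like, here is an illustrative derivation. It is not a record of the correction Prof. Felsenstein presented, and it is not taken from Arnold et al.; it is a sketch under assumptions stated here: a sample of two haplotypes, a standard neutral coalescent with pairwise coalescence time T (in units of 2N generations) distributed as Exp(1), a per-nucleotide θ = 4Nμ, a restriction site of r nucleotides that is intact in the common ancestor, and loss of the locus whenever any substitution occurs in the site.

```latex
% Illustrative sketch only; all symbols (r, theta, T, pi_RAD) are assumptions
% introduced here, not notation from the paper or the seminar discussion.
\begin{align}
P(\text{both haplotypes retained} \mid T) &= e^{-r\theta T} \\
E[T \mid \text{both retained}]
  &= \frac{\int_0^\infty T\, e^{-(1+r\theta)T}\, dT}
          {\int_0^\infty e^{-(1+r\theta)T}\, dT}
   = \frac{1}{1+r\theta} \\
\pi_{\mathrm{RAD}} &= \theta\, E[T \mid \text{both retained}]
   = \frac{\theta}{1+r\theta}
\quad\Longrightarrow\quad
\theta = \frac{\pi_{\mathrm{RAD}}}{1 - r\,\pi_{\mathrm{RAD}}}
\end{align}
```

Under these assumptions the observed π is always below θ, and the downward bias grows with both the length of the restriction site and the true diversity. The inversion is only meaningful when r·π_RAD < 1, and larger samples or other demographic histories would need a different treatment.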
[UW Phylogenetics Seminar, 11/7/13; Matt McElroy]