I learned how to use RMarkdown, and played around with different microsatellite data analysis programs/metrics etc. in R.

Check out the below script, also available in my Oly_genetics repo

I also used the knitr option to create an HTML file from the RMarkdown file, which knits together all code, output, and plots. It looks great! However, how do I simply post this to my lab notebook?

title: “Oly-Genetics-NF-Testing”
author: “Laura H Spencer”
date: “January 17, 2018”
output: html_document

“`{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)<br />

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
## Oly Genetics Analysis, playing around with data

Today I played with the 2016/2017 Oly microsatellite data in R, generating some summary statistics by following a few online tutorials. What I've learned: * The spreadhseet "Olympic Oyster NFH_NFW (1).xlsx" from Crystal houses data from NF population (Fidalgo Bay) from two populations: 2016 hatchery-produced (F1), and 2017 wild. Crystal already determined bins, rounded the microsatellite data to produce: 100 samples with diploid genotype data across 6 loci (2 alleles per loci).
* microsatellite data is often in a particular spreadsheet format, whith metadata in the first 2 columns, and summary data in the first 2 rows. I exported the rounded data from Crystal's spreadsheet, modified to conform with the required format, and saved as 3 separate files:
1. NFH-2016 alone
2. NFW-2017 alone
3. NFH-2016 & NFW-2017 merged
* NOTE: There are additional microsatellite data from 2015, available online from Rick/Andy: http://ift.tt/2mSobCI

My github repo: http://ift.tt/2mSQTn7

“`{r, warning=FALSE, include=TRUE}
# Import 2016/2017 microsatellite data
# Used the following reference: http://ift.tt/2BbVvJt

NFH.2016 <- read.genalex(“Data/Oly2016NFH_Rounded.csv”, ploidy=2)
NFW.2017 <- read.genalex(“Data/Oly2017NFW_Rounded.csv”, ploidy=2)
NF <- read.genalex(“Data/Oly2016NFH+2017NFW_Merged.csv”, ploidy=2)

summary(NFW.2017) #summary of wild samples
summary(NF) #summary of hatchery and wild combined 
info_table(NF, type="missing", plot=TRUE) #see how the missing data is distributed over the 2 populations
mlg.table(NF) #genotype eveness. Result is N=199; MLG=199

Generate more summary statistics using poppr function

Abbreviations used in t Statistic

  • Pop: Population name.
  • N: Number of individuals observed.
  • MLG: Number of multilocus genotypes (MLG) observed.
  • eMLG: The number of expected MLG at the smallest sample size ≥ 10 based on rarefaction
  • SE: Standard error based on eMLG.
  • H: Shannon-Wiener Index of MLG diversity (Shannon, 2001).
  • G: Stoddart and Taylor’s Index of MLG diversity (Stoddart & Taylor, 1988).
  • lambda: Simpson’s Index (Simpson, 1949). 0 = no genotypes are differet; 1 = all genotypes are different
  • E.5: Evenness, E5E5 (Pielou, 1975; Ludwig & Reynolds, 1988; Grünwald et al., 2003).
  • Hexp: Nei’s unbiased gene diversity (Nei, 1978).
  • Ia: The index of association, IAIA (Brown, Feldman & Nevo, 1980; Smith et al., 1993).
  • rbarD: The standardized index of association, r¯dr¯d [@].
NF.pop <- poppr(NF) #summary stats on each population
(NF.pop$N / (NF.pop$N - 1)) * NF.pop$lambda #corrected simpson's index (N/(N-1)) #all different genotypes

Are our populations in Hardy-Weinberg equilibrium?

Hardy‐Weinberg Assumptions include:
• infinite population
• discrete generations
• random mating
• no selection
• no migration in or out of population
• no mutation
• equal initial genotype frequencies in the two sexes
Equilibrium is reached after one generation of mating under the Hardy‐Weinberg assumptions…Genotype frequencies remain the same from generation to generation.

NF.HW <- seppop(NF) %>% lapply(hw.test, B=1000) #all P-values >0.05; reject the null that these populations are under HWE

Hardy-Weinberg test results: reject the null that these populations are under HWE for all 6 loci. Here is a table with p-values for all loci, for each population:

NF.HW.P <- sapply(NF.HW, "[", i=TRUE, j=3) #pvalues of HW chi-squared test for all loci, both pops combined into a dataframe

Are populations in linkage disequilibrium?

Test the null hypothesis that alleles observed at different loci are not linked. This is the case if populations are sexual while alleles recombine freely into new genotypes during the process of sexual reproduction:
IA =VO/VE -1
… where V0 is the observed variance of K and VE is the expected variance of K, where K is the number of loci at which two individuals differ.

NF.ia.H <- ia(popsub(NF, "NFH-2016"), sample=999)
NF.ia.W <- ia(popsub(NF, "NFW-2017"), sample=999)

Since P < 0.001, we find significant support for the hypothesis that alleles are linked across loci.

Let’s try to figure out which alleles are linked via pairwise assessment:

NF.W2017.pair <- pair.ia(popsub(NF, "NFW-2017"), quiet=F, plot=F)
NF.H2016.pair <- pair.ia(popsub(NF, "NFH-2016"), quiet=F, plot=F)
pair.range <- range(c(NF.W2017.pair, NF.H2016.pair), na.rm=TRUE)

Check out the pair.ia results for each population. From the below plots, it looks like loci 13, 15 & 19 are possibly linked

plot(NF.W2017.pair, limits=pair.range, main="NFW-2017 Index of Association Pair Comparison")

plot(NF.H2016.pair, limits=pair.range, main="NFH-2016 Index of Association Pair Comparison")

Review frequencies of each allele and plot frequency of hatchery vs. wild:

NF.freq <- rraf(NF, by_pop=TRUE)
NF.freq.t <- t(NF.freq)

Testing out another tutorial & R package to generate statistics

Source: http://ift.tt/2mQXkH8


First generate stats on NFW-2016 population & plot Observed vs. Expected Heterozygosity

NF.summary <- summary(NF)
NFW.2017.summary <- summary(popsub(NF, "NFW-2017"))
plot(NFW.2017.summary$Hobs, xlab="Loci number", ylab="Observed Heterozygosity", 
     main="Observed heterozygosity per locus, NFW 2017")
plot(NFW.2017.summary$Hobs, NFW.2017.summary$Hexp, xlab="Observed Heterozygosity", ylab="Expected Heterozygosity", 
     main="Expected ~ Observed Heterozygosity per locus, NFW 2017")
bartlett.test(list(NFW.2017.summary$Hexp, NFW.2017.summary$Hobs)) #indicates no difference between mean observed and expected heterozygosity 

Run same stats on the NFH-2016 population

NFH.2016.summary <- summary(popsub(NF, "NFH-2016"))
plot(NFH.2016.summary$Hobs, xlab="Loci number", ylab="Observed Heterozygosity", 
     main="Observed heterozygosity per locus, NFH 2016")
plot(NFH.2016.summary$Hobs, NFH.2016.summary$Hexp, xlab="Observed Heterozygosity", ylab="Expected Heterozygosity", 
     main="Expected ~ Observed Heterozygosity per locus, NFH 2016")
bartlett.test(list(NFH.2016.summary$Hexp, NFH.2016.summary$Hobs)) #indicates no difference between mean observed and expected heterozygosity 

Calculate and plot allelic frequency

Modified from the following script: http://ift.tt/2Bd2Xnj

# I need the data as a normal dataframe, so I re-read the data into R as .csv
NF.csv <- read.csv(file="Data/Oly2016NFH+2017NFW_Merged.csv", stringsAsFactors = F)
names(NF.csv) <- NF.csv[2,]
NF.csv <- NF.csv[-1:-2,]
NFW.df <- subset(NF.csv, NF.csv$Population == "NFW-2017") #subset the wild pop
NFH.df <- subset(NF.csv, NF.csv$Population == "NFH-2016") #subset the hatchery pop
NFW.df.1 <- NFW.df[,-1:-2] #remove metadata 
NFH.df.1 <- NFH.df[,-1:-2] #remove metadata

Write a function to calculate allelic frequency and produce a dataframe with results

allelic.freq <- function(df) {
  L=ncol(df)   #how many columns are there?
  locus_positions=(2*(unique(round((1:(L-2))/2)))+1)   #find the starting column number for each locus
  lnames=colnames(df)                          #locus names, from the header
  OUT=list()        #create a null dataset to append allele freqs to
  for (x in locus_positions) {                       #begin for loop, to calculate frequencies for each locus
    alleles=c(df[,x],df[,x+1])        #For example, combine columns 1 and 2 for locus 1 (two columns because they are diploid)
    alleles2=as.data.frame(table(alleles))             #count each allele at locus x
    missing=alleles2[which(alleles2[,1]==0),2]          #count missing data at locus x, entered as '0' in this dataset (not used further for simplicity)
    alleles3=alleles2[which(!alleles2[,1]==0),]          #remove missing data (otherwise 0 would be counted in total number of alleles)
    alleles4=cbind(alleles3,alleles3[,2]/sum(alleles3[,2])) #calculate frequencies
    output=cbind(x,lnames[x],alleles4)                        #combine x, locusname, and frequencies
    OUT[[x]] <- output
  OUT.1 <- do.call(rbind, OUT)
  colnames(OUT.1) <- c("Number","Locus","allele","count","frequency") #add column headers

Run the allelic.freq function on thie hatchery and wild populations seperately, then save to txt files

NFH.afreq <- allelic.freq(NFW.df.1)[,-1]
NFW.afreq <- allelic.freq(NFH.df.1)[,-1]

Write a function to generate allelic frequency plots

Freq.Plots <- function(wanted_locus, frequency_table, population) {
  plot(as.numeric(as.character(Locus[,2])),as.numeric(as.character(Locus[,4])),xlab="Allele",ylab="Frequency",main=paste(population, "_", "Locus_",Locus[1,1]),pch=21,bg="blue",cex=1.5)
  plot(1:length(Locus[,2]),sort(as.numeric(as.character(Locus[,4])),decreasing=TRUE),xlab="Allele (orderd by frequency)",ylab="Frequency",main=paste(population, "_", "Locus_",Locus[1,1], sep=""),pch=21,bg="blue",cex=1.5)

Generate allelic frequency plots for each loci, for each population:

# Generate plots for NFH-2016 population
loci <- unique(NFH.afreq$Locus)
for (i in 1:length(loci)) {
  path <- file.path(paste("Analyses/", "FreqPlots_NFH-2016_", loci[i], ".pdf", sep = ""))
  Freq.Plots(loci[i], NFH.afreq, "NFH-2016")
# Here are example plots: 
Freq.Plots("Olur19", NFH.afreq, "NFH-2016")
# Generate plots for NFW-2017 population
loci <- unique(NFW.afreq$Locus)
for (i in 1:length(loci)) {
  path <- file.path(paste("Analyses/", "FreqPlots_NFW-2017_", loci[i], ".pdf", sep = ""))
  Freq.Plots(loci[i], NFW.afreq, "NFW-2017")
# Here are example plots: 
Freq.Plots("Olur19", NFW.afreq, "NFW-2017")

from LabNotebook http://ift.tt/2mQRTrZ

Oly Genetics Analysis, playing around with data
Tagged on: