This has been reposted from our lab tumblelog

This workflow outlines analysis of data from mapping bisulfite treated DNA (BS-seq) to the oyster genome.

Initially sequencing reads, in this case from oyster sperm, are mapped to the genome using BS-MAP.

./bsmap -a /Volumes/betty/filtered_174gm_A_NoIndex_L006_R1.fastq.gz -b /Volumes/betty/filtered_174gm_A_NoIndex_L006_R2.fastq.gz -d /Volumes/betty/oyster_v9.fa -o /Volumes/betty/BiGO_Betty_plain.sam -p 4

Total number of aligned reads:
pairs: 90102067 (53%)
single a: 17033002 (9.9%)
single b: 15975991 (9.3%)

Resulting SAM file then subjected to methratio python script

python -d /Volumes/betty/oyster_v9.fa -u -z -g -o  /Volumes/betty/BiGO_betty_plain_methratio_v1.txt -s /Volumes/Bay3/Software/BSMAP/bsmap-2.73/samtools /Volumes/betty/BiGO_Betty_plain.sam

total 167774088 valid mappings, 127192776 covered cytosines, average coverage: 13.23 fold.

This file was uploaded to SQLShare using python script

python c8APIKEYAPIKEYAPIKEY5c15c /Volumes/web/cnidarian/BiGO_betty_plain_methratio_v1.txt


Interested in is how many CpG loci is there information for?

The following query will get all lines where CG are in the 3rd and 4th position.

SELECT * FROM [].[BiGO_betty_plain_methratio_v1.txt] betty where context like '__CG_' --_=single character wildcard

To count the number of instances in this query

How to convert methratio file to gff format

In this case will make GFF only of CpGs, where there is at least 10x coverage.

chr as seqname,  
'methratio' as source,  
'CpG' as feature, 
pos as start,
pos + 1 as [end],
ratio as score,  
'.' as frame,  
'.' as attribute  

FROM [].     
[BiGO_betty_plain_methratio_v1.txt] betty 
context like '__CG_' --_=single character wildcard
CT_Count > 9`

via Tidal Cycles –

via the Lab Tumblr:

Workflow: Analysis of Oyster DNA Methylation Data