[BioC] how to go from a short read alignment file to a SNPs table for population genetic analysis

Vincent Carey stvjc at channing.harvard.edu
Mon Dec 6 18:40:27 CET 2010

On Mon, Dec 6, 2010 at 11:28 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Mon, Dec 6, 2010 at 9:54 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
>> Dear Bioconductor listers,
>> I am new to genomics and bioinformatics. In my current study, we have
>> sequenced the genomes of tens of accessions of a plant, using Illumina
>> next generation sequencer. The short reads of a specific accession
>> have been aligned to the reference. The SNPs and short indels have been
>> predicted for each accession's genome relative to the reference. We got the
>> data sets for SNPs like the following format (in text file, the column
>> names were listed, the accession name will not change for a specific
>> accession):
>> <accession name><chromosome><position><reference base><cons base><quality><support><concordance><avg_hits>
>> But usually, we need to align all the accessions in the following
>> format for classical population genetic analysis:
>> <accessions><SNP_1><SNP_2><SNP_3><SNP_...>
>> accession_1, a,t,g,,,
>> accession_2, a,t,c,,,
>> accession_3, t,a,c,,,
>> accession_,,,,,,,,,,,,,
>> I would appreciate help or suggestions on how to do this format conversion,
>> or on any alternative approaches, using R and Bioconductor. If it needs
>> database operations, how should I do that?
>> Thanks in advance.
> Hi, Jianfeng.  You might save yourself some trouble by using a format such
> as VCF, something that is approaching a standard for reporting and
> databasing variants.  If you write a script to convert your variant format
> to a VCF, then combining them can be done with vcftools or potentially other
> tools dealing with VCF.
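
As a rough illustration of the conversion script suggested above, here is a minimal Python sketch that renders one accession's SNP calls as VCF text lines. The function name to_vcf_lines and the tuple layout of the input records are hypothetical, chosen to mirror the columns in the original question. It assumes every call is homozygous for the consensus base (so the genotype is written as 1/1) and that the consensus column holds a plain base rather than an ambiguity code; neither assumption is confirmed in the thread.

```python
def to_vcf_lines(accession, records):
    """Render one accession's SNP calls as minimal VCF text lines.

    records: iterable of (chromosome, position, ref_base, cons_base, quality),
    a hypothetical layout taken from the columns listed in the question.
    """
    lines = [
        "##fileformat=VCFv4.0",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                   "FILTER", "INFO", "FORMAT", accession]),
    ]
    for chrom, pos, ref, cons, qual in records:
        # Assumption: each call is homozygous for the consensus base ("1/1").
        fields = [chrom, str(pos), ".", ref.upper(), cons.upper(),
                  str(qual), ".", ".", "GT", "1/1"]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    for line in to_vcf_lines("accession_1", [("chr1", 100, "a", "t", 30)]):
        print(line)
```

The records are written in the order given, so they would need to be sorted by chromosome and position beforehand if a downstream tool expects sorted input.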

I will add here that there is very rudimentary code for transforming VCF to
SnpMatrix instances in the devel branch of GGtools, in a function called vcf2sm.

The intention is to speed the path from variant representations for
multiple subjects, as given in the 1000 Genomes files, to structures
analyzable with the snpMatrix2 facilities.  However, the specific
implementation in vcf2sm requires that system("tabix") works.
Rsamtools facilities for working with bcf are also relevant but have
not been connected to the SnpMatrix representation yet.
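
Independently of VCF and vcf2sm, the reshaping asked for in the original question (one row per accession, one column per SNP site) can be sketched in a few lines. The Python below is only an illustration, not the GGtools approach: the function name build_snp_table, its parameters, and the record layout are all hypothetical. One caveat it makes explicit: a site that is absent from an accession's SNP list may mean either "same as reference" or "no coverage", and the per-accession tables alone cannot tell these apart, so the sketch marks such cells as missing unless told to assume the reference base.

```python
from collections import defaultdict


def build_snp_table(records, assume_reference=False, missing="N"):
    """Pivot per-accession SNP calls into an accession-by-site table.

    records: iterable of (accession, chromosome, position, ref_base, cons_base),
    a hypothetical layout taken from the columns listed in the question.
    Returns (sites, table): sites is a sorted list of (chromosome, position),
    table maps each accession to its list of bases, one per site.
    """
    ref_at = {}                # (chromosome, position) -> reference base
    calls = defaultdict(dict)  # accession -> {(chromosome, position): base}
    for accession, chrom, pos, ref, cons in records:
        site = (chrom, int(pos))
        ref_at[site] = ref
        calls[accession][site] = cons

    # Chromosome names sort as strings, positions as integers.
    sites = sorted(ref_at)
    table = {}
    for accession in sorted(calls):
        row = []
        for site in sites:
            # Assumption: with assume_reference=True, an absent call is taken
            # to mean the accession carries the reference base at that site.
            fallback = ref_at[site] if assume_reference else missing
            row.append(calls[accession].get(site, fallback))
        table[accession] = row
    return sites, table


if __name__ == "__main__":
    demo = [
        ("accession_1", "chr1", "100", "A", "T"),
        ("accession_1", "chr1", "250", "G", "C"),
        ("accession_2", "chr1", "250", "G", "C"),
    ]
    sites, table = build_snp_table(demo)
    for accession, row in table.items():
        print(accession, ",".join(row))
```

Treating absent calls as missing is the conservative default here; whether it is safe to fill in the reference base depends on coverage information that the SNP tables described in the question do not carry.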

> Sean