[BioC] a workflow of population genomic operations/analysis using bioconductor

Vincent Carey stvjc at channing.harvard.edu
Mon Dec 6 16:28:04 CET 2010

On Mon, Dec 6, 2010 at 9:28 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
> Dear bioconductor listers,
> I have just moved from classical population genetics to genomics/population
> genomics. I need to set up my platform and skills for handling genomic
> data. I have used R for statistics for 3 years, so Bioconductor is
> preferable to me.
> In my current study, we sequenced the genomes of tens of accessions of a
> plant with an Illumina next-generation sequencer. The reads have now
> been aligned to the reference genome.
> I do not have any experience with genomic analysis. To begin with, I
> checked all the available Bioconductor packages for sequence analysis
> and read their manuals. I also surveyed the courses on the
> Bioconductor website. But I still cannot put together a full and
> effective workflow for population genomic analysis, though I
> have seen many excellent genomic tools in Bioconductor.
> I think an effective workflow for population genomic analysis on the
> R platform would be very valuable for all of us who are, or will be,
> working in genomics. Thank you in advance for your help.

There are plenty of relevant workflow components on CRAN and in
Bioconductor. My knowledge is not comprehensive, but I give some tips
below. Have you read the task view at

http://cran.r-project.org/web/views/Genetics.html   ?

> I need hints, tips, suggestions, and advice on making an explicit and
> effective workflow for me to do the following analysis by using
> bioconductor or maybe not:
> 1. mutation types. e.g. CG -> AT, CG -> TA etc. polarized with the
> relative genomes

This sounds like an analysis of pileup or mpileup results that could
be achieved through the combination of samtools and Rsamtools applied
to your Illumina output.
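
Once reference/variant base pairs have been extracted from the pileup,
the tallying step itself is small. The following is a minimal sketch in
Python, for illustration only; it does not use samtools or Rsamtools.
The function name mutation_class and the label convention (each class
named from the C or T member of the original base pair, so that G->T is
counted with C->A) are assumptions, not something taken from those
tools.

```python
from collections import Counter

# Watson-Crick complements, used to collapse the two strands.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_class(ref, alt):
    """Return a label such as 'CG->AT' for a single-base substitution.

    ref is the reference (ancestral) base and alt the derived base.
    Changes observed at an A or G are re-expressed on the opposite
    strand, which leaves six classes in total.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}{COMPLEMENT[ref]}->{alt}{COMPLEMENT[alt]}"


# Hypothetical (reference, variant) calls, e.g. parsed from pileup output.
calls = [("C", "A"), ("G", "T"), ("C", "T"), ("A", "G")]
counts = Counter(mutation_class(r, a) for r, a in calls)
```

Here G->T is counted together with C->A as CG->AT. Polarizing against a
related genome would mean taking ref from that genome rather than from
the alignment reference.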

> 2. Polymorphism along chromosomes (or scaffold)

Visualization of polymorphism events along chromosomes can be
accomplished using rtracklayer, but you have to assemble the data.
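
One way to assemble such data is to count polymorphic sites in
fixed-width windows and write the counts as bedGraph lines, a plain
text track format. A minimal Python sketch, for illustration only: the
function names, the window size and the example positions are all
hypothetical, and input positions are assumed to be 1-based.

```python
from collections import Counter


def window_counts(positions, window=1000):
    """Count 1-based SNP positions per fixed-width window.

    Returns a sorted list of (start, end, count) in 0-based, half-open
    coordinates, the convention used by bedGraph. Windows that contain
    no SNP are omitted.
    """
    counts = Counter((pos - 1) // window for pos in positions)
    return [(i * window, (i + 1) * window, n)
            for i, n in sorted(counts.items())]


def bedgraph_lines(chrom, positions, window=1000):
    """Format the window counts as tab-separated bedGraph lines."""
    return [f"{chrom}\t{start}\t{end}\t{n}"
            for start, end, n in window_counts(positions, window)]


# Hypothetical SNP positions on one chromosome or scaffold.
snps = [12, 850, 1000, 1001, 2500]
lines = bedgraph_lines("chr1", snps)
```

The resulting lines could be written to a file and loaded as a track;
the chromosome names have to match those of the reference in use.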

> 3. Polymorphism by type (intergenic, CDS, etc.) and polymorphism by
> metabolic network

This depends upon combinations of data and annotation resources.  We
can obtain range data structures defining genomic regions as genic,
intergenic, exonic and so on using the GenomicFeatures package with
suitable reference annotation; read carefully the GenomicFeatures
vignette.  If your organism has reference sequence and annotation in
UCSC or EBI BioMart tables, you should be able to make progress
quickly.  Connecting this range-based information to your polymorphism
addresses can be accomplished with findOverlaps; connections to
networks of genes or other features require clarification of the
objective and some programming, but components of the ChIPpeakAnno package
would be relevant for relating addresses to higher-level functional
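
The idea behind the overlap step can be sketched in a few lines. This
is plain Python for illustration and is not how findOverlaps is
implemented; the function name annotate_positions, the feature labels
and the coordinates are hypothetical, and ranges are assumed to be
1-based and inclusive.

```python
def annotate_positions(positions, features):
    """Label each position with every feature range that contains it.

    features is a list of (start, end, label) tuples with 1-based,
    inclusive coordinates on a single chromosome. Positions that fall
    in no feature are labelled 'intergenic'.
    """
    hits = {}
    for pos in positions:
        labels = [label for start, end, label in features
                  if start <= pos <= end]
        hits[pos] = labels or ["intergenic"]
    return hits


# Hypothetical annotation for one gene and a handful of SNP positions.
features = [(100, 500, "exon1"), (501, 900, "intron1"), (901, 1200, "exon2")]
hits = annotate_positions([50, 100, 700, 1200, 1500], features)
```

This scans every feature for every position, which is fine for a toy
example but too slow for whole-genome data; interval-tree or sorted
range approaches are the usual remedy.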

> 4. LD and recombination

See the task view; snpMatrix2 in Bioconductor deals with LD measures.
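
For orientation, the two most common pairwise LD measures, D and r^2,
can be computed from phased haplotypes as in the sketch below. This is
illustrative Python and not the snpMatrix2 code; the function name
ld_d_r2 and the 0/1 allele coding are assumptions, and unphased
genotype calls would need phasing or a genotype-based estimator
instead.

```python
def ld_d_r2(hap_a, hap_b):
    """Return (D, r^2) for two biallelic loci.

    hap_a and hap_b hold the 0/1 allele carried at locus A and locus B
    by the same haplotypes, in the same order.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("need two non-empty haplotype lists of equal length")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # coefficient of disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d, d * d / denom


# Four hypothetical haplotypes: the loci are perfectly associated in the
# first call and independent in the second.
d_full, r2_full = ld_d_r2([1, 1, 0, 0], [1, 1, 0, 0])
d_none, r2_none = ld_d_r2([1, 1, 0, 0], [1, 0, 1, 0])
```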

> 5. drastic mutations. e.g. stop codons etc. in gene family, Gene Ontology
> 6. Population structure using STRUCTURE

There is no implementation of STRUCTURE for R that I know of, but the
clustering assignments it produces could be added to the data for
downstream analysis fairly simply.

> 7. Fst among groups
> 8. association studies

There are tools for Fst computation and various kinds of association
analysis in snpMatrix2; other relevant facilities are noted in the
task view mentioned above.
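
As a reminder of what is being estimated, Fst at a single biallelic
locus can be written as (Ht - Hs) / Ht, where Ht is the expected
heterozygosity of the pooled population and Hs the average expected
heterozygosity within groups. The Python sketch below implements only
this textbook form, with no sample-size correction; it is not the
estimator used by snpMatrix2, and the function name and arguments are
hypothetical.

```python
def fst_biallelic(freqs, sizes=None):
    """Fst = (Ht - Hs) / Ht from per-group allele frequencies.

    freqs holds the frequency of one allele in each group. sizes gives
    optional group sizes used as weights; equal weights are assumed
    when it is omitted.
    """
    if sizes is None:
        sizes = [1] * len(freqs)
    if not freqs or len(freqs) != len(sizes):
        raise ValueError("freqs and sizes must be non-empty and equal in length")
    total = sum(sizes)
    weights = [s / total for s in sizes]
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = 2 * p_bar * (1 - p_bar)             # pooled expected heterozygosity
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))
    if h_t == 0:
        raise ValueError("Fst is undefined when the locus is monomorphic overall")
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in two groups of equal size.
fst_fixed = fst_biallelic([0.0, 1.0])         # groups fixed for different alleles
fst_same = fst_biallelic([0.5, 0.5])          # identical frequencies
fst_mid = fst_biallelic([0.2, 0.8])
```

With small or unequal samples a corrected estimator is generally
preferred, which is one reason to rely on an established package for
real analyses.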

> --
> Jian-Feng, Mao
