[BioC] comparing multiple individual genomes in bioc?
Martin Morgan
mtmorgan at fhcrc.org
Wed Jan 28 00:25:07 CET 2009
Kasper Daniel Hansen wrote:
> This is a rather difficult question to answer because no-one has really
> done what you are proposing to do. It is true that many people hope to
> be able to do something like this, but it is unclear how much data we
> are talking about, what form the data is in and finally what kind of
> stuff we want to do with the data. Without a more clear specification it
> is pretty hard to answer this.
>
> There is for example a difference in whether the data is just a list of
> SNPs, a collection of genomes in FASTA format or a (big) collection of
> short reads that needs to be assembled.
>
> Given that no-one has done this, it is clear that the first attempts
> will involve a lot of custom code, so don't expect any off-the-shelf
> method for any suite of software. I will say that Bioconductor is a
> priori not more nor less suitable for this than any other piece of
> software.
Hmm. Some of the Bioc infrastructure might make for great building
blocks. The BSgenome package has facilities for dealing with
genome-scale data, especially reference genomes. The Biostrings tools
for custom and comparatively fast pattern matching seem well suited to
exploratory analysis. The conceptual foundation of the IRanges and Rle
classes seems well suited to efficient representation of genome-scale
'features of interest', coupled with flexibility for investigating
genome-scale questions. For instance, if SNPs were represented as Rle
objects (run-length encodings), it would be straightforward to
summarize site-specific SNP abundance (just add the Rle objects using
'+') in memory-efficient ways. Likewise, the overlap function of
IRanges might provide a very useful tool for rapidly filtering
per-genome annotations to identify features that are shared across
samples. Plus the usual arguments for R, viz., established statistical
and visualization tools and a ready interface to databases, web
resources, etc.
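
To make the run-length and overlap ideas concrete, here is a minimal
sketch in plain Python rather than R. It does not use the IRanges or
Rle API; the helpers rle_encode, rle_add and overlapping, and the toy
ten-position "genome", are hypothetical names and data made up for
illustration only.

```python
from itertools import groupby


def rle_encode(values):
    """Encode a sequence as a list of (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]


def rle_add(a, b):
    """Add two run-length encodings position by position.

    The inputs are never expanded to full length, which is where the
    memory saving comes from. Runs are assumed to have positive length.
    """
    if sum(n for _, n in a) != sum(n for _, n in b):
        raise ValueError("encodings must cover the same number of positions")
    out = []
    ia, ib = iter(a), iter(b)
    va, na = next(ia, (0, 0))
    vb, nb = next(ib, (0, 0))
    while na > 0 and nb > 0:
        step = min(na, nb)          # stretch over which both runs are constant
        total = va + vb
        if out and out[-1][0] == total:
            out[-1] = (total, out[-1][1] + step)   # merge with previous run
        else:
            out.append((total, step))
        na -= step
        nb -= step
        if na == 0:
            va, na = next(ia, (0, 0))
        if nb == 0:
            vb, nb = next(ib, (0, 0))
    return out


def overlapping(query, subject):
    """Return query intervals (start, end; inclusive) that overlap any
    subject interval. Quadratic for clarity; not how a real
    interval-overlap implementation would do it."""
    return [q for q in query
            if any(q[0] <= s[1] and s[0] <= q[1] for s in subject)]


# Toy data: 1 marks a SNP at that position, 0 marks no SNP.
sample1 = rle_encode([0, 0, 1, 0, 0, 1, 0, 0, 0, 0])
sample2 = rle_encode([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])
snp_abundance = rle_add(sample1, sample2)

# Toy per-genome annotations as (start, end) pairs.
shared = overlapping([(1, 5), (10, 20), (30, 40)], [(4, 8), (41, 50)])
```

Long stretches without SNPs collapse into single runs, so the summed
encoding stays small even when the underlying sequence is genome-sized.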
Agreed, though, that the question is open-ended and therefore hard to
answer. It would be really exciting to hear use cases sketched out, here
or on the bioc-sig-sequencing mailing list.
Martin
> Kasper
>
> On Jan 27, 2009, at 9:21 , Paul Shannon wrote:
>
>> I am becoming acquainted with the bioc packages helpful in DNA
>> sequence analysis. There is lots of nice stuff.
>>
>> We (like the rest of the world...) hope to soon have many individual
>> human genomes. We wish to compare them, looking for fine-grained
>> variations in intragenic and extragenic regions, for clues to
>> phenotypic variety.
>>
>> Is there support for this kind of analysis in bioc? If not, is this
>> planned or hoped for?
>>
>> Thanks -
>>
>> - Paul Shannon
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793