[Bioc-sig-seq] aggregating a RangedData over an IRanges object

Thu Jul 16 20:40:55 CEST 2009

Simon,
I have enhanced the aggregate and seqextract functions to accept 
RangesList object for subscript selection within AtomicList objects, 
which include RleList, IntegerList, NumericList, etc. On performance 
grounds, I still recommend using the viewSums approach for calculating 
AUC because it performs its looping at the C level, rather than the R 
level as aggregate does.

Keep the requests coming. Much of the IRanges package has been developed 
for general usage (I believe it could be a major player in time series 
analysis as well if developers that that community were interested in 
leveraging its infrastructure) and, as this week has shown, there is 
still work to do to support biological sequence analysis. Right now we 
are working on the concepts of Views and RangedData. In the future, I 
have a feeling we will need to beef up the representation of alignments, 
multiple alignments, motifs (in an abstract model sense), and other 
areas so we can have interchangeable special-purpose Bioconductor 
packages that work coherently to form a whole.

 > suppressMessages(library(IRanges))
 > track <- RangedData(
+      RangesList(
+         chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
+         chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4))),
+      score = c( 2, 7, 3, 1, 1, 1 ) )
 > trackCoverage <- coverage(track, weight="score")
 > exons <- RangesList(
+    chrA = IRanges(start = c(2, 4), width = c(2, 2)),
+    chrB = IRanges(start = 3, width = 5))
 > aggregate(trackCoverage, exons, sum)
SimpleList: 2 elements
names(2): chrA chrB
 > aggregate(trackCoverage, exons, sum)[["chrA"]]
[1]  4 14
 > exonView <- RleViewsList(rleList=trackCoverage, rangesList=exons)
 > viewSums(exonView)
SimpleIntegerList: 2 elements
names(2): chrA chrB
 > identical(aggregate(trackCoverage, exons, sum)[["chrA"]],
+                 viewSums(exonView)[["chrA"]])
[1] TRUE
 > sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] IRanges_1.3.37

Patrick Aboyoun wrote:
> Simon,
> I have checked in changes to IRanges (now version 1.3.36), BioC 2.5 
> (devel), that improve the RangedData constructor to accept a 
> RangesList object for its ranges object as well as added passed the 
> element names when performing view* operations. I'm going to consider 
> the aggregate use case some more and get back to you shortly on that 
> one. Here is a revised example:
>
> > suppressMessages(library(IRanges))
> > track <- RangedData(
> +     RangesList(
> +        chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
> +        chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
> +     score = c( 2, 7, 3, 1, 1, 1 ) )
> > track
> RangedData: 6 ranges by 1 column on 2 sequences
> colnames(1): score
> names(2): chrA chrB
> > trackCoverage <- coverage(track, weight="score" )
> > trackCoverage
> SimpleRleList: 2 elements
> names(2): chrA chrB
> > exons <- RangesList(
> +   chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
> +   chrB = IRanges( start = 3, width = 5 ) )
> > exons
> SimpleRangesList: 2 elements
> names(2): chrA chrB
> > exonView <- RleViewsList( rleList=trackCoverage, rangesList=exons )
> > exonView
> SimpleRleViewsList: 2 elements
> names(2): chrA chrB
> > viewSums(exonView)
> SimpleIntegerList: 2 elements
> names(2): chrA chrB
> > viewSums(exonView)[["chrA"]]
> [1]  4 14
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
> i386-apple-darwin9.7.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base   
> other attached packages:
> [1] IRanges_1.3.36
>
>
>
> Patrick Aboyoun wrote:
>> Simon,
>> Answering your issues in turn:
>>
>> Issue 1) Given this, it seems to me now that viewApply and coverage 
>> have very similar use cases. Do we need both?
>>
>> Reponse 1) Yes. The viewApply function is a general purpose tool for 
>> implementing a function over a set of Views. The coverage function 
>> performs a (weighted) depth count of interval/ranges over a chosen 
>> domain. The viewSums, viewMaxs, etc. are specializations of the 
>> viewApply function that are built for performance reasons, much like 
>> the colSums, etc. functions are for matrices.
>>
>>
>> Issue 2) I am still quite worried that nobody outside the Hutch will 
>> be able to figure out which class to use when. There are simply so 
>> many classes in IRanges. Do you have a class hierarchy diagram?
>>
>> Response 2) As a first step Michael and I have fleshed out the 
>> IRanges vignette
>>
>> http://bioconductor.org/packages/2.5/bioc/vignettes/IRanges/inst/doc/IRangesOverview.pdf 
>>
>>
>> to explain more about the IRanges make up. The next step in the 
>> documentation is to map the entities in biological sequence analysis 
>> to IRanges classes. The chipseq package used an amorphous GenomeData 
>> class for this purpose, but that is just a glorified list that in one 
>> context can store coverage information and in another context can 
>> store exon boundaries. I have been trying to revise the chipseq 
>> package to be more specific on what data type should represent what 
>> sequence analysis entity. So for example, an RleList object is a 
>> natural representation of mapped read coverage information across 
>> chromosomes/contigs. If you want to associate exon boundaries to 
>> coverage data, then RleViewsList objects serve as a natural 
>> representation. Once we list out all of the sequence analysis 
>> entities, we can create a two column table mapping the entity to the 
>> Bioconductor class. We can then create something akin to a UML 
>> sequence diagram to show how you get from one object type to another 
>> in a typical sequence analysis workflow.
>>
>>
>> Issue 3) The RangedData constructor isn't intuitive. The following 
>> command
>>
>> v3b <- RangedData(
>>     RangesList(
>>     chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
>>    chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
>>    score = c( 2, 7, 3, 1, 1, 1 ) )
>>
>> doesn't work as expected.
>>
>> Response 3) The constructor wasn't general enough. I will fix this 
>> shortly.
>>
>>
>> Issue 4) The aggregate method used for *List objects aren't 
>> vectorized over list elements and the error message isn't useful.
>>
>> Reponse 4) I need to think about this one for a bit because one could 
>> feasibly wish to aggregate across list elements and not within them. 
>> Perhaps I could check the nature of the by argument and if it is a 
>> "list", then aggregate within elements. Otherwise, aggregate between 
>> elements as it currently does.
>>
>>
>> Issue 5) The view* functions don't keep element names from original 
>> object.
>>
>> Reponse 5) This was an oversight on my part and I will correct this 
>> shortly.
>>
>>
>> I'll send out an e-mail to the group once I check in these changes.
>>
>>
>>
>> Patrick
>>
>>
>>
>>
>> Simon Anders wrote:
>>> Hi Patrick
>>>
>>> Patrick Aboyoun wrote:
>>>  
>>>> I've made further changes to the IRanges package to make it easier to
>>>> perform the coverage operations across chromosomes/contigs. I've added
>>>> RleViewsList (virtual) and SimpleRleViewsList (non-virtual) classes to
>>>> IRanges as well as a RleViewsList constructor and viewSums and other
>>>> view* summary methods for RleViewList objects. The code below looks 
>>>> much
>>>> cleaner now due to the RleViewsList constructor and the viewSums 
>>>> method.
>>>> Also, I point out the viewApply function, which serves in the capacity
>>>> of the aggregate function that you were experimenting with in your
>>>> e-mails. Do these steps fit into your workflow?
>>>>     
>>>
>>> Given this, it seems to me now that viewApply and coverage have very
>>> similar use cases. Do we need both?
>>>
>>> I am still quite worried that nobody outside the Hutch will be able to
>>> figure out which class to use when. There are simply so many classes in
>>> IRanges. Do you have a class hierarchy diagram?
>>>
>>> A justification for all the RangedData-style classes is, of course, 
>>> that
>>> one usually deals with multiple chromosomes. So, I went through the
>>> example of my last mail again, now with two chromosomes.
>>>
>>> Last time, we had this here, a RangedData with a single, unnamed
>>> sequence, and one column, 'score':
>>>
>>>  
>>>> v3 <- RangedData(
>>>>     
>>> +    IRanges( start = c( 1, 4, 6 ), width = c( 3, 2, 4 ) ),
>>> +    score = c( 2, 7, 3 ) )
>>>  
>>>> v3
>>>>     
>>> RangedData: 3 ranges by 1 column on 1 sequence
>>> colnames(1): score
>>>
>>> Now, I attempt to make a RangedData with two sequences, 'chrA' and 
>>> 'chrB':
>>>
>>>  
>>>> v3b <- RangedData(
>>>>     
>>> +    RangesList(
>>> +       chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
>>> +       chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
>>> +    score = c( 2, 7, 3, 1, 1, 1 ) )
>>>  
>>>> v3b
>>>>     
>>> RangedData: 6 ranges by 0 columns on 2 sequences
>>> colnames(0):
>>> names(2): chrA chrB
>>>
>>> As you can see, the 'score' argument was ignored. I've tried a 
>>> couple of
>>> other syntaxes, but could not get it in. What's the trick?
>>>
>>> To proceed, I add my scores manually:
>>>
>>>  
>>>> v3b$score <- c( 2, 7, 3, 1, 1, 1 )
>>>>     
>>>
>>> I still feel a bit uneasy about the fact that the split between the
>>> sequences is not apparent here, but maybe it is simplest like this.
>>>
>>> I now switch over to coverage vectors:
>>>
>>>  
>>>> v2b <- coverage( v3b, weight="score" )
>>>>     
>>>
>>> This has now nicely assembled everything over both chromosomes:
>>>
>>>  
>>>> as.vector( v2b$chrA )
>>>>     
>>> [1] 2 2 2 7 7 3 3 3 3
>>>  
>>>> as.vector( v2b$chrB )
>>>>     
>>> [1] 1 1 2 1 1 1 1 1 1
>>>
>>> Let's get some exons, two on chrA and one on chrB:
>>>
>>>  
>>>> exons <- RangesList(
>>>>     
>>> +    chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
>>> +    chrB = IRanges( start = 3, width = 5 ) )
>>>
>>> Now, 'aggregate' fails, with a rather misleading error message:
>>>
>>>  
>>>> aggregate( v2b, exons, sum )
>>>>     
>>> Error in solveWindowSEW(length(x), start = ifelse(is.null(start), 
>>> NA,  :
>>>   Invalid sequence coordinates.
>>>   Please make sure the supplied 'start', 'end' and 'width' arguments
>>>   are defining a region that is within the limits of the sequence.
>>>
>>> So, I'll try with views:
>>>
>>>  
>>>> v2exons <- RleViewsList( rleList=v2b, rangesList=exons )
>>>>     
>>>
>>>  
>>>> viewSums( v2exons )
>>>>     
>>> SimpleIntegerList: 2 elements
>>>
>>>  
>>>> viewSums( v2exons )[[1]]
>>>>     
>>> [1]  4 14
>>>  
>>>> viewSums( v2exons )[[2]]
>>>>     
>>> [1] 6
>>>
>>> This worked, but we lost the sequence names:
>>>
>>>  
>>>> viewSums( v2exons )[["chrA"]]
>>>>     
>>> NULL
>>>  
>>>> viewSums( v2exons )$chrA
>>>>     
>>> NULL
>>>
>>> I suppose that we again rely here on the sequences being in the same
>>> order in v2b and exons, i.e., if one has 'chrA' and 'chrB', and the
>>> other, say, 'chrB' and 'chrC', this would not get spotted, or would it?
>>>
>>>
>>> Sorry for the length nit-picking mails but this gets all a bit 
>>> complex...
>>>
>>> Cheers
>>>   Simon
>>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing