[Bioc-sig-seq] aggregating a RangedData over an IRanges object

Wed Jul 15 23:14:41 CEST 2009

Hi Patrick

Patrick Aboyoun wrote:
> I've made further changes to the IRanges package to make it easier to
> perform the coverage operations across chromosomes/contigs. I've added
> RleViewsList (virtual) and SimpleRleViewsList (non-virtual) classes to
> IRanges as well as a RleViewsList constructor and viewSums and other
> view* summary methods for RleViewList objects. The code below looks much
> cleaner now due to the RleViewsList constructor and the viewSums method.
> Also, I point out the viewApply function, which serves in the capacity
> of the aggregate function that you were experimenting with in your
> e-mails. Do these steps fit into your workflow?

Given this, it seems to me now that viewApply and coverage have very
similar use cases. Do we need both?

I am still quite worried that nobody outside the Hutch will be able to
figure out which class to use when. There are simply so many classes in
IRanges. Do you have a class hierarchy diagram?

A justification for all the RangedData-style classes is, of course, that
one usually deals with multiple chromosomes. So, I went through the
example of my last mail again, now with two chromosomes.

Last time, we had this here, a RangedData with a single, unnamed
sequence, and one column, 'score':

> v3 <- RangedData(
+    IRanges( start = c( 1, 4, 6 ), width = c( 3, 2, 4 ) ),
+    score = c( 2, 7, 3 ) )
> v3
RangedData: 3 ranges by 1 column on 1 sequence
colnames(1): score

Now, I attempt to make a RangedData with two sequences, 'chrA' and 'chrB':

> v3b <- RangedData(
+    RangesList(
+       chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
+       chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
+    score = c( 2, 7, 3, 1, 1, 1 ) )
> v3b
RangedData: 6 ranges by 0 columns on 2 sequences
colnames(0):
names(2): chrA chrB

As you can see, the 'score' argument was ignored. I've tried a couple of
other syntaxes, but could not get it in. What's the trick?

To proceed, I add my scores manually:

> v3b$score <- c( 2, 7, 3, 1, 1, 1 )

I still feel a bit uneasy about the fact that the split between the
sequences is not apparent here, but maybe it is simplest like this.

I now switch over to coverage vectors:

> v2b <- coverage( v3b, weight="score" )

This has now nicely assembled everything over both chromosomes:

> as.vector( v2b$chrA )
[1] 2 2 2 7 7 3 3 3 3
> as.vector( v2b$chrB )
[1] 1 1 2 1 1 1 1 1 1

Let's get some exons, two on chrA and one on chrB:

> exons <- RangesList(
+    chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
+    chrB = IRanges( start = 3, width = 5 ) )

Now, 'aggregate' fails, with a rather misleading error message:

> aggregate( v2b, exons, sum )
Error in solveWindowSEW(length(x), start = ifelse(is.null(start), NA,  :
  Invalid sequence coordinates.
  Please make sure the supplied 'start', 'end' and 'width' arguments
  are defining a region that is within the limits of the sequence.

So, I'll try with views:

> v2exons <- RleViewsList( rleList=v2b, rangesList=exons )

> viewSums( v2exons )
SimpleIntegerList: 2 elements

> viewSums( v2exons )[[1]]
[1]  4 14
> viewSums( v2exons )[[2]]
[1] 6

This worked, but we lost the sequence names:

> viewSums( v2exons )[["chrA"]]
NULL
> viewSums( v2exons )$chrA
NULL

I suppose that we again rely here on the sequences being in the same
order in v2b and exons, i.e., if one has 'chrA' and 'chrB', and the
other, say, 'chrB' and 'chrC', this would not get spotted, or would it?

Sorry for the length nit-picking mails but this gets all a bit complex...

Cheers
  Simon