[BioC] Granges averaging

Wed Oct 3 22:17:17 CEST 2012

Hi Yannick,

On 10/02/2012 10:07 AM, Yannick Wurm wrote:
> Hello,
>
> my Granges object "bigGranges" contains too much data. THat makes plotting extremely slow. Sometimes I want an averaged overview. I use the following code below to obtain it. Subsequently I can use ggbio for plotting.
> But I feel there must be an elegant one-line manner of doing this?

The most elegant manner would be that ggbio takes care of this for
you ;-)

The code you provide below is not self-contained (we don't know what
bigGranges and myscaffolds are). Also you don't explain what it's
supposed to do and/or what the output is supposed to look like.

Anyway, using 2 big nested "for" loops and growing the averageGranges
object at each iteration is in general very inefficient. The "R way" is
to take advantage of the fact that most arithmetic operations in R are
vectorized. Also the IRanges/GenomicRanges packages provide a lot of
arithmetic and other operations to let you work on RleViews,
RleViewsList, IRangesList, GRanges etc... objects in a vectorized
fashion.

But most importantly, the kind of average you compute seems questionable
to me. Right now, if a bin is for example the [500, 599] interval, and
if your bigGranges object has 2 ranges that fall within that bin e.g.
     start  end  myValue
       510  510      2.8
       500  549      3.8
then you will assign the non-weighted myValue average to the bin (i.e.
3.3). Is it really what you want? Or, given that the 1st range doesn't
count a lot, and the 2nd range covers half of the bin, wouldn't you
want the average to be close to 1.9?

Another (somewhat related) issue I see is that if a range in bigGranges
overlaps with 2 bins then you're counting it twice (because you're using
subsetByOverlaps()).

What about computing the weighted coverage of bigGranges (using the
myValue metadata column for the weights), and then the mean of this
coverage for a set of bins you specifiy? You would not only get rid
of the above issues, but it can also be implemented in a very
efficient way.

Here is what it would look like:

### First define the bins. The bins would typically be stored in a
### GRanges or IRangesList object. Here we choose the latter:
mybins <- lapply(seqlengths(bigGranges),
                  function(chrom_len)
                      IRanges(breakInChunks(chrom_len, 200L)))
mybins <- IRangesList(mybins)

### Then compute the coverage of your 'bigGranges', weighted by
### the values in its "myValue" metadata column:
cvg <- coverage(bigGranges, weight="myValue")

### Then put views (corresponding to the bins) on top of the
### RleList object returned by coverage(). The result is stored
### in a RleViewsList object:
viewslist <- RleViewsList(
                  lapply(names(cvg),
                      function(chrom)
                          Views(cvg[[chrom]], mybins[[chrom]])))
names(viewslist) <- names(cvg)

### Then call viewMeans() on 'viewslist':
bin_means <- viewMeans(viewslist)

### 'bin_means' is a NumericList object. It as the same "shape" as
### 'mybins' i.e. both are list-like objects of the same length and
### names, and the lengths of 'bin_means[[i]]' and 'mybins[[i]]' are
### the same for any valid i.
### Finally we construct 'averageGranges' from 'mybins' and 'bin_means':
averageGranges <- as(mybins, "GRanges")
mcols(averageGranges)$averagemyValue <- unlist(bin_means)

Hope this helps,
H.

>
> Thanks  & kind regards,
> yannick
>
>
>
> ## code is more legible on https://gist.github.com/c5a553bd012876effb48
>
>
> averageGranges  <- GRanges()
> interval_length <- 10000
>
> for (scaffold_index in 1:nrow(myscaffolds)) {
>      scaffold        <- myscaffolds$id[scaffold_index]
>      scaffold_length <- myscaffolds$to[scaffold_index]
>
>      num_intervals   <- scaffold_length / interval_length
>
>      for (interval_index in 0:(num_intervals-1) ) {
>         grangesInterval  <- GRanges(scaffold, IRanges(interval_index * interval_length,
>                                                    min( scaffold_length,
>                                                         (interval_index+1)* interval_length )))
>         subsetGranges <- subsetByOverlaps( bigGranges , grangesInterval)
>
>         grangesInterval$averagemyValue <-  mean(subsetGranges$myValue)
>         averageGranges <- append(averageGranges, grangesInterval)
>      }
> }
>
> # then I use averageGranges
>
> --------------------------------------
> Yannick Wurm http://yannick.poulet.org
> Ants, Genomes & Evolution ⋅ y.wurm at qmul.ac.uk ⋅ skype:yannickwurm ⋅ +44 207 882 3049
> 5.03A Fogg ⋅ School of Biological & Chemical Sciences ⋅ Queen Mary, University of London ⋅ Mile End Road ⋅ E1 4NS London ⋅ UK
> Easy custom BLAST interface: http://www.sequenceserver.com
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319