[BioC] Transformation of Grange object to density per bin

Wed Jan 23 02:45:19 CET 2013

Hi,

This looks very much like this other thread (subject: GRanges
averaging) from last October:

   https://stat.ethz.ch/pipermail/bioconductor/2012-October/048338.html

Actually Hermann's "density" problem can be seen as a particular case
of the "average per bin" problem where the metadata column to average
contains only 1's.

Here is a proposal for a function that tries to solve the "average per
bin" problem (not sure averagePerBin is a good name for this function):

### 'x': a GenomicRanges objects with non-NA seqlengths.
### 'binsize': a single positive integer.
### 'mcolnames': names of numeric metadata columns in 'x' to "average"
###              i.e. to propagate to the result after averaging them
###              on each bin.
### Returns a GRanges object with: (a) the same seqinfo as 'x',
### (b) ranges of width 'binsize' covering all the sequences in
### 'seqinfo(x)', and (c) the "averaged" metadata columns specified
### in 'mcolnames'.

averagePerBin <- function(x, binsize, mcolnames=NULL)
{
     if (!is(x, "GenomicRanges"))
         stop("'x' must be a GenomicRanges object")
     if (any(is.na(seqlengths(x))))
         stop("'seqlengths(x)' contains NAs")
     bins <- IRangesList(lapply(seqlengths(x),
                                function(seqlen)
                                  IRanges(breakInChunks(seqlen, binsize))))
     ans <- as(bins, "GRanges")
     seqinfo(ans) <- seqinfo(x)
     if (is.null(mcolnames))
         return(ans)
     averageMCol <- function(colname)
     {
         cvg <- coverage(x, weight=colname)
         views_list <- RleViewsList(
                           lapply(names(cvg),
                               function(seqname)
                                   Views(cvg[[seqname]], bins[[seqname]])))
         unlist(viewMeans(views_list), use.names=FALSE)
     }
     mcols(ans) <- DataFrame(lapply(mcols(x)[mcolnames], averageMCol))
     ans
}

Using it to solve Hermann's "density" problem:

   library(BSgenome.Hsapiens.UCSC.hg19)

   ## Set the sequence lengths.
   seqlengths(testset.gr) <- seqlengths(Hsapiens)[seqlevels(testset.gr)]

   ## Add the density metadata col.
   mcols(testset.gr)$density <- 1

   ## Compute the average per bin for the specified metadata cols.
   avg_per_bin <- averagePerBin(testset.gr, 200, mcolnames="density")

   > avg_per_bin[50:54]
   GRanges with 5 ranges and 1 metadata column:
         seqnames         ranges strand |   density
            <Rle>      <IRanges>  <Rle> | <numeric>
     [1]     chr1 [ 9801, 10000]      * |         0
     [2]     chr1 [10001, 10200]      * |     0.285
     [3]     chr1 [10201, 10400]      * |     0.765
     [4]     chr1 [10401, 10600]      * |     0.115
     [5]     chr1 [10601, 10800]      * |         0
     ---
     seqlengths:
           chr1     chr10     chr11     chr12 ...      chr9      chrX 
    chrY
      249250621 135534747 135006516 133851895 ... 141213431 155270560 
59373566

Note that calling averagePerBin() without specifying 'mcolnames' is a
convenient way to generate genomic bins covering the entire genome (and
in that case the supplied GRanges doesn't even need to contain ranges):

   > empty_gr <- GRanges(seqinfo=seqinfo(Hsapiens))
   > empty_gr
   GRanges with 0 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
     ---
     seqlengths:
                       chr1                  chr2 ...        chrUn_gl000249
                  249250621             243199373 ...                 38502
   > averagePerBin(empty_gr, 25000000)
   GRanges with 205 ranges and 0 metadata columns:
                 seqnames                 ranges strand
                    <Rle>              <IRanges>  <Rle>
       [1]           chr1 [        1,  25000000]      *
       [2]           chr1 [ 25000001,  50000000]      *
       [3]           chr1 [ 50000001,  75000000]      *
       [4]           chr1 [ 75000001, 100000000]      *
       [5]           chr1 [100000001, 125000000]      *
       [6]           chr1 [125000001, 150000000]      *
       [7]           chr1 [150000001, 175000000]      *
       [8]           chr1 [175000001, 200000000]      *
       [9]           chr1 [200000001, 225000000]      *
       ...            ...                    ...    ...
     [197] chrUn_gl000241             [1, 42152]      *
     [198] chrUn_gl000242             [1, 43523]      *
     [199] chrUn_gl000243             [1, 43341]      *
     [200] chrUn_gl000244             [1, 39929]      *
     [201] chrUn_gl000245             [1, 36651]      *
     [202] chrUn_gl000246             [1, 38154]      *
     [203] chrUn_gl000247             [1, 36422]      *
     [204] chrUn_gl000248             [1, 39786]      *
     [205] chrUn_gl000249             [1, 38502]      *
     ---
     seqlengths:
                       chr1                  chr2 ...        chrUn_gl000249
                  249250621             243199373 ...                 38502

If that seems useful/general enough, we should add this to the
GenomicRanges package. Could even be made a generic with methods
for IRanges and GenomicRanges. Suggestions?

Cheers,
H.

On 01/21/2013 03:23 PM, Martin Morgan wrote:
> On 01/20/2013 12:45 PM, Hermann Norpois wrote:
>> I am sorry, but I still have a problem. With your syntax I get a vector
>> that is going over all chromosomes. How do I get , for instance separate
>> vectors for each chromosomes?
>
> I wrote Dario's original solution as:
>
> Get seqlengths (probably these should have been added to testset.gr when
> it was created, e.g., coverage(BamFile("myBam"))
>
>    library(BSgenome.Hsapiens.UCSC.hg19)
>    seqlengths(testset.gr) <- seqlengths(Hsapiens)[seqlevels(testset.gr)]
>
> Calculate coverage
>
>    Gcoverage <- coverage(testset.gr)
>
> Here's Dario's function, revised to take a width argument and for the
> first argument, 'x', to be the coverage of a single chromosome
>
>    FUN <- function(x, y, width=200) {
>        starts <- seq(1, y, width)
>        mean(Views(x, start = starts, width=width), na.rm = TRUE)
>    }
>
> Then the calculation over all chromosomes
>
>    coverageMeans <-
>        unlist(Map(FUN, Gcoverage, seqlengths(testset.gr)), use.names=FALSE)
>
> And then to get separate vectors for each chromosome, it's just a matter
> of not unlisting
>
>    binsByChr <- Map(FUN, Gcoverage, seqlengths(testset.gr))
>
> with, e.g.,
>
>    binsByChr$chr1
>
> But maybe I'm not understanding what you're after?
>
> Martin
>
>> Thanks
>>
>> 2013/1/17 Hermann Norpois <hnorpois at googlemail.com>
>>
>>> Hello,
>>>
>>> I would like to transform a GRange-object into something with density
>>> information related of bins, e.g. 1 bin=200bps. So it starts at
>>> poistion 1
>>> etc ...
>>> That means (according to testset.gr):
>>> bin 1    0
>>> bin 2    0
>>> ...
>>> bin 50  0
>>> bin 51    0.28        #10000-10200 => (200-144)/200
>>> bin 52    0.765        #10200-10400 => 153/200
>>> ...
>>> etc
>>> according to testset.gr (see below).
>>>
>>> How does it work? Is there a special function?
>>> Thanks
>>> Hermann
>>>
>>>
>>>> testset.gr
>>> GRanges with 20 ranges and 0 metadata columns:
>>>         seqnames           ranges strand
>>>            <Rle>        <IRanges>  <Rle>
>>>     [1]     chr1 [ 10144,  10353]      *
>>>     [2]     chr1 [ 10441,  10463]      *
>>>     [3]     chr1 [235633, 235766]      *
>>>     [4]     chr1 [237717, 237895]      *
>>>     [5]     chr1 [521444, 521624]      *
>>>     [6]     chr1 [540609, 540786]      *
>>>     [7]     chr1 [564495, 564672]      *
>>>     [8]     chr1 [565254, 566081]      *
>>>     [9]     chr1 [566537, 567272]      *
>>>     ...      ...              ...    ...
>>>    [12]     chr1 [569610, 570190]      *
>>>    [13]     chr1 [601057, 601196]      *
>>>    [14]     chr1 [713227, 713475]      *
>>>    [15]     chr1 [752533, 752578]      *
>>>    [16]     chr1 [752654, 752695]      *
>>>    [17]     chr1 [754405, 754539]      *
>>>    [18]     chr1 [755336, 755918]      *
>>>    [19]     chr1 [756027, 756458]      *
>>>    [20]     chr1 [756793, 756843]      *
>>>    ---
>>>    seqlengths:
>>>      chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7  chr8  chr9
>>> chrX
>>> chrY
>>>        NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA
>>> NA    NA
>>>> dput (testset.gr)
>>> new("GRanges"
>>>      , seqnames = new("Rle"
>>>      , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>> "chr12",
>>> "chr13",
>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>>      , lengths = 20L
>>>      , elementMetadata = NULL
>>>      , metadata = list()
>>> )
>>>      , ranges = new("IRanges"
>>>      , start = c(10144L, 10441L, 235633L, 237717L, 521444L, 540609L,
>>> 564495L,
>>> 565254L, 566537L, 567450L, 568038L, 569610L, 601057L, 713227L,
>>> 752533L, 752654L, 754405L, 755336L, 756027L, 756793L)
>>>      , width = c(210L, 23L, 134L, 179L, 181L, 178L, 178L, 828L, 736L,
>>> 453L,
>>> 545L, 581L, 140L, 249L, 46L, 42L, 135L, 583L, 432L, 51L)
>>>      , NAMES = NULL
>>>      , elementType = "integer"
>>>      , elementMetadata = NULL
>>>      , metadata = list()
>>> )
>>>      , strand = new("Rle"
>>>      , values = structure(3L, .Label = c("+", "-", "*"), class =
>>> "factor")
>>>      , lengths = 20L
>>>      , elementMetadata = NULL
>>>      , metadata = list()
>>> )
>>>      , elementMetadata = new("DataFrame"
>>>      , rownames = NULL
>>>      , nrows = 20L
>>>      , listData = structure(list(), .Names = character(0))
>>>      , elementType = "ANY"
>>>      , elementMetadata = NULL
>>>      , metadata = list()
>>> )
>>>      , seqinfo = new("Seqinfo"
>>>      , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14",
>>> "chr15",
>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>> "chrX", "chrY")
>>>      , seqlengths = c(NA_integer_, NA_integer_, NA_integer_,
>>> NA_integer_,
>>> NA_integer_,
>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>>      , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> NA, NA,
>>> NA, NA,
>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>>      , genome = c(NA_character_, NA_character_, NA_character_,
>>> NA_character_,
>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>> NA_character_,
>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>> NA_character_,
>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>> NA_character_,
>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>> NA_character_
>>> )
>>> )
>>>      , metadata = list()
>>> )
>>>>
>>>
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319