[BioC] Transformation of Grange object to density per bin
Hervé Pagès
hpages at fhcrc.org
Wed Jan 23 02:45:19 CET 2013
Hi,
This looks very much like this other thread (subject: GRanges
averaging) from last October:
https://stat.ethz.ch/pipermail/bioconductor/2012-October/048338.html
Actually Hermann's "density" problem can be seen as a particular case
of the "average per bin" problem where the metadata column to average
contains only 1's.
Here is a proposal for a function that tries to solve the "average per
bin" problem (not sure averagePerBin is a good name for this function):
### 'x': a GenomicRanges objects with non-NA seqlengths.
### 'binsize': a single positive integer.
### 'mcolnames': names of numeric metadata columns in 'x' to "average"
### i.e. to propagate to the result after averaging them
### on each bin.
### Returns a GRanges object with: (a) the same seqinfo as 'x',
### (b) ranges of width 'binsize' covering all the sequences in
### 'seqinfo(x)', and (c) the "averaged" metadata columns specified
### in 'mcolnames'.
averagePerBin <- function(x, binsize, mcolnames=NULL)
{
if (!is(x, "GenomicRanges"))
stop("'x' must be a GenomicRanges object")
if (any(is.na(seqlengths(x))))
stop("'seqlengths(x)' contains NAs")
bins <- IRangesList(lapply(seqlengths(x),
function(seqlen)
IRanges(breakInChunks(seqlen, binsize))))
ans <- as(bins, "GRanges")
seqinfo(ans) <- seqinfo(x)
if (is.null(mcolnames))
return(ans)
averageMCol <- function(colname)
{
cvg <- coverage(x, weight=colname)
views_list <- RleViewsList(
lapply(names(cvg),
function(seqname)
Views(cvg[[seqname]], bins[[seqname]])))
unlist(viewMeans(views_list), use.names=FALSE)
}
mcols(ans) <- DataFrame(lapply(mcols(x)[mcolnames], averageMCol))
ans
}
Using it to solve Hermann's "density" problem:
library(BSgenome.Hsapiens.UCSC.hg19)
## Set the sequence lengths.
seqlengths(testset.gr) <- seqlengths(Hsapiens)[seqlevels(testset.gr)]
## Add the density metadata col.
mcols(testset.gr)$density <- 1
## Compute the average per bin for the specified metadata cols.
avg_per_bin <- averagePerBin(testset.gr, 200, mcolnames="density")
> avg_per_bin[50:54]
GRanges with 5 ranges and 1 metadata column:
seqnames ranges strand | density
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 [ 9801, 10000] * | 0
[2] chr1 [10001, 10200] * | 0.285
[3] chr1 [10201, 10400] * | 0.765
[4] chr1 [10401, 10600] * | 0.115
[5] chr1 [10601, 10800] * | 0
---
seqlengths:
chr1 chr10 chr11 chr12 ... chr9 chrX
chrY
249250621 135534747 135006516 133851895 ... 141213431 155270560
59373566
Note that calling averagePerBin() without specifying 'mcolnames' is a
convenient way to generate genomic bins covering the entire genome (and
in that case the supplied GRanges doesn't even need to contain ranges):
> empty_gr <- GRanges(seqinfo=seqinfo(Hsapiens))
> empty_gr
GRanges with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
---
seqlengths:
chr1 chr2 ... chrUn_gl000249
249250621 243199373 ... 38502
> averagePerBin(empty_gr, 25000000)
GRanges with 205 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 1, 25000000] *
[2] chr1 [ 25000001, 50000000] *
[3] chr1 [ 50000001, 75000000] *
[4] chr1 [ 75000001, 100000000] *
[5] chr1 [100000001, 125000000] *
[6] chr1 [125000001, 150000000] *
[7] chr1 [150000001, 175000000] *
[8] chr1 [175000001, 200000000] *
[9] chr1 [200000001, 225000000] *
... ... ... ...
[197] chrUn_gl000241 [1, 42152] *
[198] chrUn_gl000242 [1, 43523] *
[199] chrUn_gl000243 [1, 43341] *
[200] chrUn_gl000244 [1, 39929] *
[201] chrUn_gl000245 [1, 36651] *
[202] chrUn_gl000246 [1, 38154] *
[203] chrUn_gl000247 [1, 36422] *
[204] chrUn_gl000248 [1, 39786] *
[205] chrUn_gl000249 [1, 38502] *
---
seqlengths:
chr1 chr2 ... chrUn_gl000249
249250621 243199373 ... 38502
If that seems useful/general enough, we should add this to the
GenomicRanges package. Could even be made a generic with methods
for IRanges and GenomicRanges. Suggestions?
Cheers,
H.
On 01/21/2013 03:23 PM, Martin Morgan wrote:
> On 01/20/2013 12:45 PM, Hermann Norpois wrote:
>> I am sorry, but I still have a problem. With your syntax I get a vector
>> that is going over all chromosomes. How do I get , for instance separate
>> vectors for each chromosomes?
>
> I wrote Dario's original solution as:
>
> Get seqlengths (probably these should have been added to testset.gr when
> it was created, e.g., coverage(BamFile("myBam"))
>
> library(BSgenome.Hsapiens.UCSC.hg19)
> seqlengths(testset.gr) <- seqlengths(Hsapiens)[seqlevels(testset.gr)]
>
> Calculate coverage
>
> Gcoverage <- coverage(testset.gr)
>
> Here's Dario's function, revised to take a width argument and for the
> first argument, 'x', to be the coverage of a single chromosome
>
> FUN <- function(x, y, width=200) {
> starts <- seq(1, y, width)
> mean(Views(x, start = starts, width=width), na.rm = TRUE)
> }
>
> Then the calculation over all chromosomes
>
> coverageMeans <-
> unlist(Map(FUN, Gcoverage, seqlengths(testset.gr)), use.names=FALSE)
>
> And then to get separate vectors for each chromosome, it's just a matter
> of not unlisting
>
> binsByChr <- Map(FUN, Gcoverage, seqlengths(testset.gr))
>
> with, e.g.,
>
> binsByChr$chr1
>
> But maybe I'm not understanding what you're after?
>
> Martin
>
>> Thanks
>>
>> 2013/1/17 Hermann Norpois <hnorpois at googlemail.com>
>>
>>> Hello,
>>>
>>> I would like to transform a GRange-object into something with density
>>> information related of bins, e.g. 1 bin=200bps. So it starts at
>>> poistion 1
>>> etc ...
>>> That means (according to testset.gr):
>>> bin 1 0
>>> bin 2 0
>>> ...
>>> bin 50 0
>>> bin 51 0.28 #10000-10200 => (200-144)/200
>>> bin 52 0.765 #10200-10400 => 153/200
>>> ...
>>> etc
>>> according to testset.gr (see below).
>>>
>>> How does it work? Is there a special function?
>>> Thanks
>>> Hermann
>>>
>>>
>>>> testset.gr
>>> GRanges with 20 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr1 [ 10144, 10353] *
>>> [2] chr1 [ 10441, 10463] *
>>> [3] chr1 [235633, 235766] *
>>> [4] chr1 [237717, 237895] *
>>> [5] chr1 [521444, 521624] *
>>> [6] chr1 [540609, 540786] *
>>> [7] chr1 [564495, 564672] *
>>> [8] chr1 [565254, 566081] *
>>> [9] chr1 [566537, 567272] *
>>> ... ... ... ...
>>> [12] chr1 [569610, 570190] *
>>> [13] chr1 [601057, 601196] *
>>> [14] chr1 [713227, 713475] *
>>> [15] chr1 [752533, 752578] *
>>> [16] chr1 [752654, 752695] *
>>> [17] chr1 [754405, 754539] *
>>> [18] chr1 [755336, 755918] *
>>> [19] chr1 [756027, 756458] *
>>> [20] chr1 [756793, 756843] *
>>> ---
>>> seqlengths:
>>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr6 chr7 chr8 chr9
>>> chrX
>>> chrY
>>> NA NA NA NA NA NA ... NA NA NA NA
>>> NA NA
>>>> dput (testset.gr)
>>> new("GRanges"
>>> , seqnames = new("Rle"
>>> , values = structure(1L, .Label = c("chr1", "chr10", "chr11",
>>> "chr12",
>>> "chr13",
>>> "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2",
>>> "chr20", "chr21", "chr22", "chr3", "chr4", "chr5", "chr6", "chr7",
>>> "chr8", "chr9", "chrX", "chrY"), class = "factor")
>>> , lengths = 20L
>>> , elementMetadata = NULL
>>> , metadata = list()
>>> )
>>> , ranges = new("IRanges"
>>> , start = c(10144L, 10441L, 235633L, 237717L, 521444L, 540609L,
>>> 564495L,
>>> 565254L, 566537L, 567450L, 568038L, 569610L, 601057L, 713227L,
>>> 752533L, 752654L, 754405L, 755336L, 756027L, 756793L)
>>> , width = c(210L, 23L, 134L, 179L, 181L, 178L, 178L, 828L, 736L,
>>> 453L,
>>> 545L, 581L, 140L, 249L, 46L, 42L, 135L, 583L, 432L, 51L)
>>> , NAMES = NULL
>>> , elementType = "integer"
>>> , elementMetadata = NULL
>>> , metadata = list()
>>> )
>>> , strand = new("Rle"
>>> , values = structure(3L, .Label = c("+", "-", "*"), class =
>>> "factor")
>>> , lengths = 20L
>>> , elementMetadata = NULL
>>> , metadata = list()
>>> )
>>> , elementMetadata = new("DataFrame"
>>> , rownames = NULL
>>> , nrows = 20L
>>> , listData = structure(list(), .Names = character(0))
>>> , elementType = "ANY"
>>> , elementMetadata = NULL
>>> , metadata = list()
>>> )
>>> , seqinfo = new("Seqinfo"
>>> , seqnames = c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14",
>>> "chr15",
>>> "chr16", "chr17", "chr18", "chr19", "chr2", "chr20", "chr21",
>>> "chr22", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
>>> "chrX", "chrY")
>>> , seqlengths = c(NA_integer_, NA_integer_, NA_integer_,
>>> NA_integer_,
>>> NA_integer_,
>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
>>> NA_integer_, NA_integer_, NA_integer_, NA_integer_)
>>> , is_circular = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> NA, NA,
>>> NA, NA,
>>> NA, NA, NA, NA, NA, NA, NA, NA, NA)
>>> , genome = c(NA_character_, NA_character_, NA_character_,
>>> NA_character_,
>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>> NA_character_,
>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>> NA_character_,
>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>> NA_character_,
>>> NA_character_, NA_character_, NA_character_, NA_character_,
>>> NA_character_
>>> )
>>> )
>>> , metadata = list()
>>> )
>>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list