[BioC] table for GenomicRanges
Hervé Pagès
hpages at fhcrc.org
Wed Dec 5 21:55:45 CET 2012
Hi,
As Tim pointed out, there is the question of what the output should
be.
Proposal 1: Output is a "table" object.
tableGenomicRanges1 <- function(x)
{
x_levels <- sort(unique(x))
y <- IRanges:::matchIntegerQuads(as.factor(seqnames(x)),
as.factor(strand(x)),
start(x), width(x),
as.factor(seqnames(x_levels)),
as.factor(strand(x_levels)),
start(x_levels), width(x_levels))
ans_names <- paste(seqnames(x_levels),
"[", start(x_levels), ",", end(x_levels), "]",
strand(x_levels), sep="")
ans <- table(factor(y, levels=seq_along(x_levels)))
names(ans) <- ans_names
ans
}
> tableGenomicRanges1(gr)
chr1[6,10]+ chr1[1,5]- chr1[1,10]- chr1[2,6]- chr1[3,7]-
303 333 345 337 345
chr1[101,110]- chr1[102,111]- chr1[103,112]- chr1[5,10]* chr2[2,10]+
327 308 343 308 338
chr2[3,10]+ chr2[4,8]- chr2[5,9]- chr2[6,10]- chr2[104,113]-
324 344 350 355 342
chr2[105,114]- chr2[106,115]- chr2[107,116]- chr2[4,10]* chr3[7,10]+
354 338 318 336 351
chr3[8,10]+ chr3[7,11]- chr3[8,12]- chr3[9,10]- chr3[9,13]-
355 360 357 324 325
chr3[10,10]- chr3[10,14]- chr3[108,117]- chr3[109,118]- chr3[110,119]-
346 307 277 320 330
but this output is not very convenient...
Proposal 2: Output is a GenomicRanges object containing the
sorted "levels". The counts are stored as a metadata col.
tableGenomicRanges2 <- function(x)
{
ans <- sort(unique(x))
names(ans) <- mcols(ans) <- NULL
y <- IRanges:::matchIntegerQuads(as.factor(seqnames(x)),
as.factor(strand(x)),
start(x), width(x),
as.factor(seqnames(x_levels)),
as.factor(strand(x_levels)),
start(x_levels), width(x_levels))
mcols(ans)$count <- tabulate(y, nbins=length(x_levels))
ans
}
> tableGenomicRanges2(gr)
GRanges with 30 ranges and 1 metadata column:
seqnames ranges strand | count
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 [ 6, 10] + | 303
[2] chr1 [ 1, 5] - | 333
[3] chr1 [ 1, 10] - | 345
[4] chr1 [ 2, 6] - | 337
[5] chr1 [ 3, 7] - | 345
[6] chr1 [101, 110] - | 327
[7] chr1 [102, 111] - | 308
[8] chr1 [103, 112] - | 343
[9] chr1 [ 5, 10] * | 308
... ... ... ... ... ...
[22] chr3 [ 7, 11] - | 360
[23] chr3 [ 8, 12] - | 357
[24] chr3 [ 9, 10] - | 324
[25] chr3 [ 9, 13] - | 325
[26] chr3 [ 10, 10] - | 346
[27] chr3 [ 10, 14] - | 307
[28] chr3 [108, 117] - | 277
[29] chr3 [109, 118] - | 320
[30] chr3 [110, 119] - | 330
---
seqlengths:
chr1 chr2 chr3
1000 2000 1500
Since the "table" method for GenomicRanges objects should really return
a "table" object it should do tableGenomicRanges1().
So what do we do for tableGenomicRanges2? Should we provide this
functionality thru a new generic function e.g. count()?
Note that tableGenomicRanges2() not only produces more convenient
output, it's also 4x faster or more than tableGenomicRanges1() on
big objects.
Thanks,
H.
On 12/05/2012 11:18 AM, Michael Lawrence wrote:
> I've often thought that this would be a really useful feature, that is,
> counting the combination of seqname, start, end and strand. This would
> follow the convention of the duplicated() method. All someone (Herve) needs
> to do is make an IRanges:::tableIntegerQuads. In the past, I have hacked
> this together by tabulating a string representation. Given the way R
> internalizes strings, that's not an ideal solution.
>
> Michael
>
> On Tue, Dec 4, 2012 at 4:25 PM, Vedran Franke <vfranke at bioinfo.hr> wrote:
>
>> Dear All,
>>
>> Is there an equivalent of the table function for a GenomicRanges object?
>>
>> Best,
>>
>> Vedran
>>
>>
>> --
>> Vedran Franke
>>
>> Bioinformatics Group,
>> Department of Molecular Biology,
>> Faculty of Science, Zagreb
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list