[Bioc-sig-seq] Pruning rows from a RangedData object

Patrick Aboyoun paboyoun at fhcrc.org
Sat Jul 18 18:09:51 CEST 2009


Simon,
I found a faster approach to obtaining the duplicate rows by ranges 
within space of RangedData objects. Rather than loop over the RangedData 
object x, it is faster to loop over the RangesList ranges(x).

 > suppressMessages(library(IRanges))
 > load("exons0.rda")

 > system.time({
+ dupeRows1 <- unlist(sapply(exons0, function(a)
+    duplicated(ranges(a)[[1]])))
+ exons1 <- exons0[dupeRows1, ]
+ })
   user  system elapsed
 12.848   2.368  15.327

 > system.time({
+ dupeRows2 <- unlist(lapply(ranges(exons0), duplicated))
+ exons2 <- exons0[dupeRows2, ]
+ })
   user  system elapsed
  3.768   0.616   4.502

 > identical(exons1, exons2)
[1] TRUE

 > sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] IRanges_1.3.39



Patrick



Patrick Aboyoun wrote:
> Simon,
> You found a bug. I have fixed it in BioC 2.5 (devel) and it will be 
> available from bioconductor.org in 24 hours, but you can get it from 
> svn now.
>
> I think the rdapply function was intended for performing filtering on 
> RangedData objects, but I haven't used it too much.
>
>
> > suppressMessages(library(IRanges))
> > load("exons0.rda")
> > dupeRows <- unlist(sapply(exons0, function(a)
> +   duplicated(ranges(a)[[1]])))
> > exons1 <- exons0[ dupeRows, ]
> > exons1["chr01"]
> RangedData: 34312 ranges by 5 columns on 1 sequence
> colnames(5): type source phase strand group
> names(1): chr01
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
> i386-apple-darwin9.7.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base   
> other attached packages:
> [1] IRanges_1.3.38
>
>
>
>
> Simon Anders wrote:
>> Hi Patrick et al.
>>
>> do you have any suggestions on the following?
>>
>> I've got an RangesList object 'exons0' that I created from a GFF file 
>> for the human genome. This GFF file contains all transcripts, and 
>> lists all those exons that appear in multiple transcripts multiple 
>> times. I would like to filter them out.
>>
>> I was pleased to see that you redefined 'duplicated' for RangedData, 
>> which allowed me to find the rows in the IRanges object that are 
>> duplicates. But how do I prune them?
>>
>> My first try was this here:
>>
>>   dupeRows <- unlist( sapply( exons0, function(a)
>>      duplicated(ranges(a)[[1]]) ) )
>>   exons1 <- exons0[ dupeRows, ]
>>
>> This seem to do the job:
>>
>>   > exons0
>>   RangedData: 507249 ranges by 5 columns on 25 sequences
>>   colnames(5): type source phase strand group
>>   names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
>>   chrX chrY
>>
>>   > dupeRows <- unlist( sapply( exons0, function(a)
>>   +    duplicated(ranges(a)[[1]]) ) )
>>   > exons1 <- exons0[ dupeRows, ]
>>
>>   > exons1
>>   RangedData: 253143 ranges by 5 columns on 25 sequences
>>   colnames(5): type source phase strand group
>>   names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
>>   chrX chrY
>>
>> However, the resulting object behaves oddly. Compare:
>>
>>   > exons0["chr01"]
>>   RangedData: 61840 ranges by 5 columns on 1 sequence
>>   colnames(5): type source phase strand group
>>   names(1): chr01
>>
>>   > exons1["chr01"]
>>   Error in values[i] : mismatching names (and NULL elements not allowed)
>>
>> What's going on here?
>>
>>
>> I've now used this command here instead, which does the job, but 
>> looks quite unwieldy and is very slow:
>>
>>   exons <-
>>   do.call( c, unname( lapply( exons0, function(a)
>>      a[ !duplicated( ranges(a)[[1]] ), ] ) ) )
>>
>>
>> In case you want to try this yourself, you can find the 'exon0' 
>> object here: http://www.ebi.ac.uk/~anders/tmp/exons0.rda
>>
>>
>> Cheers
>>   Simon
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



More information about the Bioc-sig-sequencing mailing list