[Bioc-sig-seq] aggregating a RangedData over an IRanges object

Simon Anders anders at ebi.ac.uk
Mon Jul 13 18:56:49 CEST 2009


Hi,

I've got RNA-Seq data in a wiggle file and exon coordinates in a GFF 
file. Reading this in with rtracklayer's import function, I get the 
wiggle file as UCSCData object and the GFF file as RangedData object.

I would now like to get the integral of the coverage for each exon.

So I extract the RangedData inside the UCSCData by coercion and get

   > cvg.bl1 <- as( import( "igb/10-BL1-coverage.wig" ), "RangedData" )
   > cvg.bl1
   RangedData: 253256 ranges by 1 column on 1 sequence
   colnames(1): score
   names(1): chr10

and have an IRanges object with the exon boundaries here:

   > exons <- import( "igb/10.gff" )
   > exons <- exons[ exons$type == "exon", ]
   > ranges(exons)[["10"]]
   IRanges instance:
               start       end width
   [1]         83769     83877   109
   [2]        190766    190883   118
   ...           ...       ...   ...
   [27083] 135347172 135347681   510

Now, I would like to go through each exon interval and sum up the 
coverage, i.e., do something like

   aggregate( cvg.bl1, ranges(exons), sum )

This, however, does not work, as aggregate seems to be unable to deal 
with a RangedData object as first parameter. This here, in contrast, 
does give a result

   aggregate( Rle( 1, 1000000000 ), ranges(exons), sum )

namely a vector of length 27083.

Now, I have three questions:

1. How can I get the vector referring to chr10 in cvg.bl1? Is there a 
way to coerce to Rle or numeric? Supposedly, this only make sense if I 
subset the RangedData object to only contain one chromosome.

2. How can I aggregate over the RangedData object? As it contains only 
one chromosome, it should be possible to coerce it to an Rle vector, but 
there is no coercion method, and, it should be possible anyway to avoid 
this.

3. In the present case, 'ranges(exons)' contains a CompressedIRangesList 
with 1 element, and not a simple IRanges object. If it contains more 
than one element, it seems that aggregate calculates the aggregate 
vector for each element and just concatenated them. Does this really 
make sense?

Cheers
   Simon


 > sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-26 r48838)
x86_64-unknown-linux-gnu

locale:
  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] IRanges_1.3.33    rtracklayer_1.5.7 RCurl_0.98-1      bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] Biobase_2.5.4      Biostrings_2.13.24 BSgenome_1.13.10   tools_2.10.0
[5] XML_2.5-3



More information about the Bioc-sig-sequencing mailing list