[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Simon Anders anders at ebi.ac.uk
Sun Jul 5 16:16:13 CEST 2009


Hi,

sorry for sending so many mails, but  my thought process seems to be a
bit iterative today.

I'd like to add to my previous mail with an observation regarding
performance.

Let's say we define a group generic as suggested:

Simon Anders wrote:
> What about adding a group generic as follows?
> 
> setMethod( "Arith", signature( e1 = "GenomeData", e2 = "GenomeData" ),
>    function( e1, e2 )
>       gdreduce(  # or: GenomeData( Map(
>          function( r1, r2 ) get(.Generic)( r1, r2 ),
>          e1, e2 ) )

I'll now take the Braski et al. H3K4me1 data from my vignette, and after
some filtering, I'll work with the variable 'lanes', which is a list of
seven AlignedData objects, witha total of 27 mio reads:

> sapply( lanes, length )
run13_lane4.map run15_lane2.map run16_lane2.map run16_lane3.map
         839361         3140651         3867821         3934167
 run2_lane6.map  run2_lane7.map  run2_lane8.map
        4930304         5588437         5092628
> sum( sapply( lanes, length ) )
[1] 27393369


Given the `+` method defined above, the natural way to get the coverage
of all these lanes together is the following simple line:

cvg1 <- Reduce( `+`, lapply( lanes, coverage, width=seqlens ) )


This is elegant and standard functional-programming style, but it is
slow, compared to a for loop:

> system.time(
+    cvg1 <- Reduce( `+`, lapply( lanes, coverage, width=seqlens ) )
+ )
   user  system elapsed
374.873 104.028 478.975

> system.time({
+    cvg2 <- GenomeData( list() )
+    for( i in length(lanes) )
+       cvg2 <- cvg2 + coverage( lanes[[i]], width=seqlens )
+ })
   user  system elapsed
 56.551   9.050  65.634


Somehow, this is not very satisfying. To me, it suggests that one really
needs in-place operations to deal with such huge amounts of data.


  Simon



More information about the Bioc-sig-sequencing mailing list