[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Tue Jun 30 18:49:48 CEST 2009

Michael Lawrence wrote:
> On Tue, Jun 30, 2009 at 8:32 AM, Simon Anders <anders at ebi.ac.uk> wrote:
> 
>> Hi Michael et al.
>>
>> Assume I have a list of AlignedRead objects, with data from several Solexa
>> lanes. I would like to get a coverage vector over all the lanes.
>>
>> As 'coverage' takes only one AlignedRead object, I have two possibilities:
>>
>> (a) Concatenate the AlignedRead objects to a single big one. As far as I
>> can see, 'rbind2' is not implemented for AlignedRead, and 'append' seemed
>> very slow to me. It is probably not posible to do this without making a copy
>> of all the data in memory.
>>
> 
> I think this is what combineLanes() in the chipseq package does, and it
> seems to work fine for us. Note that it takes a GenomeDataList, rather than
> a list of AlignedReads, but ShortRead makes it easy to get a GenomeDataList,
> if I remember right.

GenomeData is under-specified in terms of its content, and combineLanes
is I think expecting distinct plus and minus strand information. This
also makes it difficult to implement a '+' on GenomeData.

I have a "Reduce"-like function I'll add to BSgenome later today.

Martin

> Of course, optimization is welcome.
> 
> 
>> (b) Calculate the coverage for each AlignedRead object separately and add
>> up the GenomeData objects. The `+` operator is not supported for GenomeData
>> objects but it is for Rle objects. So I need to loop through the
>> chromosomes.
>>
> 
>> I've now written this short function for the purpose:
>>
>> sumUpCoverage <- function( lanes, seqLens )
>> {
>>   res <- NULL
>>   for( i in 1:length(lanes) ) {
>>      cvg <- coverage( lanes, width=seqLens )
>>      if( is.null( res ) )
>>         res <- cvg
>>      else {
>>         stopifnot( all( names(res) == names(cvg) ) )
>>         for( seq in names(res) )
>>            res[[seq]] <- res[[seq]] + cvg[[seq]]
>>      }
>>   }
>>   res
>> }
>>
>> This does not seem very elegant (and it takes 9 minutes, which, however,
>> might be ok for 29 mio reads). Any idea how to do it better? (The use of
>> 'for' instead of 'sapply' is on purpose: I hope to save memory that way.)
>>
>> And would it make sense to overload the `+` operator for GenomeData
>> objects? If so, could I suggest adding this to BSgenome?
>>
>> Cheers
>>  Simon
>>
>>
>>
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing