[BioC] advice on Biostrings

Thu Feb 23 15:04:04 CET 2006

thanks everybody. this made it much faster:

tmpseq <- DNAString(pmseq)
GC <- DNAString("GC")
CG <- DNAString("CG")
tmp <- sapply(1:length(tmpseq),function(i){
  y=tmpseq[i]
  c(alphabetFrequency(y)[2:5],
    length(matchDNAPattern(GC,y)),
    length(matchDNAPattern(CG,y)))
})

Martin Morgan wrote:

>Actually, this function in Biostrings uses one of those fiendishly
>clever algorithms coded in C, and as a consequence is *very* fast: on
>my puny windows laptop, searching a random sequence of 10 million bp
>for "GC" returns more than 1/2 million matches in about a 10th of a
>second:
>
>  
>
>>seq <- paste( dna[ runif( 10000000, min=1, max=5 )], collapse="" )
>>system.time( res <- length(matchDNAPattern( "GC", seq ))); res
>>    
>>
>[1] 0.13 0.00 0.12   NA   NA
>[1] 626393
>
>Maybe your speed issues are related to R, or to memory management, or
>'user error' ;) ?. Here are suggestions:
>
>* If the match pattern "GC" is a character, it is converted to a
>DNAString by matchDNAPattern. Short-circuit this step by
>
>GC <- DNAString("GC")
>
>outside the sapply, and then
>
>matchDNAPattern( GC, y )
>
>This seems to actually make quite a bit of difference:
>
>  
>
>>seqs <- lapply( 1:1000, function(i) paste( dna[ runif( 100, min=1, max=5 )], collapse="" ))
>>system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( "GC", s ))))
>>    
>>
>[1] 27.52  0.01 27.61    NA    NA
>  
>
>>GC <- DNAString("GC")
>>system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( GC, s ))))
>>    
>>
>[1] 16.53  0.00 16.56    NA    NA
>  
>
>
>* Make sure your function is actually doing what you want. My mistake
>in exploring this was to write
>
>  
>
>>seq <- paste( dna[ runif( 10000000, min=1, max=5 )])
>>    
>>
>
>(i.e., without the 'collapse' argument). this (tried to) create a
>vector with 10 million single characters. The sapply would then have
>looped over these, rather than the single string I intended!
>
>* Do the calculations in parallel
>
>  
>
>>library(snow)
>>cl <- makeCluster(4, "MPI")
>>clusterEvalQ(cl, library(Biostrings))
>>system.time(res <- parSapply( cl, seqs, function(s) length(matchDNAPattern( GC, s ))))
>>    
>>
>[1] 0.01 0.00 4.56 0.00 0.00
>
>!
>
>* Perhaps your R is spending a lot of time swapping virtual memory to
>disk. Try garbage collection
>
>gc()
>
>before performing the analysis. If this doesn't halp, and it seems
>like memory really is an issue, then perhaps reading the strings in a
>chunk at a time would be necessary. You might alos try separating the
>sapply into three separate calls, one for each function. I doubt this
>is really a problem, unless you're dealing with *very* large
>collections of sequence.
>
>* I found that analyzing some number of bp in a few strings is much
>faster than analyzing the same total number of bp in many strings,
>i.e., the bottleneck is actually in sapply. There are apparently
>tricks to enhancing sapply speed, though my exploration in the present
>context didn't provide any dramatic results.
>
>Hope these suggestions help...
>
>Martin
>
>Wolfgang Huber <huber at ebi.ac.uk> writes:
>
>  
>
>>Rafael A Irizarry wrote:
>>    
>>
>>>hi im using biostrings to count base content as well as pair of bases 
>>>content. im using the following sniped of code:
>>>
>>>      
>>>
>>Hi Rafa,
>>
>>to count symbols in character vectors, matchprobes:basecontent is fast:
>>
>>library(matchprobes)
>>v   = c("AAACT", "GGGTT", "ggAtT")
>>bc  = basecontent(v)
>>print.default(bc)
>>bc[,"C"]+bc[,"G"]
>>
>>and if there is interest I'd be happy amend the C code to also count 
>>pairs of bases (or you could, it is not terribly complicated).
>>
>>  Cheers
>>  Wolfgang
>>
>>    
>>
>>>###pmseq is a vector of character strings (not of the same nchar).
>>>tmp <- sapply(pmseq,function(x){
>>>  y = DNAString(x)
>>>  c(alphabetFrequency(y)[2:5], ##count A,T,G,C
>>>    length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y))) 
>>>##count GC or CG
>>>})
>>>
>>>it is painfully slow. strsplit and grep were much faster for the first 
>>>part (counting bases) but the using grep for the second part was not 
>>>straight forward.
>>>
>>>any suggestions?
>>>      
>>>
>>-------------------------------------
>>Wolfgang Huber
>>European Bioinformatics Institute
>>European Molecular Biology Laboratory
>>Cambridge CB10 1SD
>>England
>>Phone: +44 1223 494642
>>Fax:   +44 1223 494486
>>Http:  www.ebi.ac.uk/huber
>>
>>_______________________________________________
>>Bioconductor mailing list
>>Bioconductor at stat.math.ethz.ch
>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>    
>>
>
>  
>