[Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11

Kasper Daniel Hansen khansen at stat.berkeley.edu
Mon Jun 1 20:33:22 CEST 2009


On Jun 1, 2009, at 11:16 , Hervé Pagès wrote:

> Hi Paul,
>
> Thanks for the feedback! I just set up a script that does something  
> similar
> to your code and was able to trigger the spooky/infamous bug. The  
> first time
> after 140 iterations (55 sec.), the second time after 73, etc...  
> Seems like
> I always get it after a few dozens of iterations. This is great  
> because
> that's going to make troubleshooting this issue a little bit easier!
>
> BTW, if you are interested in speeding up your code (and at the same  
> time, you
> might get rid of the spooky bug), you can now take advantage of the  
> fact that
> getSeq() is vectorized (starting with BSgenome >= 1.12.2). Given  
> that there is
> also a vectorized version of countPattern() (it's the  
> vcountPattern() function),
> then you don't need the "for" loop anymore:
>
>  myb.consen.count <- character(nrow(peaks))
>  all.genomic <- DNAStringSet(getSeq(Mmusculus, peaks$chr,
>                                     peaks$start, peaks$end))
>  c.1 <- vcountPattern(myb.dna, all.genomic, fixed=FALSE)
>  c.2 <- vcountPattern(myb.dna.comp, all.genomic, fixed=FALSE)
>  c.3 <- vcountPattern(myb.dna.rev, all.genomic, fixed=FALSE)
>  c.4 <- vcountPattern(myb.dna.comp.rev, all.genomic, fixed=FALSE)
>  ## Now c.1, c.2, c.3 and c.4 are integer vectors of the same length
>  ## as 'all.genomic'
>  the.sum <- c.1 + c.2 + c.3 + c.4
>  the.consen.count <- paste("F:", c.1, "FC:", c.2, "FRev:", c.3,  
> "FCRev:", c.4)
>  myb.consen.count[the.sum > 0] <- the.consen.count[the.sum > 0]
>
> This is thousands of times faster than calling getSeq() and  
> countPattern()
> inside a "for" loop. Also, it doesn't seem to trigger the spooky/ 
> infamous
> bug :-)

In my experience, the bug seems to come if I do a DNAStringSet on a  
Views, many times. I had some old code that did it essentially once  
per gene (always lots of errors) and moved it to once per chromosome  
(ran without problems).

Kasper


More information about the Bioc-sig-sequencing mailing list