[Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11

Tue Jun 2 01:28:39 CEST 2009

HI Herve,
Yep thanks I had already got your idea from your previous post and
indeed it rocks, and I was able to finish the job last hight. FYI the
error did not occur with that modification. 

Also I think I have seen the error with this code also: (this I not my
code but something I was trying for someone)

array.labels<-c(1,2,3,5,6,7,8)
alignedLocs<-GenomeDataList()
for (i in length(array.labels))
{
  filename<-paste("aln",array.labels[i],".RData",sep="")
  load(filename)
  alignedLocs[[i]]<-as(aln,"GenomeData")
}
save(alignedLocs, file="alignedLocsTRUE.RData")

Now the aln1-8.RData filtes are just filtered and mapped reads from
solexa they range in size from 250-475Mb . If you run the loop the
Genome List was empty but if I run the loop one "i" at a time all is
fine. Now this happened a week or so ago so I will have to verify again
(I was running 2.10 then). But in case you are on the hunt it may be
easier to debug if you can reproduce it...

Cheers
Paul

-----Original Message-----
From: Hervé Pagès <hpages at fhcrc.org>
To: Paul Leo <p.leo at uq.edu.au>
Cc: Ivan Gregoretti <ivangreg at gmail.com>,
bioc-sig-sequencing at r-project.org
Subject: Re: [Bioc-sig-seq] Extracting DNA sequences from
BSgenome.Mmusculus.UCSC.mm9_1.3.11
Date: Mon, 01 Jun 2009 11:16:31 -0700

Hi Paul,

Thanks for the feedback! I just set up a script that does something
similar
to your code and was able to trigger the spooky/infamous bug. The first
time
after 140 iterations (55 sec.), the second time after 73, etc... Seems
like
I always get it after a few dozens of iterations. This is great because
that's going to make troubleshooting this issue a little bit easier!

BTW, if you are interested in speeding up your code (and at the same
time, you
might get rid of the spooky bug), you can now take advantage of the fact
that
getSeq() is vectorized (starting with BSgenome >= 1.12.2). Given that
there is
also a vectorized version of countPattern() (it's the vcountPattern()
function),
then you don't need the "for" loop anymore:

   myb.consen.count <- character(nrow(peaks))
   all.genomic <- DNAStringSet(getSeq(Mmusculus, peaks$chr,
                                      peaks$start, peaks$end))
   c.1 <- vcountPattern(myb.dna, all.genomic, fixed=FALSE)
   c.2 <- vcountPattern(myb.dna.comp, all.genomic, fixed=FALSE)
   c.3 <- vcountPattern(myb.dna.rev, all.genomic, fixed=FALSE)
   c.4 <- vcountPattern(myb.dna.comp.rev, all.genomic, fixed=FALSE)
   ## Now c.1, c.2, c.3 and c.4 are integer vectors of the same length
   ## as 'all.genomic'
   the.sum <- c.1 + c.2 + c.3 + c.4
   the.consen.count <- paste("F:", c.1, "FC:", c.2, "FRev:", c.3,
"FCRev:", c.4)
   myb.consen.count[the.sum > 0] <- the.consen.count[the.sum > 0]

This is thousands of times faster than calling getSeq() and
countPattern()
inside a "for" loop. Also, it doesn't seem to trigger the
spooky/infamous
bug :-)

Thanks again,
H.

Paul Leo wrote:
> Sorry if this is not on topic but I think I can generate *this* spooky
> bug with this simple code, I post in case it helps:
> 
> library(BSgenome.Mmusculus.UCSC.mm9)
> all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends)
> 
> myb<-"YAACKG"        
> myb.dna<-DNAString(myb)
> myb.dna.rev<-reverse(myb.dna)
> myb.dna.comp<-complement(myb.dna)
> myb.dna.comp.rev<-reverseComplement(myb.dna)
> 
> ### peaks is just a set of regions - not large but lots of them 
> #> peaks[1:5,1:4]
>  #     chr     start       end length
> #7568   14 119073660 119074474    815
> #11980  19  41922512  41923422    911
> #9375   16  70351968  70352514    547
> #158     1  33635080  33636130   1051
> #15576   3 151996276 151997236    961
> 
> for(i in 1:dim(peaks)[1]){
> 
>
genomic<-getSeq(Mmusculus,the.chrom[i],start=starts[i],end=ends[i],as.character=FALSE)
> print(length(genomic)  )
> 
> c.1<-countPattern(myb.dna,genomic,max.mismatch=0, fixed=FALSE)
> c.2<-countPattern(myb.dna.comp,genomic,max.mismatch=0, fixed=FALSE) 
> c.3<-countPattern(myb.dna.rev,genomic,max.mismatch=0, fixed=FALSE)
> c.4<-countPattern(myb.dna.comp.rev,genomic,max.mismatch=0,
fixed=FALSE)
> the.sum<-(c.1+c.2+c.3+c.4)
> print(paste("counter:",i,"sites:",the.sum,sep=" "))
> if( the.sum >
>
0){myb.consen.count[i,]<-paste("F:",c.1,"FC:",c.2,"FRev:",c.3,"FCRev:",c.4,sep=" ")}
> }
> 
> This loop fails randomly ever few hundred iterations with no apparent
> reason all the variables are fine , you can restart at the error point
> and it goes fine for another few hundred then dies again....
> 
> eg
> 
> first dies at 
> counter 103, 
> .......
> [1] "counter: 102 sites: 5"
> [1] 729
> Error in assign(".nextMethod", nextMethod, envir = callEnv) : 
>   formal argument "envir" matched by multiple actual arguments
> Error in isEmpty(xx) : 
>   error in evaluating the argument 'x' in selecting a method for
> function 'isEmpty'
> Error in countPattern(pattern, toXStringViewsOrXString(subject),
> algorithm = algorithm,  : 
>   error in evaluating the argument 'subject' in selecting a method for
> function 'countPattern'
>> genomic
>   729-letter "MaskedDNAString" instance (# for masking)
> seq:
>
ACCAGGCTCCGGACGCCTTCTCCCCAGCGTTTTGCA...ATTGGGTTGGAAACTTTGTAGGAAGGAAAGGAAAAT
> masks:
> 
> #### Genomic is the corect length and is ok..straight away can do:
> 
>> c.1<-countPattern(myb.dna,genomic,max.mismatch=0, fixed=FALSE)
>> c.2<-countPattern(myb.dna.comp,genomic,max.mismatch=0, fixed=FALSE)
> #SAME matchPattern(myb.dna,complement(genomic),max.mismatch=0,
> fixed=FALSE)  
>> c.3<-countPattern(myb.dna.rev,genomic,max.mismatch=0, fixed=FALSE)
>> c.4<-countPattern(myb.dna.comp.rev,genomic,max.mismatch=0,
> fixed=FALSE)
>> the.sum<-(c.1+c.2+c.3+c.4)
>> print(paste("counter:",i,"sites:",the.sum,sep=" "))
> [1] "counter: 103 sites: 3"
> ### see no problem what bug???
> 
> 
> ###### restart at i=103...it dies again at 604
> 
> [1] 737
> [1] "counter: 604 sites: 3"
> [1] 834
> Error in assign(".defined", method at defined, envir = envir) : 
>   formal argument "envir" matched by multiple actual arguments
> 
> 
> ## I will use Herves last post to get about this.....Can run on the ##
> development version too if that helps??
> 
>> sessionInfo()
> R version 2.9.0 (2009-04-17) 
> x86_64-pc-linux-gnu 
> 
> locale:
>
LC_CTYPE=en_AU.UTF-8;LC_NUMERIC=C;LC_TIME=en_AU.UTF-8;LC_COLLATE=en_AU.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_AU.UTF-8;LC_PAPER=en_AU.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_AU.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods
> base     
> 
> other attached packages:
>  [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11
> BSgenome_1.12.0                   
>  [3] Biostrings_2.12.0
> IRanges_1.2.1                     
>  [5] KEGG.db_2.2.11
> RSQLite_0.7-1                     
>  [7] DBI_0.2-4
> AnnotationDbi_1.6.0               
>  [9] Biobase_2.4.0
> biomaRt_2.0.0                     
> 
> loaded via a namespace (and not attached):
> [1] RCurl_0.94-1 tcltk_2.9.0  tools_2.9.0  XML_1.95-3  
> 
> 
> -  
> Dr Paul Leo
> Bioinformatician
> Diamantina Institute for Cancer, Immunology and Metabolic Medicine
> University of Queensland
>
--------------------------------------------------------------------------------------
> Research Wing, Bldg 1
> Princess Alexandria Hospital 
> Woolloongabba, QLD, 4102
> Tel: +61 7 3240 7740  Mob: 041 303 8691  Fax: +61 7 3240 5946
> Email: p.leo at uq.edu.au   Web: http://www.di.uq.edu.au
> 
> 
> -----Original Message-----
> From: Hervé Pagès <hpages at fhcrc.org>
> To: Ivan Gregoretti <ivangreg at gmail.com>
> Cc: bioc-sig-sequencing at r-project.org
> Subject: Re: [Bioc-sig-seq] Extracting DNA sequences from
> BSgenome.Mmusculus.UCSC.mm9_1.3.11
> Date: Thu, 28 May 2009 17:21:15 -0700
> 
> Ivan,
> 
> With BSgenome 1.12.1 (release) and 1.13.5 (devel) you can now do:
> 
>    myseqs <- data.frame(
>      chr=c("chrY", "chr1", "chr2", "chr3", "chrY", "chr3", "chr1",
"chr1"),
>      start=c(NA, -40, 8510201, 4920301, 30001, 9220500, -2804, -30),
>      end=c(50, NA, 8510220, 4920330, 30011, 9220555, -2801, -11)
>    )
> 
>    library(BSgenome.Mmusculus.UCSC.mm9)
> 
>    > getSeq(Mmusculus, myseqs$chr, myseqs$start, myseqs$end)
>    [1] "GATCCAAAACACATTCTCCCTGGTAGCATGGACAAGCAACATTTTGGGAG"
>    [2] "TTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC"
>    [3] "ACGACTATAAAAACCTTTAG"
>    [4] "CATACAATAATTGTGGGGGAACTTCAAAAC"
>    [5] "ATCTTAATCAC"
>    [6] "CAGTAGTGGCGTACACCTTTAATCCCAGCACGTGGTAGGCAGAGGCAGATGGATTT"
>    [7] "ATGA"
>    [8] "AATTTGGTATTAAACTTAAA"
> 
> to extract multiple subsequences from multiple chromosomes at once.
> (Note support for NAs and negative start or end.)
> 
> It takes about 35 sec. to extract 1 million short subsequences so it's
> much faster than the sapply() solution I proposed earlier but there is
> still room for improvement. Also, the extracted sequences are
currently
> returned as a character vector, not a DNAStringSet.
> But addressing these 2 issues (speed + DNAStringSet) will require that
> we first put in place an efficient implementation for combining
several
> DNAStringSet objects (this is an important piece of the
infrastructure).
> So that will take a few weeks.
> 
> Hopefully this time you won't get hit by the infamous bug you reported
> earlier (BTW anything new on that front? Were you able to reproduce
it?
> Thanks).
> 
> Cheers,
> H.
> 
> PS: The latest BSgenome software packages will propagate to the public
> repos in the next 24 hours.
> 
> 
> Ivan Gregoretti wrote:
>> I didn't notice that I was emailing the individuals rather than the
>> list!!!!!!
>>
>> Briefly, THANKS TO EVERYONE.
>>
>> Second, I managed to call and display all my sequences like this
>>
>> sapply(1:dim(A)[1], function(i) paste(A$chr[i],' ',A$start[i],' ',A
$end[i],'
>> tags: ',A$tags[i],' ',getSeq(Mmusculus, as.character(A$chr[i]), A
$start[i],
>> A$end[i])))
>>
>> As you see, it's made up of ideas from all three respondents.
(Michael's
>> solution actually gave me an idea to solve some other problem not
mentioned
>> here. So, nothing gets wasted.)
>>
>> Sorry for the off-list emailing issue.
>>
>> Ivan
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
>