[Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11

Paul Leo p.leo at uq.edu.au
Mon Jun 1 11:32:09 CEST 2009


Sorry if this is not on topic but I think I can generate *this* spooky
bug with this simple code, I post in case it helps:

library(BSgenome.Mmusculus.UCSC.mm9)
all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends)

myb<-"YAACKG"        
myb.dna<-DNAString(myb)
myb.dna.rev<-reverse(myb.dna)
myb.dna.comp<-complement(myb.dna)
myb.dna.comp.rev<-reverseComplement(myb.dna)

### peaks is just a set of regions - not large but lots of them 
#> peaks[1:5,1:4]
 #     chr     start       end length
#7568   14 119073660 119074474    815
#11980  19  41922512  41923422    911
#9375   16  70351968  70352514    547
#158     1  33635080  33636130   1051
#15576   3 151996276 151997236    961

for(i in 1:dim(peaks)[1]){

genomic<-getSeq(Mmusculus,the.chrom[i],start=starts[i],end=ends[i],as.character=FALSE)
print(length(genomic)  )

c.1<-countPattern(myb.dna,genomic,max.mismatch=0, fixed=FALSE)
c.2<-countPattern(myb.dna.comp,genomic,max.mismatch=0, fixed=FALSE) 
c.3<-countPattern(myb.dna.rev,genomic,max.mismatch=0, fixed=FALSE)
c.4<-countPattern(myb.dna.comp.rev,genomic,max.mismatch=0, fixed=FALSE)
the.sum<-(c.1+c.2+c.3+c.4)
print(paste("counter:",i,"sites:",the.sum,sep=" "))
if( the.sum >
0){myb.consen.count[i,]<-paste("F:",c.1,"FC:",c.2,"FRev:",c.3,"FCRev:",c.4,sep=" ")}
}

This loop fails randomly ever few hundred iterations with no apparent
reason all the variables are fine , you can restart at the error point
and it goes fine for another few hundred then dies again....

eg

first dies at 
counter 103, 
.......
[1] "counter: 102 sites: 5"
[1] 729
Error in assign(".nextMethod", nextMethod, envir = callEnv) : 
  formal argument "envir" matched by multiple actual arguments
Error in isEmpty(xx) : 
  error in evaluating the argument 'x' in selecting a method for
function 'isEmpty'
Error in countPattern(pattern, toXStringViewsOrXString(subject),
algorithm = algorithm,  : 
  error in evaluating the argument 'subject' in selecting a method for
function 'countPattern'
> genomic
  729-letter "MaskedDNAString" instance (# for masking)
seq:
ACCAGGCTCCGGACGCCTTCTCCCCAGCGTTTTGCA...ATTGGGTTGGAAACTTTGTAGGAAGGAAAGGAAAAT
masks:

#### Genomic is the corect length and is ok..straight away can do:

> c.1<-countPattern(myb.dna,genomic,max.mismatch=0, fixed=FALSE)
> c.2<-countPattern(myb.dna.comp,genomic,max.mismatch=0, fixed=FALSE)
#SAME matchPattern(myb.dna,complement(genomic),max.mismatch=0,
fixed=FALSE)  
> c.3<-countPattern(myb.dna.rev,genomic,max.mismatch=0, fixed=FALSE)
> c.4<-countPattern(myb.dna.comp.rev,genomic,max.mismatch=0,
fixed=FALSE)
> the.sum<-(c.1+c.2+c.3+c.4)
> print(paste("counter:",i,"sites:",the.sum,sep=" "))
[1] "counter: 103 sites: 3"
> 
### see no problem what bug???


###### restart at i=103...it dies again at 604

[1] 737
[1] "counter: 604 sites: 3"
[1] 834
Error in assign(".defined", method at defined, envir = envir) : 
  formal argument "envir" matched by multiple actual arguments


## I will use Herves last post to get about this.....Can run on the ##
development version too if that helps??

> sessionInfo()
R version 2.9.0 (2009-04-17) 
x86_64-pc-linux-gnu 

locale:
LC_CTYPE=en_AU.UTF-8;LC_NUMERIC=C;LC_TIME=en_AU.UTF-8;LC_COLLATE=en_AU.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_AU.UTF-8;LC_PAPER=en_AU.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_AU.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
base     

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11
BSgenome_1.12.0                   
 [3] Biostrings_2.12.0
IRanges_1.2.1                     
 [5] KEGG.db_2.2.11
RSQLite_0.7-1                     
 [7] DBI_0.2-4
AnnotationDbi_1.6.0               
 [9] Biobase_2.4.0
biomaRt_2.0.0                     

loaded via a namespace (and not attached):
[1] RCurl_0.94-1 tcltk_2.9.0  tools_2.9.0  XML_1.95-3  
> 


-  
Dr Paul Leo
Bioinformatician
Diamantina Institute for Cancer, Immunology and Metabolic Medicine
University of Queensland
--------------------------------------------------------------------------------------
Research Wing, Bldg 1
Princess Alexandria Hospital 
Woolloongabba, QLD, 4102
Tel: +61 7 3240 7740  Mob: 041 303 8691  Fax: +61 7 3240 5946
Email: p.leo at uq.edu.au   Web: http://www.di.uq.edu.au


-----Original Message-----
From: Hervé Pagès <hpages at fhcrc.org>
To: Ivan Gregoretti <ivangreg at gmail.com>
Cc: bioc-sig-sequencing at r-project.org
Subject: Re: [Bioc-sig-seq] Extracting DNA sequences from
BSgenome.Mmusculus.UCSC.mm9_1.3.11
Date: Thu, 28 May 2009 17:21:15 -0700

Ivan,

With BSgenome 1.12.1 (release) and 1.13.5 (devel) you can now do:

   myseqs <- data.frame(
     chr=c("chrY", "chr1", "chr2", "chr3", "chrY", "chr3", "chr1", "chr1"),
     start=c(NA, -40, 8510201, 4920301, 30001, 9220500, -2804, -30),
     end=c(50, NA, 8510220, 4920330, 30011, 9220555, -2801, -11)
   )

   library(BSgenome.Mmusculus.UCSC.mm9)

   > getSeq(Mmusculus, myseqs$chr, myseqs$start, myseqs$end)
   [1] "GATCCAAAACACATTCTCCCTGGTAGCATGGACAAGCAACATTTTGGGAG"
   [2] "TTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC"
   [3] "ACGACTATAAAAACCTTTAG"
   [4] "CATACAATAATTGTGGGGGAACTTCAAAAC"
   [5] "ATCTTAATCAC"
   [6] "CAGTAGTGGCGTACACCTTTAATCCCAGCACGTGGTAGGCAGAGGCAGATGGATTT"
   [7] "ATGA"
   [8] "AATTTGGTATTAAACTTAAA"

to extract multiple subsequences from multiple chromosomes at once.
(Note support for NAs and negative start or end.)

It takes about 35 sec. to extract 1 million short subsequences so it's
much faster than the sapply() solution I proposed earlier but there is
still room for improvement. Also, the extracted sequences are currently
returned as a character vector, not a DNAStringSet.
But addressing these 2 issues (speed + DNAStringSet) will require that
we first put in place an efficient implementation for combining several
DNAStringSet objects (this is an important piece of the infrastructure).
So that will take a few weeks.

Hopefully this time you won't get hit by the infamous bug you reported
earlier (BTW anything new on that front? Were you able to reproduce it?
Thanks).

Cheers,
H.

PS: The latest BSgenome software packages will propagate to the public
repos in the next 24 hours.


Ivan Gregoretti wrote:
> I didn't notice that I was emailing the individuals rather than the
> list!!!!!!
> 
> Briefly, THANKS TO EVERYONE.
> 
> Second, I managed to call and display all my sequences like this
> 
> sapply(1:dim(A)[1], function(i) paste(A$chr[i],' ',A$start[i],' ',A$end[i],'
> tags: ',A$tags[i],' ',getSeq(Mmusculus, as.character(A$chr[i]), A$start[i],
> A$end[i])))
> 
> As you see, it's made up of ideas from all three respondents. (Michael's
> solution actually gave me an idea to solve some other problem not mentioned
> here. So, nothing gets wasted.)
> 
> Sorry for the off-list emailing issue.
> 
> Ivan
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



More information about the Bioc-sig-sequencing mailing list