[Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11

Hervé Pagès hpages at fhcrc.org
Fri May 29 02:21:15 CEST 2009


Ivan,

With BSgenome 1.12.1 (release) and 1.13.5 (devel) you can now do:

   myseqs <- data.frame(
     chr=c("chrY", "chr1", "chr2", "chr3", "chrY", "chr3", "chr1", "chr1"),
     start=c(NA, -40, 8510201, 4920301, 30001, 9220500, -2804, -30),
     end=c(50, NA, 8510220, 4920330, 30011, 9220555, -2801, -11)
   )

   library(BSgenome.Mmusculus.UCSC.mm9)

   > getSeq(Mmusculus, myseqs$chr, myseqs$start, myseqs$end)
   [1] "GATCCAAAACACATTCTCCCTGGTAGCATGGACAAGCAACATTTTGGGAG"
   [2] "TTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC"
   [3] "ACGACTATAAAAACCTTTAG"
   [4] "CATACAATAATTGTGGGGGAACTTCAAAAC"
   [5] "ATCTTAATCAC"
   [6] "CAGTAGTGGCGTACACCTTTAATCCCAGCACGTGGTAGGCAGAGGCAGATGGATTT"
   [7] "ATGA"
   [8] "AATTTGGTATTAAACTTAAA"

to extract multiple subsequences from multiple chromosomes at once.
(Note support for NAs and negative start or end.)

It takes about 35 sec. to extract 1 million short subsequences so it's
much faster than the sapply() solution I proposed earlier but there is
still room for improvement. Also, the extracted sequences are currently
returned as a character vector, not a DNAStringSet.
But addressing these 2 issues (speed + DNAStringSet) will require that
we first put in place an efficient implementation for combining several
DNAStringSet objects (this is an important piece of the infrastructure).
So that will take a few weeks.

Hopefully this time you won't get hit by the infamous bug you reported
earlier (BTW anything new on that front? Were you able to reproduce it?
Thanks).

Cheers,
H.

PS: The latest BSgenome software packages will propagate to the public
repos in the next 24 hours.


Ivan Gregoretti wrote:
> I didn't notice that I was emailing the individuals rather than the
> list!!!!!!
> 
> Briefly, THANKS TO EVERYONE.
> 
> Second, I managed to call and display all my sequences like this
> 
> sapply(1:dim(A)[1], function(i) paste(A$chr[i],' ',A$start[i],' ',A$end[i],'
> tags: ',A$tags[i],' ',getSeq(Mmusculus, as.character(A$chr[i]), A$start[i],
> A$end[i])))
> 
> As you see, it's made up of ideas from all three respondents. (Michael's
> solution actually gave me an idea to solve some other problem not mentioned
> here. So, nothing gets wasted.)
> 
> Sorry for the off-list emailing issue.
> 
> Ivan
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list