[BioC] Restriction enzyme digestion
Herve Pages
hpages at fhcrc.org
Wed May 14 21:05:20 CEST 2008
Hi Anh,
Biostrings 2.8.6 (latest release version) was pushed to the public repository just a
few minutes ago:
http://bioconductor.org/packages/2.2/bioc/html/Biostrings.html
Can you please try again and let me know if there are further problems.
Thanks!
H.
Anh Tran wrote:
> Hi,
> I updated my system to R 2.7.0 but still cannot see the update of the
> Biostrings package (2.8.5). I'm using MacOSX and will test the Windows box.
>
> I'm installing using the script from bioconductor:
>
> source("http://bioconductor.org/biocLite.R")
> biocLite()
>
> Please let me know if there's something I should be paying attention to.
>
> Thanks.
>
> On Tue, May 13, 2008 at 11:23 AM, Anh Tran <popophobia at gmail.com
> <mailto:popophobia at gmail.com>> wrote:
>
> Thank you very much.
>
> I'll check it out tonight. In the mean time, I'll have to set up R
> 2.7.0. <http://2.7.0.>
>
> Thanks again.
>
>
> On Tue, May 13, 2008 at 10:05 AM, Herve Pages <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Hi Anh,
>
> I fixed a few things in Biostrings:
>
> - The "as.character" method for XStringViews (and XStringSet)
> objects is much
> faster now (about 250x faster than yesterday for your use case):
>
>
> library(BSgenome.Hsapiens.UCSC.hg18)
> c1 <- Hsapiens$chr1
> m <- matchPattern('CTAG', c1)
> mgaps <- gaps(m)
> mgaps2 <- as.character(mgaps) # now takes about 5 sec.
> instead of 20 minutes
>
> - As a consequence, the getSeq() function is much faster too (it
> uses as.character()
> internally):
>
> mgaps3_start <- end(m)[-length(m)] + 1L
> mgaps3_end <- start(m)[-1] - 1L
> ## The empty gaps must be dropped because the 'start' and
> 'end' args of getSeq()
> ## must verify 'start' <= 'end'
> nonemptygaps <- mgaps3_start <= mgaps3_end
> mgaps3_start <- mgaps3_start[nonemptygaps]
> mgaps3_end <- mgaps3_end[nonemptygaps]
> mgaps3 <- getSeq(Hsapiens, 'chr1', start=mgaps3_start,
> end=mgaps3_end)
>
> identical(mgaps3, mgaps2) # TRUE
>
> This will be available in BioC 2.2 (Biostrings 2.8.5) tonight
> around midnight
> (Seattle time) for R-2.7 users and in BioC 2.3 (Biostrings
> 2.9.8) around noon
> (Seattle time). As always, install with biocLite().
>
> Please let me know by posting here if you run into other
> problems. Thanks!
>
> H.
>
>
>
> hpages at fhcrc.org <mailto:hpages at fhcrc.org> wrote:
>
> Hi Anh,
>
> Quoting Anh Tran <popophobia at gmail.com
> <mailto:popophobia at gmail.com>>:
>
> Hi all,
>
> I'm new to this mailing list and have a very simple
> question about digestion
> with restriction enzyme for the whole genome.
>
> I'm using package BSgenome.Hsapiens.UCSC.hg18 to find
> the cut site of
> enzyme. Is there a faster way (ie. pre-built package) to
> get the fragment
> sequence and it's location in UCSC genome browser format
> (Chr#: start bp -
> end bp).
>
> I'm planning on finding the restriction site location,
> then find the
> substring between two consecutive cut sites.
>
> Is there any faster way to do this?
>
> Here's the code that I'm thinking of:
>
> library(BSgenome.Hsapiens.UCSC.hg18)
> c1<-Hsapiens$chr1
> m<-matchPatterns('CTAG', c1)
>
> #And the for loop
> reslt<-NULL
> for (i in 1:(length(m)-1)) {
> reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3,
> getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2)))
> }
>
> It seems like this code is no where near perfect. Please
> give me some input.
>
>
> Yes looping on an 623303-element vector at the R level will
> take a long
> time. To "invert" the views of an XStringViews object i.e.
> to make the
> current gaps between the views become the new views and the
> current views
> become the new gaps, just do:
>
> mgaps <- gaps(m)
>
> This is very fast! (< 0.3 sec. on my machine)
>
> Then if you really want this as a standard character vector, do:
>
> mgaps2 <- as.character(mgaps)
>
> but, unfortunately this is still very slow at the moment
> because the
> main loop in the "as.character" method for XStringViews
> objects is still
> written in R and not in C (we need to change this ASAP).
>
> You could also use the getSeq() function in a vectorized way
> if you
> really wanted to use getSeq(). Then your code would look like:
>
> reslt_start <- end(m)[-length(m)] + 1L
> reslt_end <- start(m)[-1] - 1L
> reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2,
> start(m)[i+1]-2))
>
> at least in theory... but I realized that there are a few
> problems
> with the current getSeq() implementation that I need to
> address too,
> sorry!
>
> Note that you'll need the latest available version of
> Biostrings in order
> to use the gaps() function.
>
> Cheers,
> H.
>
>
>
> Thanks
>
> --
> Regards,
> Anh Tran
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> <mailto:Bioconductor at stat.math.ethz.ch>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> <mailto:Bioconductor at stat.math.ethz.ch>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
>
> --
> Regards,
> Anh Tran
>
>
>
>
> --
> Regards,
> Anh Tran
More information about the Bioconductor
mailing list