[BioC] Restriction enzyme digestion

Herve Pages hpages at fhcrc.org
Wed May 14 21:05:20 CEST 2008


Hi Anh,

Biostrings 2.8.6 (latest release version) was pushed to the public repository just a
few minutes ago:

   http://bioconductor.org/packages/2.2/bioc/html/Biostrings.html

Can you please try again and let me know if there are further problems.

Thanks!
H.


Anh Tran wrote:
> Hi,
> I updated my system to R 2.7.0 but still cannot see the update of the 
> Biostrings package (2.8.5). I'm using MacOSX and will test the Windows box.
> 
> I'm installing using the script from bioconductor:
> 
> source("http://bioconductor.org/biocLite.R")
> biocLite()
> 
> Please let me know if there's something I should be paying attention to.
> 
> Thanks.
> 
> On Tue, May 13, 2008 at 11:23 AM, Anh Tran <popophobia at gmail.com 
> <mailto:popophobia at gmail.com>> wrote:
> 
>     Thank you very much.
> 
>     I'll check it out tonight. In the mean time, I'll have to set up R
>     2.7.0. <http://2.7.0.>
> 
>     Thanks again.
> 
> 
>     On Tue, May 13, 2008 at 10:05 AM, Herve Pages <hpages at fhcrc.org
>     <mailto:hpages at fhcrc.org>> wrote:
> 
>         Hi Anh,
> 
>         I fixed a few things in Biostrings:
> 
>         - The "as.character" method for XStringViews (and XStringSet)
>         objects is much
>          faster now (about 250x faster than yesterday for your use case):
> 
> 
>            library(BSgenome.Hsapiens.UCSC.hg18)
>            c1 <- Hsapiens$chr1
>            m <- matchPattern('CTAG', c1)
>            mgaps <- gaps(m)
>            mgaps2 <- as.character(mgaps)  # now takes about 5 sec.
>         instead of 20 minutes
> 
>         - As a consequence, the getSeq() function is much faster too (it
>         uses as.character()
>          internally):
> 
>            mgaps3_start <- end(m)[-length(m)] + 1L
>            mgaps3_end <- start(m)[-1] - 1L
>            ## The empty gaps must be dropped because the 'start' and
>         'end' args of getSeq()
>            ## must verify 'start' <= 'end'
>            nonemptygaps <- mgaps3_start <= mgaps3_end
>            mgaps3_start <- mgaps3_start[nonemptygaps]
>            mgaps3_end <- mgaps3_end[nonemptygaps]
>            mgaps3 <- getSeq(Hsapiens, 'chr1', start=mgaps3_start,
>         end=mgaps3_end)
> 
>            identical(mgaps3, mgaps2)  # TRUE
> 
>         This will be available in BioC 2.2 (Biostrings 2.8.5) tonight
>         around midnight
>         (Seattle time) for R-2.7 users and in BioC 2.3 (Biostrings
>         2.9.8) around noon
>         (Seattle time). As always, install with biocLite().
> 
>         Please let me know by posting here if you run into other
>         problems. Thanks!
> 
>         H.
> 
> 
> 
>         hpages at fhcrc.org <mailto:hpages at fhcrc.org> wrote:
> 
>             Hi Anh,
> 
>             Quoting Anh Tran <popophobia at gmail.com
>             <mailto:popophobia at gmail.com>>:
> 
>                 Hi all,
> 
>                 I'm new to this mailing list and have a very simple
>                 question about digestion
>                 with restriction enzyme for the whole genome.
> 
>                 I'm using package BSgenome.Hsapiens.UCSC.hg18 to find
>                 the cut site of
>                 enzyme. Is there a faster way (ie. pre-built package) to
>                 get the fragment
>                 sequence and it's location in UCSC genome browser format
>                 (Chr#: start bp -
>                 end bp).
> 
>                 I'm planning on finding the restriction site location,
>                 then find the
>                 substring between two consecutive cut sites.
> 
>                 Is there any faster way to do this?
> 
>                 Here's the code that I'm thinking of:
> 
>                 library(BSgenome.Hsapiens.UCSC.hg18)
>                 c1<-Hsapiens$chr1
>                 m<-matchPatterns('CTAG', c1)
> 
>                 #And the for loop
>                 reslt<-NULL
>                 for (i in 1:(length(m)-1)) {
>                   reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3,
>                 getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2)))
>                 }
> 
>                 It seems like this code is no where near perfect. Please
>                 give me some input.
> 
> 
>             Yes looping on an 623303-element vector at the R level will
>             take a long
>             time. To "invert" the views of an XStringViews object i.e.
>             to make the
>             current gaps between the views become the new views and the
>             current views
>             become the new gaps, just do:
> 
>              mgaps <- gaps(m)
> 
>             This is very fast! (< 0.3 sec. on my machine)
> 
>             Then if you really want this as a standard character vector, do:
> 
>              mgaps2 <- as.character(mgaps)
> 
>             but, unfortunately this is still very slow at the moment
>             because the
>             main loop in the "as.character" method for XStringViews
>             objects is still
>             written in R and not in C (we need to change this ASAP).
> 
>             You could also use the getSeq() function in a vectorized way
>             if you
>             really wanted to use getSeq(). Then your code would look like:
> 
>              reslt_start <- end(m)[-length(m)] + 1L
>              reslt_end <- start(m)[-1] - 1L
>              reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2,
>             start(m)[i+1]-2))
> 
>             at least in theory... but I realized that there are a few
>             problems
>             with the current getSeq() implementation that I need to
>             address too,
>             sorry!
> 
>             Note that you'll need the latest available version of
>             Biostrings in order
>             to use the gaps() function.
> 
>             Cheers,
>             H.
> 
> 
> 
>                 Thanks
> 
>                 -- 
>                 Regards,
>                 Anh Tran
> 
>                    [[alternative HTML version deleted]]
> 
>                 _______________________________________________
>                 Bioconductor mailing list
>                 Bioconductor at stat.math.ethz.ch
>                 <mailto:Bioconductor at stat.math.ethz.ch>
>                 https://stat.ethz.ch/mailman/listinfo/bioconductor
>                 Search the archives:
>                  http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
>             _______________________________________________
>             Bioconductor mailing list
>             Bioconductor at stat.math.ethz.ch
>             <mailto:Bioconductor at stat.math.ethz.ch>
>             https://stat.ethz.ch/mailman/listinfo/bioconductor
>             Search the archives:
>             http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> 
> 
> 
>     -- 
>     Regards,
>     Anh Tran 
> 
> 
> 
> 
> -- 
> Regards,
> Anh Tran



More information about the Bioconductor mailing list