[BioC] Restriction enzyme digestion

hpages at fhcrc.org hpages at fhcrc.org
Tue May 13 01:35:30 CEST 2008


Hi Anh,

Quoting Anh Tran <popophobia at gmail.com>:

> Hi all,
>
> I'm new to this mailing list and have a very simple question about digestion
> with restriction enzyme for the whole genome.
>
> I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut site of
> enzyme. Is there a faster way (ie. pre-built package) to get the fragment
> sequence and it's location in UCSC genome browser format (Chr#: start bp -
> end bp).
>
> I'm planning on finding the restriction site location, then find the
> substring between two consecutive cut sites.
>
> Is there any faster way to do this?
>
> Here's the code that I'm thinking of:
>
> library(BSgenome.Hsapiens.UCSC.hg18)
> c1<-Hsapiens$chr1
> m<-matchPatterns('CTAG', c1)
>
> #And the for loop
> reslt<-NULL
> for (i in 1:(length(m)-1)) {
>    reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3,
> getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2)))
> }
>
> It seems like this code is no where near perfect. Please give me some input.

Yes looping on an 623303-element vector at the R level will take a long
time. To "invert" the views of an XStringViews object i.e. to make the
current gaps between the views become the new views and the current views
become the new gaps, just do:

   mgaps <- gaps(m)

This is very fast! (< 0.3 sec. on my machine)

Then if you really want this as a standard character vector, do:

   mgaps2 <- as.character(mgaps)

but, unfortunately this is still very slow at the moment because the
main loop in the "as.character" method for XStringViews objects is still
written in R and not in C (we need to change this ASAP).

You could also use the getSeq() function in a vectorized way if you
really wanted to use getSeq(). Then your code would look like:

   reslt_start <- end(m)[-length(m)] + 1L
   reslt_end <- start(m)[-1] - 1L
   reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, start(m)[i+1]-2))

at least in theory... but I realized that there are a few problems
with the current getSeq() implementation that I need to address too,
sorry!

Note that you'll need the latest available version of Biostrings in order
to use the gaps() function.

Cheers,
H.


>
> Thanks
>
> --
> Regards,
> Anh Tran
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:   
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list