[BioC] fast iterator over DNAString's?

Hervé Pagès hpages at fhcrc.org
Thu Mar 11 19:28:34 CET 2010


Hi Paul,

Paul Shannon wrote:
> I wish to trim a variable length sequence from the end of many thousands of DNAStrings in a DNAStringSet.  
> 
> The sequence to be trimmed is any recognizable chunk of a solexa short read adapter, which ends up on the end of, for example, 22nt miRNAs.  The adapter chunk might be found in the middle of a 35 base read, or it might be closer to the end.  In every case, I want to delete every base from the start of the adapter chunk to the end of the read.
> 
> I imagine there might be a BString operation equivalent to sed.  See could be used ike this:
> 
>   echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed s/TCGTATGCCGTC.*$//      --> GAAGCGGGATGATCTATC

With standard string functions:

   x <- "CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT"
   pattern <- "TCGTATGCCGTC"

   > sub(paste(pattern, ".*$", sep=""), "", x)
   [1] "CGAAGCGGGATGATCTATC"

Note that the 1st letter ("C") has disappeared in your example.

> 
> (where TCGTATGCCGTC is only part of the 21-base adapter, but is probably a long enough portion to be representative)
> 
> Any way to do this with BStrings and friends?

Here is the solution to Steve's exercise. As he suggested, the
building blocks are vmatchPattern(), startIndex() and subseq<-():

   lefttrimFromFirstPatternStart <- function(pattern, subject)
   {
     mi <- vmatchPattern(pattern, subject)
     start_at <- sapply(startIndex(mi), function(start) start[1L])
     start_at[is.na(start_at)] <- width(subject)[is.na(start_at)] + 1L
     subseq(subject, start=start_at) <- ""
     subject
   }

Note that it could be that the pattern has 0 or more than 1 hit in a
given subject element. The above function tries to accommodate this.

Then:

   x <- DNAStringSet(c(x, paste(rep.int("A", 35), collapse="")))

   > x
     A DNAStringSet instance of length 2
       width seq
   [1]    35 CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT
   [2]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

   > lefttrimFromFirstPatternStart(pattern, x)
     A DNAStringSet instance of length 2
       width seq
   [1]    19 CGAAGCGGGATGATCTATC
   [2]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Also note that vmatchPattern() lets the user have some control over
mismatches/indels (see ?vmatchPattern) so you can add this level
of control to lefttrimFromFirstPatternStart() too by adding
the 'max.mismatch', 'min.mismatch', 'with.indels' and 'fixed'
arguments to it (with the same default values as in vmatchPattern)
and just passing their values to vmatchPattern().

Cheers,
H.


> 
> Thanks!
> 
>  - Paul
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list