[Bioc-sig-seq] adapter removal

joseph franklin joseph.franklin at yale.edu
Sat Jan 17 21:04:14 CET 2009


Patrick,

This adapter tool looks extremely useful for my purposes: removing
adapters from smRNA reads to estimate the short template lengths.
Forgive me if the answer to this is obvious, but everything seems to
work with trimLRPatterns, except that it doesn't seem to allow the
Rpattern or Lpattern to slide along the sequence (at least with the
default settings; see below). Rather, it looks only for exact matches
that leave no overhang. Thus:

 > Rpattern <- "CTGTAGGCACCA"

trims:

  [6]    34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA

nicely, to:

  [6]    22 GCTGGAACCCAGGGTGTTGTAC


but a sequence in which the Rpattern overhangs the end of the read (here by 2 nt):

[90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC

is not trimmed at all:

[90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC

What can I do to allow for flexibility at the overhanging end?
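
The behaviour asked about here, letting Rpattern run off the 3' end of the read, can be pictured with a small base-R sketch. It assumes plain character strings rather than DNAString objects. The names trim_right_overhang and min_overlap are hypothetical, chosen for illustration, and are not part of Biostrings; the default of 4 is an arbitrary guard against trimming on very short chance matches.

```r
# Hypothetical helper, NOT a Biostrings function: trim a 3' adapter from a
# single read, also accepting an adapter that runs off the end of the read.
trim_right_overhang <- function(read, adapter, min_overlap = 4L) {
  n <- nchar(read)
  # Last start position that still leaves at least 'min_overlap' read bases.
  last_start <- n - min_overlap + 1L
  if (last_start < 1L) return(read)
  # Scan left to right so the longest possible overlap is found first.
  for (start in seq_len(last_start)) {
    tail_len <- n - start + 1L            # read bases from 'start' to the end
    k <- min(tail_len, nchar(adapter))    # number of bases that can be compared
    if (substr(read, start, start + k - 1L) == substr(adapter, 1L, k)) {
      # Keep only what lies to the left of the adapter start.
      return(substr(read, 1L, start - 1L))
    }
  }
  read  # no adapter (or adapter prefix) found: return the read unchanged
}

adapter <- "CTGTAGGCACCA"
# Read [6]: the full adapter is present at the 3' end.
full <- trim_right_overhang("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA", adapter)
# Read [90]: only the first 10 adapter bases fit before the read ends.
part <- trim_right_overhang("GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC", adapter)
nchar(c(full, part))  # 22 24
```

This only handles exact matches of the adapter or of an adapter prefix; allowing mismatches in the overhanging part would need an extra comparison step.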


Again, thanks very much.
Joe


On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:

I just checked in a trimLRPatterns function to the Bioconductor svn  
repository for BioC 2.4. Its signature is

trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
                max.Lmismatch = 0, max.Rmismatch = 0,
                with.Lindels = FALSE, with.Rindels = FALSE,
                Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)

As you can infer from the arguments, this function allows the user to
set the maximum number of mismatches (if with.*indels = FALSE) or the
maximum edit distance (if with.*indels = TRUE) for the left and right
flanking "patterns". It also allows for IUPAC ambiguity letters in
these flanking regions if *fixed = FALSE. When ranges = FALSE,
trimLRPatterns returns the trimmed strings. When ranges = TRUE, it
returns the ranges that you can use to trim the strings. Here are some
examples:

 >   library(Biostrings)
 >   Lpattern <- "TTCTGCTTG"
 >   Rpattern <- "GATCGGAAG"
 >   subject <- DNAString("TTCTGCTTGACGTGATCGGA")
 >   subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG"))
 >   trimLRPatterns(Lpattern = Lpattern, subject = subject)
11-letter "DNAString" instance
seq: ACGTGATCGGA
 >   trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
+                  subject = subjectSet)
A DNAStringSet instance of length 2
   width seq
[1]    18 TGCTTGACGGCAGATCGG
[2]     0
 >   trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
+                  subject = subjectSet, ranges = TRUE)
IRanges object:
  start end width
1     1  18    18
2    10   9     0
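
As a rough illustration of how the ranges = TRUE output can be used, the sketch below applies the start and end values printed above to plain character copies of the two reads with base R's substring(). The variable names are made up for the example, and the reads are held as ordinary strings here, which is an assumption; real code would more likely work on the DNAStringSet directly.

```r
# Plain character copies of the two reads in subjectSet (assumption: ordinary
# strings are used here instead of a DNAStringSet).
reads  <- c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG")

# Start/end values copied from the ranges = TRUE output above.
starts <- c(1L, 10L)
ends   <- c(18L, 9L)

# substring() is vectorised; an end smaller than the start gives "".
trimmed <- substring(reads, starts, ends)
nchar(trimmed)  # 18 0
```

The second range has end = start - 1, which is how a zero-width result (the whole read trimmed away) is represented.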

This functionality will be available on bioconductor.org (and  
downloadable via biocLite) in the next day or so, but you can also  
grab Biostrings from svn directly if you need it sooner. It will also
find its way into Biostrings documentation and training material
before the next release of Bioconductor in May.


Patrick



Patrick Aboyoun wrote:
> David,
> Following up on Martin's comments, I am putting the finishing  
> touches on a function called trimLRPatterns for the Biostrings  
> package. Its purpose is to trim left and/or right flanking patterns  
> from sequences, so it can strip 5' and/or 3' adapters from your  
> reads. The signature for this function is
>
> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject,
>                max.Lnedit=0, max.Rnedit=0,
>                with.Lindels=FALSE, with.Rindels=FALSE,
>                Lfixed=TRUE, Rfixed=TRUE,
>                rangesOnly = FALSE)
>
> I will be checking this function into the BioC 2.4 code line, which  
> requires using R-devel, sometime today or tomorrow. I will send out  
> an e-mail to this group when I check it in and show a simple example  
> of its usage. I talked with Martin and he will wrap this  
> functionality in the ShortRead layer so you don't have to leave the  
> ShortRead class system when removing adapters from your reads.
>
>
> Cheers,
> Patrick
>


