[Bioc-sig-seq] adapter removal

Patrick Aboyoun paboyoun at fhcrc.org
Mon Jan 19 02:59:41 CET 2009


Kasper,
Yes, but there is a delay of between 12 and 36 hours between an svn
checkin and a package being available at bioconductor.org.


Patrick


Quoting Kasper Daniel Hansen <khansen at stat.berkeley.edu>:

> Shouldn't biocLite pick up recent additions to the subversion
> repository, provided that you are using R-devel and you install using
> pkgType = "source"?
>
> Kasper
>
> On Jan 17, 2009, at 19:24 , Patrick Aboyoun wrote:
>
>> Joe,
>> I have been making some modifications to trimLRPatterns both today
>> and in recent days, so you may need to get the latest version of
>> Biostrings directly from svn rather than using biocLite from within
>> R. Once you have a sufficiently recent version, the key is in the
>> construction of the max.Rmismatch argument. Below are some examples
>> that achieve the result you are looking for. The man page for
>> trimLRPatterns has a detailed description of the various types of
>> input that are accepted by the max.Rmismatch argument.
>>
>>
>>> suppressMessages(library(Biostrings))
>>> Rpattern <- "CTGTAGGCACCA"
>>> subjectSet <-
>> + DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
>> +                "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>> +                max.Rmismatch = rep(2, 12))
>> A DNAStringSet instance of length 2
>>   width seq
>> [1]    22 GCTGGAACCCAGGGTGTTGTAC
>> [2]    24 GTAAGACCATACTTGGCCGAATGC
>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>> +                max.Rmismatch = 0.2)
>> A DNAStringSet instance of length 2
>>   width seq
>> [1]    22 GCTGGAACCCAGGGTGTTGTAC
>> [2]    24 GTAAGACCATACTTGGCCGAATGC
>>> sessionInfo()
>> R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
>> i386-apple-darwin9.6.0
>>
>> locale:
>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] Biostrings_2.11.25 IRanges_1.1.34
>>
>> loaded via a namespace (and not attached):
>> [1] grid_2.9.0         lattice_0.17-20    Matrix_0.999375-17
>>
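As a rough illustration of how a per-length allowance lets the right pattern slide off the end of a read, here is a minimal base R sketch. It is not the Biostrings implementation. The helper names (count_mismatches, expand_allowance, trim_right_sketch) are hypothetical, and the rules it assumes are only one reading of the man page: entry i of the allowance applies to the first i letters of Rpattern aligned with the last i letters of the read, a single value strictly between 0 and 1 is treated as a rate, a shorter vector is padded with -1 at the low end, and the longest acceptable overlap wins. Check ?trimLRPatterns in your installed version before relying on any of this.

```r
# Hypothetical helpers, NOT part of Biostrings; base R only.

# Number of positions at which two equal-length strings differ.
count_mismatches <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Turn the user-facing argument into one allowance per overlap length.
# Assumption: a single value in (0, 1) is a rate; a shorter vector is
# padded with -1 (meaning "never accept") for the short overlaps.
expand_allowance <- function(max_mismatch, nRp) {
  if (length(max_mismatch) == 1 && max_mismatch > 0 && max_mismatch < 1) {
    return(floor(seq_len(nRp) * max_mismatch))
  }
  stopifnot(length(max_mismatch) <= nRp)
  c(rep(-1, nRp - length(max_mismatch)), max_mismatch)
}

# Trim the longest prefix of Rpattern that matches the end of the read
# within the allowance for that overlap length.
trim_right_sketch <- function(subject, Rpattern, max_mismatch = 0) {
  nRp <- nchar(Rpattern)
  n <- nchar(subject)
  if (n == 0) return(subject)
  allowance <- expand_allowance(max_mismatch, nRp)
  for (i in seq(min(nRp, n), 1)) {
    prefix <- substr(Rpattern, 1, i)
    read_end <- substr(subject, n - i + 1, n)
    if (count_mismatches(prefix, read_end) <= allowance[i]) {
      return(substr(subject, 1, n - i))
    }
  }
  subject  # nothing acceptable found: leave the read untouched
}

Rpattern <- "CTGTAGGCACCA"
reads <- c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
           "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC")

trim_all <- function(max_mismatch) {
  vapply(reads, trim_right_sketch, character(1),
         Rpattern = Rpattern, max_mismatch = max_mismatch,
         USE.NAMES = FALSE)
}

exact_only <- trim_all(0)           # only a full-length exact match trims
per_length <- trim_all(rep(2, 12))  # up to 2 mismatches at every length
by_rate    <- trim_all(0.2)         # allowance grows with overlap length
print(nchar(exact_only))  # 22 34
print(nchar(per_length))  # 22 24
print(nchar(by_rate))     # 22 24
```

Under these assumptions the scalar default of 0 only ever tests the full 12-letter pattern, which would explain why the second read is left alone, while both alternatives accept the exact 10-letter overlap. Note that in this sketch an allowance of 2 at overlap lengths 1 and 2 accepts anything, so rep(2, 12) would remove at least two trailing letters from every read.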
>>
>> Patrick
>>
>>
>> Quoting joseph franklin <joseph.franklin at yale.edu>:
>>
>>> Patrick,
>>>
>>> This adapter tool looks extremely useful for my purposes: removing
>>> adapters from smRNA reads to estimate the short template lengths.
>>> Forgive me if the answer to this is obvious, but everything seems to
>>> work with trimLRPatterns, except that it doesn't seem to allow the
>>> Rpattern or Lpattern to slide along the sequence (at least using the
>>> default settings--see below).  Rather, it looks only for exact matches
>>> that leave no overhang.  Thus:
>>>
>>>> Rpattern <- "CTGTAGGCACCA"
>>>
>>> trims:
>>>
>>> [6]    34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA
>>>
>>> nicely, to:
>>>
>>> [6]    22 GCTGGAACCCAGGGTGTTGTAC
>>>
>>>
>>> but a sequence that results in an Rpattern overhang (here ~2nt):
>>>
>>> [90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>
>>> is not trimmed at all:
>>>
>>> [90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>                                                     :
>>>
>>> What can I do to allow for flexibility at the overhanging end?
>>>
>>>
>>> Again, thanks very much.
>>> Joe
>>>
>>>
>>> On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:
>>>
>>> I just checked in a trimLRPatterns function to the Bioconductor svn
>>> repository for BioC 2.4. Its signature is
>>>
>>> trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
>>>              max.Lmismatch = 0, max.Rmismatch = 0,
>>>              with.Lindels = FALSE, with.Rindels = FALSE,
>>>              Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
>>>
>>> As you can infer from the arguments, this function allows the user to
>>> set the # of mismatches (if with.*indels = FALSE) / edit distance (if
>>> with.*indels = TRUE) for the left and right flanking "patterns". It
>>> also allows for IUPAC ambiguity letters in these flanking regions if
>>> *fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the trimmed
>>> strings. When ranges = TRUE, it returns the ranges that you can use to
>>> trim the strings. Here are some examples:
>>>
>>>> Lpattern <- "TTCTGCTTG"
>>>> Rpattern <- "GATCGGAAG"
>>>> subject <- DNAString("TTCTGCTTGACGTGATCGGA")
>>>> subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG"))
>>>> trimLRPatterns(Lpattern = Lpattern, subject = subject)
>>> 11-letter "DNAString" instance
>>> seq: ACGTGATCGGA
>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
>>> +                subject = subjectSet)
>>> A DNAStringSet instance of length 2
>>>   width seq
>>> [1]    18 TGCTTGACGGCAGATCGG
>>> [2]     0
>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
>>> +                subject = subjectSet, ranges = TRUE)
>>> IRanges object:
>>>   start end width
>>> 1     1  18    18
>>> 2    10   9     0
>>>
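To make the ranges = TRUE output above concrete, here is a small base R sketch that applies the reported start/end pairs to plain character strings. It deliberately avoids Biostrings and IRanges so that it runs anywhere. The variable names are illustrative only, and the start and end values are copied by hand from the output above rather than computed.

```r
# Plain character copies of the two sequences in subjectSet.
subjects <- c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG")

# start/end columns copied from the IRanges output shown above.
starts <- c(1, 10)
ends   <- c(18, 9)

# substring() is vectorised over all three arguments; a range whose
# end is one less than its start yields the empty string.
trimmed <- substring(subjects, starts, ends)

print(trimmed)
print(nchar(trimmed))  # 18 0, matching the width column
```

The second sequence is exactly Lpattern followed by Rpattern, so nothing is left once both are removed, which is why its range has width 0.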
>>> This functionality will be available on bioconductor.org (and
>>> downloadable via biocLite) in the next day or so, but you can also grab
>>> Biostrings from svn directly if you need it sooner. It will also feed
>>> its way into Biostrings documentation and training material before the
>>> next release of Bioconductor in May.
>>>
>>>
>>> Patrick
>>>
>>>
>>>
>>> Patrick Aboyoun wrote:
>>>> David,
>>>> Following up on Martin's comments, I am putting the finishing
>>>> touches on a function called trimLRPatterns for the Biostrings
>>>> package. Its purpose is to trim left and/or right flanking
>>>> patterns from sequences, so it can strip 5' and/or 3' adapters
>>>> from your reads. The signature for this function is
>>>>
>>>> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject,
>>>>                max.Lnedit=0, max.Rnedit=0,
>>>>                with.Lindels=FALSE, with.Rindels=FALSE,
>>>>                Lfixed=TRUE, Rfixed=TRUE,
>>>>                rangesOnly = FALSE)
>>>>
>>>> I will be checking this function into the BioC 2.4 code line,
>>>> which requires using R-devel, sometime today or tomorrow. I will
>>>> send out an e-mail to this group when I check it in and show a
>>>> simple example of its usage. I talked with Martin and he will
>>>> wrap this functionality in the ShortRead layer so you don't have
>>>> to leave the ShortRead class system when removing adapters from
>>>> your reads.
>>>>
>>>>
>>>> Cheers,
>>>> Patrick
>>>>
>>
