[Bioc-sig-seq] adapter removal

Victor Ruotti ruotti at wisc.edu
Sat Jan 31 00:57:34 CET 2009


Maybe a not a straight topic related thing to adapter removal...  
However, do you guys have a simple way to split the fastq files once  
loaded into biostrings?
Say you load a whole lane and now want to split 12M reads and their  
qualities into multiple fastq files from processing with Maq, gmap, etc?
Perhaps this should be done outside or R? Just wanted to see what you  
think...
See below
...

 From the Grid Engine Life Sciense SIG

Can you share the program you use to split the reads?
I'm the process of writing a fasta/fastq/seq/prb splitter. I use Perl  
and thought about starting a bioperl module for next gen stuff.
They already have a bunch of modules to deal with qualities, so I was  
thinking adding a method to split fastq files for maq/gmap processing  
would be something good to have in bioperl. Great work had also being  
done with biostrings and maybe Martin can comment on this.

As simple as it might sound it would be good to have bioperl and maybe  
biostrings setup for this.
Will try to post this in the biostrings forum as well.
Any thoughts/interest on this?

Victor


On Jan 18, 2009, at 7:59 PM, Patrick Aboyoun wrote:

> Kasper,
> Yes, but there is between 12 - 36 delay between an svn checkin and a  
> package being available at bioconductor.org.
>
>
> Patrick
>
>
> Quoting Kasper Daniel Hansen <khansen at stat.berkeley.edu>:
>
>> Shouldn't biocLite pick up recent additions to the subversion
>> repository, provided that you are using R-devel and you install using
>> pkgType = "source"?
>>
>> Kasper
>>
>> On Jan 17, 2009, at 19:24 , Patrick Aboyoun wrote:
>>
>>> Joe,
>>> I have been making some modifications to trimLRPatterns both  
>>> today  and in recent days, so you may need to get the latest  
>>> version of  Biostrings directly from svn rather than using  
>>> biocLite from within  R. Once you have a recently sufficient  
>>> version, the key is in the  construction of the max.Rmismatch  
>>> argument. Below are some examples  they achieve the result you are  
>>> looking for. The man page for  trimLRPatterns has a detailed  
>>> description on various types of  inputs that are accepted by the  
>>> max.Rmismatch argument.
>>>
>>>
>>>> suppressMessages(library(Biostrings))
>>>> Rpattern <- "CTGTAGGCACCA"
>>>> subjectSet <-
>>> + DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
>>> +                "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
>>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>>> +                max.Rmismatch = rep(2, 12))
>>> A DNAStringSet instance of length 2
>>>  width seq
>>> [1]    22 GCTGGAACCCAGGGTGTTGTAC
>>> [2]    24 GTAAGACCATACTTGGCCGAATGC
>>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>>> +                max.Rmismatch = 0.2)
>>> A DNAStringSet instance of length 2
>>>  width seq
>>> [1]    22 GCTGGAACCCAGGGTGTTGTAC
>>> [2]    24 GTAAGACCATACTTGGCCGAATGC
>>>> sessionInfo()
>>> R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
>>> i386-apple-darwin9.6.0
>>>
>>> locale:
>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.11.25 IRanges_1.1.34
>>>
>>> loaded via a namespace (and not attached):
>>> [1] grid_2.9.0         lattice_0.17-20    Matrix_0.999375-17
>>>
>>>
>>> Patrick
>>>
>>>
>>> Quoting joseph franklin <joseph.franklin at yale.edu>:
>>>
>>>> Patrick,
>>>>
>>>> This adapter tool looks extremely useful for my purposes: removing
>>>> adapters from smRNA reads to estimate the short template lengths.
>>>> Forgive me if the answer to this is obvious, but everything seems  
>>>> to
>>>> work with trimLRPatterns, except that it doesn't seem to allow the
>>>> Rpattern or Lpattern to slide along the sequence (at least using  
>>>> the
>>>> default settings--see below).  Rather it looks only for exact  
>>>> matches,
>>>> that leave no overhang.  Thus:
>>>>
>>>>> Rpattern <- "CTGTAGGCACCA"
>>>>
>>>> trims:
>>>>
>>>> [6]    34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA
>>>>
>>>> nicely, to:
>>>>
>>>> [6]    22 GCTGGAACCCAGGGTGTTGTAC
>>>>
>>>>
>>>> but a sequence where resulting in an Rpattern overhang (here ~2nt):
>>>>
>>>> [90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>>
>>>> is not trimmed at all:
>>>>
>>>> [90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>>                                                    :
>>>>
>>>> What can I do to allow for flexibility at the overhanging end?
>>>>
>>>>
>>>> Again, thanks very much.
>>>> Joe
>>>>
>>>>
>>>> On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:
>>>>
>>>> I just checked in a trimLRPatterns function to the Bioconductor svn
>>>> repository for BioC 2.4. Its signature is
>>>>
>>>> trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
>>>>             max.Lmismatch = 0, max.Rmismatch = 0,
>>>>             with.Lindels = FALSE, with.Rindels = FALSE,
>>>>             Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
>>>>
>>>> As you can infer from the arguments, this function allows the  
>>>> user to
>>>> set the # of mismatches (if with.*indels = FALSE) / edit distance  
>>>> (if
>>>> with.*indels = TRUE) for the left and right flanking "patterns". It
>>>> also allows for IUPAC ambiguity letters in these flanking regions  
>>>> if
>>>> *fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the  
>>>> trimmed
>>>> strings. When ranges = TRUE, it returns the ranges that you can  
>>>> use to
>>>> trim the strings. Here are some examples:
>>>>
>>>>> Lpattern <- "TTCTGCTTG"
>>>>> Rpattern <- "GATCGGAAG"
>>>>> subject <- DNAString("TTCTGCTTGACGTGATCGGA")
>>>>> subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG",  
>>>>> "TTCTGCTTGGATCGGAAG"))
>>>>> trimLRPatterns(Lpattern = Lpattern, subject = subject)
>>>> 11-letter "DNAString" instance
>>>> seq: ACGTGATCGGA
>>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject =
>>>> subjectSet)
>>>> A DNAStringSet instance of length 2
>>>> width seq
>>>> [1]    18 TGCTTGACGGCAGATCGG
>>>> [2]     0
>>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject =
>>>> subjectSet,
>>>> +                  ranges = TRUE)
>>>> IRanges object:
>>>> start end width
>>>> 1     1  18    18
>>>> 2    10   9     0
>>>>
>>>> This functionality will be available on bioconductor.org (and
>>>> downloadable via biocLite) in the next day or so, but you can  
>>>> also grab
>>>> Biostrings from svn directly if you need it sooner. It will also  
>>>> feed
>>>> its way into Biostrings documentation and training material  
>>>> before the
>>>> next release of Bioconductor in May.
>>>>
>>>>
>>>> Patrick
>>>>
>>>>
>>>>
>>>> Patrick Aboyoun wrote:
>>>>> David,
>>>>> Following up on Martin's comments, I am putting the finishing    
>>>>> touches on a function called trimLRPatterns for the Biostrings    
>>>>> package. Its purpose is to trim left and/or right flanking   
>>>>> patterns  from sequences, so it can strip 5' and/or 3' adapters   
>>>>> from your  reads. The signature for this function is
>>>>>
>>>>> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject,   
>>>>> max.Lnedit=0,  max.Rnedit=0,
>>>>>            with.Lindels=FALSE, with.Rindels=FALSE,  
>>>>> Lfixed=TRUE,   Rfixed=TRUE,
>>>>>            rangesOnly = FALSE)
>>>>>
>>>>> I will be checking this function into the BioC 2.4 code line,   
>>>>> which  requires using R-devel, sometime today or tomorrow. I  
>>>>> will  send out  an e-mail to this group when I check it in and  
>>>>> show a  simple  example of its usage. I talked with Martin and  
>>>>> he will  wrap this  functionality in the ShortRead layer so you  
>>>>> don't have  to leave the  ShortRead class system when removing  
>>>>> adapters from  your reads.
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Patrick
>>>>>
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



More information about the Bioc-sig-sequencing mailing list