[Bioc-sig-seq] adapter removal

Martin Morgan mtmorgan at fhcrc.org
Sat Jan 31 05:29:39 CET 2009


Victor Ruotti <ruotti at wisc.edu> writes:

> Maybe a not a straight topic related thing to adapter removal...
> However, do you guys have a simple way to split the fastq files once
> loaded into biostrings?

you can just subset the object you've read in.

> library(ShortRead)
> example(readFastq)
[snip]
> rfq1
class: ShortReadQ
length: 256 reads; width: 36 cycles
> rfq[1:20]
class: ShortReadQ
length: 20 reads; width: 36 cycles
> writeFastq(rfq[1:2], file="/tmp/firstFive.fastq")
> readLines("/tmp/firstFive.fastq")
[1] "@HWI-EAS88_1_1_1_1001_499"           
[2] "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT"
[3] "+HWI-EAS88_1_1_1_1001_499"           
[4] "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS"
[5] "@HWI-EAS88_1_1_1_898_392"            
[6] "GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC"
[7] "+HWI-EAS88_1_1_1_898_392"            
[8] "]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK"

standard R code could be used to identify five more-or-less equal
chunks (e.g., using as.integer(cut(seq_len(length(rfq)), nchunks))) or
other operations.

Was that what you were thinking of?

Martin

> Say you load a whole lane and now want to split 12M reads and their
> qualities into multiple fastq files from processing with Maq, gmap,
> etc?
> Perhaps this should be done outside or R? Just wanted to see what you
> think...
> See below
> ...
>
>  From the Grid Engine Life Sciense SIG
>
> Can you share the program you use to split the reads?
> I'm the process of writing a fasta/fastq/seq/prb splitter. I use Perl
> and thought about starting a bioperl module for next gen stuff.
> They already have a bunch of modules to deal with qualities, so I was
> thinking adding a method to split fastq files for maq/gmap processing
> would be something good to have in bioperl. Great work had also being
> done with biostrings and maybe Martin can comment on this.
>
> As simple as it might sound it would be good to have bioperl and maybe
> biostrings setup for this.
> Will try to post this in the biostrings forum as well.
> Any thoughts/interest on this?
>
> Victor
>
>
> On Jan 18, 2009, at 7:59 PM, Patrick Aboyoun wrote:
>
>> Kasper,
>> Yes, but there is between 12 - 36 delay between an svn checkin and a
>> package being available at bioconductor.org.
>>
>>
>> Patrick
>>
>>
>> Quoting Kasper Daniel Hansen <khansen at stat.berkeley.edu>:
>>
>>> Shouldn't biocLite pick up recent additions to the subversion
>>> repository, provided that you are using R-devel and you install using
>>> pkgType = "source"?
>>>
>>> Kasper
>>>
>>> On Jan 17, 2009, at 19:24 , Patrick Aboyoun wrote:
>>>
>>>> Joe,
>>>> I have been making some modifications to trimLRPatterns both
>>>> today  and in recent days, so you may need to get the latest
>>>> version of  Biostrings directly from svn rather than using
>>>> biocLite from within  R. Once you have a recently sufficient
>>>> version, the key is in the  construction of the max.Rmismatch
>>>> argument. Below are some examples  they achieve the result you are
>>>> looking for. The man page for  trimLRPatterns has a detailed
>>>> description on various types of  inputs that are accepted by the
>>>> max.Rmismatch argument.
>>>>
>>>>
>>>>> suppressMessages(library(Biostrings))
>>>>> Rpattern <- "CTGTAGGCACCA"
>>>>> subjectSet <-
>>>> + DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
>>>> +                "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
>>>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>>>> +                max.Rmismatch = rep(2, 12))
>>>> A DNAStringSet instance of length 2
>>>>  width seq
>>>> [1]    22 GCTGGAACCCAGGGTGTTGTAC
>>>> [2]    24 GTAAGACCATACTTGGCCGAATGC
>>>>> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
>>>> +                max.Rmismatch = 0.2)
>>>> A DNAStringSet instance of length 2
>>>>  width seq
>>>> [1]    22 GCTGGAACCCAGGGTGTTGTAC
>>>> [2]    24 GTAAGACCATACTTGGCCGAATGC
>>>>> sessionInfo()
>>>> R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
>>>> i386-apple-darwin9.6.0
>>>>
>>>> locale:
>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] Biostrings_2.11.25 IRanges_1.1.34
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] grid_2.9.0         lattice_0.17-20    Matrix_0.999375-17
>>>>
>>>>
>>>> Patrick
>>>>
>>>>
>>>> Quoting joseph franklin <joseph.franklin at yale.edu>:
>>>>
>>>>> Patrick,
>>>>>
>>>>> This adapter tool looks extremely useful for my purposes: removing
>>>>> adapters from smRNA reads to estimate the short template lengths.
>>>>> Forgive me if the answer to this is obvious, but everything seems
>>>>> to
>>>>> work with trimLRPatterns, except that it doesn't seem to allow the
>>>>> Rpattern or Lpattern to slide along the sequence (at least using
>>>>> the
>>>>> default settings--see below).  Rather it looks only for exact
>>>>> matches,
>>>>> that leave no overhang.  Thus:
>>>>>
>>>>>> Rpattern <- "CTGTAGGCACCA"
>>>>>
>>>>> trims:
>>>>>
>>>>> [6]    34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA
>>>>>
>>>>> nicely, to:
>>>>>
>>>>> [6]    22 GCTGGAACCCAGGGTGTTGTAC
>>>>>
>>>>>
>>>>> but a sequence where resulting in an Rpattern overhang (here ~2nt):
>>>>>
>>>>> [90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>>>
>>>>> is not trimmed at all:
>>>>>
>>>>> [90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC
>>>>>                                                    :
>>>>>
>>>>> What can I do to allow for flexibility at the overhanging end?
>>>>>
>>>>>
>>>>> Again, thanks very much.
>>>>> Joe
>>>>>
>>>>>
>>>>> On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:
>>>>>
>>>>> I just checked in a trimLRPatterns function to the Bioconductor svn
>>>>> repository for BioC 2.4. Its signature is
>>>>>
>>>>> trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
>>>>>             max.Lmismatch = 0, max.Rmismatch = 0,
>>>>>             with.Lindels = FALSE, with.Rindels = FALSE,
>>>>>             Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
>>>>>
>>>>> As you can infer from the arguments, this function allows the
>>>>> user to
>>>>> set the # of mismatches (if with.*indels = FALSE) / edit distance
>>>>> (if
>>>>> with.*indels = TRUE) for the left and right flanking "patterns". It
>>>>> also allows for IUPAC ambiguity letters in these flanking regions
>>>>> if
>>>>> *fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the
>>>>> trimmed
>>>>> strings. When ranges = TRUE, it returns the ranges that you can
>>>>> use to
>>>>> trim the strings. Here are some examples:
>>>>>
>>>>>> Lpattern <- "TTCTGCTTG"
>>>>>> Rpattern <- "GATCGGAAG"
>>>>>> subject <- DNAString("TTCTGCTTGACGTGATCGGA")
>>>>>> subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG",
>>>>>> "TTCTGCTTGGATCGGAAG"))
>>>>>> trimLRPatterns(Lpattern = Lpattern, subject = subject)
>>>>> 11-letter "DNAString" instance
>>>>> seq: ACGTGATCGGA
>>>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject =
>>>>> subjectSet)
>>>>> A DNAStringSet instance of length 2
>>>>> width seq
>>>>> [1]    18 TGCTTGACGGCAGATCGG
>>>>> [2]     0
>>>>>> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject =
>>>>> subjectSet,
>>>>> +                  ranges = TRUE)
>>>>> IRanges object:
>>>>> start end width
>>>>> 1     1  18    18
>>>>> 2    10   9     0
>>>>>
>>>>> This functionality will be available on bioconductor.org (and
>>>>> downloadable via biocLite) in the next day or so, but you can
>>>>> also grab
>>>>> Biostrings from svn directly if you need it sooner. It will also
>>>>> feed
>>>>> its way into Biostrings documentation and training material
>>>>> before the
>>>>> next release of Bioconductor in May.
>>>>>
>>>>>
>>>>> Patrick
>>>>>
>>>>>
>>>>>
>>>>> Patrick Aboyoun wrote:
>>>>>> David,
>>>>>> Following up on Martin's comments, I am putting the finishing
>>>>>> touches on a function called trimLRPatterns for the Biostrings
>>>>>> package. Its purpose is to trim left and/or right flanking
>>>>>> patterns  from sequences, so it can strip 5' and/or 3' adapters
>>>>>> from your  reads. The signature for this function is
>>>>>>
>>>>>> trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject,
>>>>>> max.Lnedit=0,  max.Rnedit=0,
>>>>>>            with.Lindels=FALSE, with.Rindels=FALSE,  Lfixed=TRUE,
>>>>>> Rfixed=TRUE,
>>>>>>            rangesOnly = FALSE)
>>>>>>
>>>>>> I will be checking this function into the BioC 2.4 code line,
>>>>>> which  requires using R-devel, sometime today or tomorrow. I
>>>>>> will  send out  an e-mail to this group when I check it in and
>>>>>> show a  simple  example of its usage. I talked with Martin and
>>>>>> he will  wrap this  functionality in the ShortRead layer so you
>>>>>> don't have  to leave the  ShortRead class system when removing
>>>>>> adapters from  your reads.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Patrick
>>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list