[Bioc-sig-seq] adapter removal

Wed Jan 14 18:36:32 CET 2009

David,
Following up on Martin's comments, I am putting the finishing touches on 
a function called trimLRPatterns for the Biostrings package. Its purpose 
is to trim left and/or right flanking patterns from sequences, so it can 
strip 5' and/or 3' adapters from your reads. The signature for this 
function is

  trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject, max.Lnedit=0, 
max.Rnedit=0,
                 with.Lindels=FALSE, with.Rindels=FALSE, Lfixed=TRUE, 
Rfixed=TRUE,
                 rangesOnly = FALSE)

I will be checking this function into the BioC 2.4 code line, which 
requires using R-devel, sometime today or tomorrow. I will send out an 
e-mail to this group when I check it in and show a simple example of its 
usage. I talked with Martin and he will wrap this functionality in the 
ShortRead layer so you don't have to leave the ShortRead class system 
when removing adapters from your reads.

Cheers,
Patrick

Martin Morgan wrote:
> "David A.G" <dasolexa at hotmail.com> writes:
>
>   
>> hi Martin, thanks for your quick answer.
>> I can't get round to the fact that using srdistance() will remove the whole
>> read, instead of just editing the adapter off and keeping the remaining
>> fragment of the read, which it is useful information. That is the difference I
>> see between using srdistance() and the approach using pairwiseAlignment(). The
>> second, as it is shown in the Biostrings vignette using DNAStringSets, will
>> only edit the adapter sequence from the read. I may be wrong but the problem I
>> find is that I cannot work back from a DNAStringSet to a ShortReadQ object
>> that will include quality calls for the remaining bases of the edited reads,
>> and thus ending up with a full clean ShortReadQ object that can be transformed
>> back to fastq format.
>>     
>
> Yes you're right the context I'm thinking of is that one wants to
> remove reads ('filter') that have adapters, rather than removing the
> adapter sequence ('edit'?). My thinking is that the reads are short
> and numerous anyway, so little information is lost even when 10's of
> thousands of reads are filtered, and the filtered reads are more
> suspect than normal anyway because of the presence of the
> adapter. This might be relatively naive.
>
> I think Patrick's plan is to handle editing more comprehensively, but
> there's still lots of flexibility. Here's a (limited testing!)
> function that creates a ShortReadQ object by editing out those reads
> with srdistance between 'x' and 'string' greater than 'threshold', for
> a window onto x from 'start' to 'end'.
>
> trim <-
>     function(x, string, threshold, start, end)
> {
>     ShortReadQ <- function(id, sread, quality, ...)
>         new("ShortReadQ", quality=quality, sread=sread, id=id, ...)
>     sreads <- sread(x)
>     d <- srdistance(DNAStringSet(sreads, start, end), string)[[1]]
>     drop <- d <= threshold
>     starts <- rep(1, length(x))
>     starts[drop] <- end + 1
>     ends <- width(sreads)
>     ShortReadQ(id(x),
>                DNAStringSet(sreads, starts, ends),
>                SFastqQuality(BStringSet(quality(quality(x)), starts, ends)))
> }
>
>   
>> example(readFastq)
>>     
> [snip]
>   
>> fq <- trim(rfq, "GAG", 0, 1, 3)
>> fq
>>     
> class: ShortReadQ
> length: 256 reads; width: 36 33 cycles
>   
>> sread(fq)[width(sread(fq)) == 33] # trimmed
>>     
>   A DNAStringSet instance of length 15
>      width seq
>  [1]    33 AAGTTAATGGATGAATTGGCACAATGCTACAAT
>  [2]    33 TTTGTATCTGTTACTGAGAAGTTAATGGATGAA
>  [3]    33 GCTTGCGTTTATGGTACGCTGGTCTTTGTATGT
>  [4]    33 GATAAATTATGTCTAATATTCAAACTTGCGCCG
>  [5]    33 AAATAAAAGTCTGAAACATGATTAAACTCCTAA
>  [6]    33 ATTATTTGTCTCCAGCCACTTAAGTGAGGTGAT
>  [7]    33 CAGAAGCAATACCGCCAGCAATAGCACCAAACA
>  [8]    33 TTTATTGCTGCCGTCATTGCTTATTATGTTCAT
>  [9]    33 CTTCTCGAGCTGCGCAAGGATAGGTCGGATTTT
> [10]    33 TTGTTCCATTCTTTAGCTCCTAGACCTTTATCA
> [11]    33 CGTATGCCGCATGACCTTTCCCATCTTGGCTTC
> [12]    33 TATCCTTTCCTTTATCAGCGGCAGACTTGCCCC
> [13]    33 GAAGCATCAGCACCAGCACGCTCCCAAGCATTA
> [14]    33 TGGTCGGCAGATTGCGCTAAACGGTCACATTAA
> [15]    33 TTGTTCCATTCTTTAGCTCCTAGACCTTTAGCA
>   
>> sread(rfq)[width(sread(fq)) == 33] # original
>>     
>   A DNAStringSet instance of length 15
>      width seq
>  [1]    36 GAGAAGTTAATGGATGAATTGGCACAATGCTACAAT
>  [2]    36 GAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAA
>  [3]    36 GAGGCTTGCGTTTATGGTACGCTGGTCTTTGTATGT
>  [4]    36 GAGGATAAATTATGTCTAATATTCAAACTTGCGCCG
>  [5]    36 GAGAAATAAAAGTCTGAAACATGATTAAACTCCTAA
>  [6]    36 GAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGAT
>  [7]    36 GAGCAGAAGCAATACCGCCAGCAATAGCACCAAACA
>  [8]    36 GAGTTTATTGCTGCCGTCATTGCTTATTATGTTCAT
>  [9]    36 GAGCTTCTCGAGCTGCGCAAGGATAGGTCGGATTTT
> [10]    36 GAGTTGTTCCATTCTTTAGCTCCTAGACCTTTATCA
> [11]    36 GAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTC
> [12]    36 GAGTATCCTTTCCTTTATCAGCGGCAGACTTGCCCC
> [13]    36 GAGGAAGCATCAGCACCAGCACGCTCCCAAGCATTA
> [14]    36 GAGTGGTCGGCAGATTGCGCTAAACGGTCACATTAA
> [15]    36 GAGTTGTTCCATTCTTTAGCTCCTAGACCTTTAGCA
>
>   
>> Regarding appending two ShortReadQ objects, the idea came up early in this
>> thread when I run out of computer memory and I was suggested to split the
>> original data so that I could do the adapter search in small pieces. I would
>> then have to put them back together. Also, another situation where needed is
>> merging two flowcell lanes (or more) coming from the same sample when you are
>> analyzing large genomes and do not have enough coverage on one single lane. I
>> guess this could be done outside R using cat.
>>     
>
> Both of these sound pretty reasonable; others on the list might have
> comments about the pros and cons of appending flow cell lanes. Thanks
> for clarifying.
>
> One typo fixed below...
>
> Martin
>
>   
>> Dave
>>     
>>> To: dasolexa at hotmail.com
>>> CC: casioqv at usermail.com; bioc-sig-sequencing at r-project.org
>>> Subject: Re: [Bioc-sig-seq] adapter removal
>>> From: mtmorgan at fhcrc.org
>>> Date: Wed, 14 Jan 2009 06:03:26 -0800
>>>
>>> Hi Dave --
>>>
>>> "David A.G" <dasolexa at hotmail.com> writes:
>>>
>>>       
>>>> Hi Martin and list,
>>>> going back to the adapter topic, I tried the srdistance() function and it
>>>> works really quick!
>>>>         
>>> yeah it's great; Herve Pages & Patrick Aboyoun deserve credit for this.
>>>
>>>       
>>>> I have a question regarding the edit distance. After having used tables()
>>>>         
>> I
>>     
>>>> find that the second top element is CTGGACTTTGTAGGATACCCTCGCTTTCCTGCTC
>>>>         
>> with
>>     
>>>> 890 occurrences. The edit distance for this sequence is of course 0. But
>>>>         
>> if I
>>     
>>>> try only part of it as an adapter, CTGGACTTTGTAGGATA, the lowest edit
>>>>         
>> distance
>>     
>>>> is 17 and there are 14758 reads with this sequence.
>>>> Would you just use 17 as the distance cutoff?
>>>>         
>>> If I wanted to find just those 890 occurrences in an object 'fq' with
>>> the adapter at the start, I'd use srdistance on a DNAStringSet created
>>> from just those nucleotides, along the lines of
>>>
>>>       
>>>> d <- srdistance(DNAStringSet(sread(fq), 1, 17), "CTGGACTTTGTAGGATA")
>>>> fg[d != 0]
>>>>         
>
> should be fg[d[[1]] != 0]
>
>   
>>> It's not really clear that the adapters will always be exactly at the
>>> beginning, etc., so this might just be a starting point. Creating a
>>> new DNAStringSet is not particularly memory- or CPU-intensive, so
>>> creating it 'on the fly' as in the above example should be
>>> reasonable. I think Patrick is working on a one-stop-shopping adapter
>>> removal function in Biostrings.
>>>
>>>       
>>>> Also,is it possible to merge two ShortReadQ objects, or, having subset an
>>>> ShortReadQ object in two by indexing, how do I put them back together?
>>>>         
>>> The ability to 'append' two ShortReadQ objects was added just
>>> yesterday
>>>
>>>       
>>>> append(fq, fq)
>>>>         
>>> It should be available in the 'devel' branch using biocLite after
>>> about 1pm Seattle time, all being well; look for ShortRead version >=
>>> 1.1.33. It's possible to do this with the released version of
>>> ShortRead, and I can provide off-line hints if you'd like, but I'd
>>> recommend working with the development version of R and these
>>> packages.
>>>
>>> I'm curious to know the use case here; the need to append hasn't come
>>> up in my own work.
>>>
>>> Martin
>>>
>>>       
>>>> Cheers,
>>>> Dave
>>>>         
>>>>> To: dasolexa at hotmail.com
>>>>> CC: casioqv at usermail.com; bioc-sig-sequencing at r-project.org
>>>>> Subject: Re: [Bioc-sig-seq] adapter removal
>>>>> From: mtmorgan at fhcrc.org
>>>>> Date: Fri, 9 Jan 2009 19:19:18 -0800
>>>>>
>>>>> "David A.G" <dasolexa at hotmail.com> writes:
>>>>>
>>>>>           
>>>>>> Jim, thanks for pointing that out, I hadn´t faced that issue yet. The
>>>>>>             
>>>> "reads" object I created only had the sequences and not the qualities, and
>>>> after editing the adapters it is interesting to only have the quality
>>>>         
>> calls
>>     
>>>> for the remaining bases.
>>>>         
>>>>>> Tyler and Martin, thanks for the information on writing back to fastq.
>>>>>>             
>>>>> There is a now an initial version of writeFastq in the 'devel' version
>>>>> of ShortRead. It is memory and time efficient; try
>>>>>
>>>>>           
>>>>>> example(writeFastq)
>>>>>> ?writeFastq
>>>>>>             
>>>>> please let me know if there are problems, or other missing parts of
>>>>> ShortRead.
>>>>>
>>>>> Martin
>>>>>
>>>>>           
>>>>>> Dave
>>>>>>
>>>>>>             
>>>>>>> Date: Thu, 8 Jan 2009 15:12:01 -0800
>>>>>>> From: casioqv at usermail.com
>>>>>>> To: bioc-sig-sequencing at r-project.org
>>>>>>> Subject: Re: [Bioc-sig-seq] adapter removal
>>>>>>>
>>>>>>> Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very
>>>>>>>               
>>>> slow...
>>>>         
>>>>>>> exportFastaQ <- function(shortReadQObject, fileName) {
>>>>>>> ## dump ShortReadQ object data to FASTQ format
>>>>>>> entries <- length(shortReadQObject)
>>>>>>> i <- 1
>>>>>>> while (i <= entries){
>>>>>>> cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="",
>>>>>>> file=fileName, append=TRUE)
>>>>>>> cat("\n", file=fileName, append=TRUE)
>>>>>>> cat(as.character(sread(shortReadQObject)[i]), sep="",
>>>>>>> file=fileName, append=TRUE)
>>>>>>> cat("\n", file=fileName, append=TRUE)
>>>>>>> cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="",
>>>>>>> file=fileName, append=TRUE)
>>>>>>> cat("\n", file=fileName, append=TRUE)
>>>>>>> cat(as.character(quality(shortReadQObject)[[i]]), sep="",
>>>>>>> file=fileName, append=TRUE)
>>>>>>> cat("\n", file=fileName, append=TRUE)
>>>>>>> i <- i + 1
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Tyler William H Backman
>>>>>>> Bioinformatics Analyst
>>>>>>> Institute for Integrative Genome Biology (IIGB)
>>>>>>> Department of Botany and Plant Sciences
>>>>>>> E-mail: tyler.backman at ucr.edu
>>>>>>> 1007 Noel T. Keen Hall
>>>>>>> University of California
>>>>>>> Riverside, CA 92521
>>>>>>>
>>>>>>>               
>>>>>>>> Hi Jim --
>>>>>>>>
>>>>>>>> "James W. MacDonald" <jmacdon at med.umich.edu> writes:
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> As a follow-up question, once you have removed the adapter
>>>>>>>>>                   
>> sequences
>>     
>>>>>>>>> is it then possible to write the ShortReadQ object back out in a
>>>>>>>>>                   
>> fastq
>>     
>>>>>>>>> format for aligning with e.g., bowtie? I know I can use
>>>>>>>>> write.XstringSet() to output in fasta format, but I would like to
>>>>>>>>>                   
>> feed
>>     
>>>>>>>>> bowtie the quality scores as well.
>>>>>>>>>                   
>>>>>>>> Actually a 'writeFastq' is on my 'do this week' list. In the mean
>>>>>>>>                 
>> time
>>     
>>>>>>>> my best (untested) suggestion is along the lines of
>>>>>>>>
>>>>>>>> n <- length(fq)
>>>>>>>> lines <- character(n)
>>>>>>>> lines[(seq_len(n)) * 4 - 3] <-
>>>>>>>> paste("@", as.character(id(fq)), sep="")
>>>>>>>> lines[(seq_len(n)) * 4 - 2] <-
>>>>>>>> as.character(sread(rfq1))
>>>>>>>> lines[(seq_len(n)) * 4 - 1] <-
>>>>>>>> paste("+", as.character(id(fq)), sep="")
>>>>>>>> lines[(seq_len(n)) * 4] <-
>>>>>>>> as.character(quality(quality(fq)))
>>>>>>>> writeLines(lines, "my.fastq")
>>>>>>>>
>>>>>>>> This will be unnecessarily memory intensive.
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> Or is the speed reasonably comparable to use the PDict functions in
>>>>>>>>> Biostrings to do the aligning?
>>>>>>>>>                   
>>>>>>>> Bowtie is very fast. One potential 'gotcha' is the orientation of
>>>>>>>> reads throughout the workflow -- Eland _sequence.txt files are the
>>>>>>>> reads as they come from the machine; MAQ (and I think Bowtie)
>>>>>>>>                 
>> aligned
>>     
>>>>>>>> files report reads that align to the '-' strand in their reverse
>>>>>>>> complement sequence, with qualities reversed to match the reverse
>>>>>>>> compelement read, and with the alignment position recorded at the
>>>>>>>> 'left-most' (rather than 5', say) end. Not that it will be a problem
>>>>>>>> in the current use case, just something to be aware of as you move
>>>>>>>> between tools.
>>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Jim
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Martin Morgan wrote:
>>>>>>>>>                   
>>>>>>>>>> "David A.G" <dasolexa at hotmail.com> writes:
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> Dear list,
>>>>>>>>>>>
>>>>>>>>>>> I have some experience with Bioconductor but am newbie to this
>>>>>>>>>>>                       
>> list
>>     
>>>>>>>>>>> and to NGS. I am trying to remove some adapters from my solexa
>>>>>>>>>>> s_N_sequence.txt file using Biostrings and ShortRead packages and
>>>>>>>>>>>                       
>>>> the
>>>>         
>>>>>>>>>>> vignettes. I managed to read in the text file and got to save the
>>>>>>>>>>> reads as follows
>>>>>>>>>>>
>>>>>>>>>>> fqpattern <- "s_4_sequence.txt"
>>>>>>>>>>> f4 <- file.path(analysisPath(sp), fqpattern)
>>>>>>>>>>> fq4 <- readFastq(sp, fqpattern)
>>>>>>>>>>> reads <- sread(fq4) #"reads" contains more than 4 million
>>>>>>>>>>>                       
>> 34-length
>>     
>>>>>>>>>>> fragments
>>>>>>>>>>>
>>>>>>>>>>> Having the following adapter sequence:
>>>>>>>>>>>
>>>>>>>>>>> adapter <- DNAString("ACGGATTGTTCAGT")
>>>>>>>>>>>
>>>>>>>>>>> I tried to mimic the example in the Biostring vignette as
>>>>>>>>>>>                       
>> follows:
>>     
>>>>>>>>>>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type =
>>>>>>>>>>>                       
>>>> "overlap")
>>>>         
>>>>>>>>>>> but after more than two hours the process is still running.
>>>>>>>>>>>
>>>>>>>>>>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24)
>>>>>>>>>>>                       
>> with
>>     
>>>>>>>>>>> 4Gb RAM, and I only have some 30Mb free RAM left. I found a
>>>>>>>>>>>                       
>> thread
>>     
>>>> on
>>>>         
>>>>>>>>>>> adapter removal but does not clear things much to me, since as
>>>>>>>>>>>                       
>> far
>>     
>>>> as
>>>>         
>>>>>>>>>>> I understood the option mentioned in the thread is not
>>>>>>>>>>>                       
>> appropriate
>>     
>>>>>>>>>>> (quote :(though apparently this is not entirely satisfactory, see
>>>>>>>>>>>                       
>>>> the
>>>>         
>>>>>>>>>>> second entry!)).
>>>>>>>>>>>
>>>>>>>>>>> Is this just a memory issue or am I doing something wrong? Shall
>>>>>>>>>>>                       
>> I
>>     
>>>>>>>>>>> leave the process to run for longer?
>>>>>>>>>>>                       
>>>>>>>>>> As a bit of a follow-up, a reasonable strategy is to figure out
>>>>>>>>>>                     
>> how
>>     
>>>>>>>>>> long / how much memory the calculation takes on smaller chunks of
>>>>>>>>>> data.
>>>>>>>>>> Also, here I 'clean' (remove reads with 'N's) and then calculate
>>>>>>>>>>                     
>> the
>>     
>>>>>>>>>> srdistance to your adapter on a lane with 5978132 reads in about a
>>>>>>>>>> minute
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> fq <- readFastq(sp, "s_1_sequence.txt")
>>>>>>>>>>> fq
>>>>>>>>>>>                       
>>>>>>>>>> class: ShortReadQ
>>>>>>>>>> length: 5978132 reads; width: 53 cycles
>>>>>>>>>>                     
>>>>>>>>>>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
>>>>>>>>>>>                       
>>>>>>>>>> user system elapsed 60.960 0.000 60.961 I might then
>>>>>>>>>> tabulate the distances
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> table(dist[[1]])
>>>>>>>>>>>                       
>>>>>>>>>> and subset fq based on some criterion
>>>>>>>>>> fq[dist[[1]] > 4]
>>>>>>>>>> There's a second interface to this, too, using filters, e.g.,
>>>>>>>>>> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
>>>>>>>>>> Martin
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> TIA for your help,
>>>>>>>>>>>
>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>> _________________________________________________________________
>>>>>>>>>>> Show them the way! Add maps and directions to your party invites.
>>>>>>>>>>>
>>>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>>>>>                       
>>>>>>>>> --
>>>>>>>>> James W. MacDonald, M.S.
>>>>>>>>> Biostatistician
>>>>>>>>> Hildebrandt Lab
>>>>>>>>> 8220D MSRB III
>>>>>>>>> 1150 W. Medical Center Drive
>>>>>>>>> Ann Arbor MI 48109-5646
>>>>>>>>> 734-936-8662
>>>>>>>>>                   
>>>>>>>> --
>>>>>>>> Martin Morgan
>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>
>>>>>>>> Location: Arnold Building M2 B169
>>>>>>>> Phone: (206) 667-2793
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>> _______________________________________________
>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>               
>>>>>> _________________________________________________________________
>>>>>> Show them the way! Add maps and directions to your party invites.
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-sig-sequencing mailing list
>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>             
>>>>> --
>>>>> Martin Morgan
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M2 B169
>>>>> Phone: (206) 667-2793
>>>>>           
>>>>         
>> ------------------------------------------------------------------------------
>>     
>>>> Get news, entertainment and everything you care about at Live.com. [[Check
>>>>         
>> it
>>     
>>>> out!]]
>>>>         
>>> --
>>> Martin Morgan
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M2 B169
>>> Phone: (206) 667-2793
>>>       
>> ------------------------------------------------------------------------------
>>
>> check out the rest of the Windows Live™. More than mail–Windows Live™ goes way
>> beyond your inbox. More than messages
>>     
>
>