[BioC] trim adapter with mismatch tolerance

Tue Dec 20 22:00:44 CET 2011

Another way, below.

On Dec 19, 2011, at 4:14 PM, Harris A. Jaffee wrote:
> See below.
>
> On Dec 19, 2011, at 2:43 PM, Kunbin Qu wrote:
>> Harris, the followings are my code and intention and thanks a lot  
>> for your help!
>>
>> For read 2, it will be trimmed
>> For read 3, there is a deletion at the 4th bp, CTG-CC, it was not  
>> trimmed by my code, but I wanted to
>> For read 4, there are two mismatches starting from the 4th bp:  
>> CTGCAC, comparing with CTGGCC. I wanted this to be trimmed too,  
>> but it did not
>>
>> -Kunbin
>>
>>
>> Reads input:
>>
>> @SEA055486C_0096:7:1101:2898:2204#ACAGTG/1
>> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCCCGCAGAGGGTTGTATTGGTTCGGCACCATGCCGCT 
>> CTGCAGCCGGGACAGCCACTCGCA
>> +
>> gggggiiihihiiiiiiiiiiiihiiihiiiiiiigeecW`cc 
>> \`aabcdcccccccccccccccccaaccccccccccaaccccccccbcc]
>> @SEA055486C_0096:7:1101:3620:2209#ACAGTG/1
>> CTGGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTG 
>> GGCTC
>> +
>> ccccc_ad`[``ed_bRP^aZbccbbc\_cccccchhhdbb`bM\`\``bbc]V^RZ^____\^\^W 
>> [YY^^^^
>> @SEA055486C_0096:7:1101:3316:2429#ACAGTG/1
>> CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG 
>> GCTCCAGCAGCGGAGGGGCCCTG
>> +
>> gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi 
>> [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]a
>> @SEA055486C_0096:7:1101:3316:2439#ACAGTG/1
>> CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTG 
>> GGCTCCAGCAGCGGAGGGGCCCTG
>> +
>> gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi 
>> [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]aa
>>
>>
>> code:
>>
>> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC")
>> reads<-readFastq("t2.fq")
>> seqs<-sread(reads)
>> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs,  
>> max.Lmismatch = 1, ranges=T)
>> seqsAR <- DNAStringSet(seqs, start=start(trimCoordsAR), end=end 
>> (trimCoordsAR))
>> seqsAR
>>   A DNAStringSet instance of length 5
>>     width seq
>> [1]    93  
>> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGC 
>> A
>> [2]    49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC
>> [3]    92  
>> CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCA...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCT 
>> G
>> [4]    93  
>> CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCC...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCT 
>> G
>
> Basically, you just need (1) to explicitly allow indels, since  
> unfortunately the
> default is not to allow them, and (2) your mismatch tolerance isn't  
> large enough.
>
> To get your feet wet, start with some exploratory data analysis  
> such as:
>
> > neditStartingAt(starting.at=1, pattern=adapter, subject=seqs)
>      [,1] [,2] [,3] [,4]
> [1,]   16    0   15    2
>
> > neditStartingAt(starting.at=1, pattern=adapter, subject=seqs,  
> with.indels=TRUE)
>      [,1] [,2] [,3] [,4]
> [1,]   12    0    1    2

Or just use trimLRPatterns itself with very liberal matching, for  
example

   trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch = 6,  
with.Lindels=TRUE)

If your reads can contain partial adaptors, set max.Lmismatch to a  
"rate":

   trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch = .2,  
with.Lindels=TRUE)

> So then, aside from possible non-calls ("N"), of which there aren't  
> any in these
> 4 reads, we can now see pretty much what we need at least for reads  
> 2, 3, and 4:
>
> > trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch=2,  
> with.Lindels=TRUE)
>   A DNAStringSet instance of length 4
>     width seq
> [1]    93  
> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA
> [2]    49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC
> [3]    67  
> TCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
> [4]    68  
> GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
>
> Read 1 could be more subtle, or (more likely, I guess) just does  
> not contain any
> adapter to trim in the first place.
>
>>> sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> other attached packages:
>> [1] ShortRead_1.12.3    latticeExtra_0.6-19 RColorBrewer_1.0-5
>> [4] Rsamtools_1.6.3     lattice_0.20-0      Biostrings_2.22.0
>> [7] GenomicRanges_1.6.4 IRanges_1.12.5
>>
>> loaded via a namespace (and not attached):
>>  [1] Biobase_2.14.0     bitops_1.0-4.1     BSgenome_1.22.0     
>> grid_2.14.0
>>  [5] hwriter_1.3        RCurl_1.8-0        rtracklayer_1.14.4  
>> tools_2.14.0
>>  [9] XML_3.6-2          zlibbioc_1.0.0
>>>
>>
>> -----Original Message-----
>> From: Harris A. Jaffee [mailto:hj at jhu.edu]
>> Sent: Monday, December 19, 2011 4:58 AM
>> To: Kunbin Qu
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] trim adapter with mismatch tolerance
>>
>> Please give a read example and exactly what you want to happen and  
>> not happen, and, optionally, exactly you tried and what did  
>> happen, along with the output of sessionInfo() in case that may be  
>> relevant.  The spec and examples in the Biostrings::trimLRPatterns  
>> help page are not ideal.
>>
>> You are correct that a single integer in max.Lmismatch should  
>> prevent matches/trimming by anything but the entire Lpattern.
>>
>> I can't tell whether you are saying that you need  
>> with.Lindels=TRUE or Lfixed="pattern".
>>
>> On Dec 18, 2011, at 11:38 PM, Kunbin Qu wrote:
>>
>>> Hi,
>>>
>>> I have some reads with 25 bp adapter at the five prime end. Some of
>>> the reads do not have perfect match with the adapter, even with some
>>> indels.  I tried to use trimLRPatterns to trim them off, but I got
>>> confused with the max.Lmismatch parameter, as it is doing sub-  
>>> string
>>> trimming too. The adapter in my reads are  the full length of 25 bp,
>>> without shortening to any substring. How should I specify the
>>> parameter then? Or maybe I could use some other functions from
>>> ShortRead? Please help. Thanks.
>>>
>>> -Kunbin
>>>
>>> code I used, but this code cannot accommodate the mismatches.
>>>
>>> reads<-readFastq("test.fq")
>>> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC")
>>> seqs<-sread(reads)
>>> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs,
>>> max.Lmismatch =1, ranges=T)
>>>
>>> ____________________________________________________________________ 
>>> __
>>> The contents of this electronic message, including any attachments,
>>> are intended only for the use of the individual or entity to which
>>> they are addressed and may contain confidential information. If you
>>> are not the intended recipient, you are hereby notified that any  
>>> use,
>>> dissemination, distribution, or copying of this message or any
>>> attachment is strictly prohibited. If you have received this
>>> transmission in error, please send an e-mail to
>>> postmaster at genomichealth.com and delete this message, along with any
>>> attachments, from your computer.
>>> 	[[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/
>>> gmane.science.biology.informatics.conductor
>>
>>
>> _____________________________________________________________________ 
>> _
>> The contents of this electronic message, including any  
>> attachments, are intended only for the use of the individual or  
>> entity to which they are addressed and may contain confidential  
>> information. If you are not the intended recipient, you are hereby  
>> notified that any use, dissemination, distribution, or copying of  
>> this message or any attachment is strictly prohibited. If you have  
>> received this transmission in error, please send an e-mail to  
>> postmaster at genomichealth.com and delete this message, along with  
>> any attachments, from your computer.
>