[BioC] trim adapter with mismatch tolerance

Mon Dec 19 20:43:21 CET 2011

Harris, the followings are my code and intention and thanks a lot for your help!

For read 2, it will be trimmed
For read 3, there is a deletion at the 4th bp, CTG-CC, it was not trimmed by my code, but I wanted to
For read 4, there are two mismatches starting from the 4th bp: CTGCAC, comparing with CTGGCC. I wanted this to be trimmed too, but it did not

-Kunbin

Reads input:

@SEA055486C_0096:7:1101:2898:2204#ACAGTG/1
CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCCCGCAGAGGGTTGTATTGGTTCGGCACCATGCCGCTCTGCAGCCGGGACAGCCACTCGCA
+
gggggiiihihiiiiiiiiiiiihiiihiiiiiiigeecW`cc\`aabcdcccccccccccccccccaaccccccccccaaccccccccbcc]
@SEA055486C_0096:7:1101:3620:2209#ACAGTG/1
CTGGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC
+
ccccc_ad`[``ed_bRP^aZbccbbc\_cccccchhhdbb`bM\`\``bbc]V^RZ^____\^\^W[YY^^^^
@SEA055486C_0096:7:1101:3316:2429#ACAGTG/1
CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
+
gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi[ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]a
@SEA055486C_0096:7:1101:3316:2439#ACAGTG/1
CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
+
gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi[ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]aa

code:

adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC")
reads<-readFastq("t2.fq")
seqs<-sread(reads)
trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch = 1, ranges=T)
seqsAR <- DNAStringSet(seqs, start=start(trimCoordsAR), end=end(trimCoordsAR))
seqsAR
  A DNAStringSet instance of length 5
    width seq
[1]    93 CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA
[2]    49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC
[3]    92 CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCA...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
[4]    93 CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCC...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] ShortRead_1.12.3    latticeExtra_0.6-19 RColorBrewer_1.0-5 
[4] Rsamtools_1.6.3     lattice_0.20-0      Biostrings_2.22.0  
[7] GenomicRanges_1.6.4 IRanges_1.12.5     

loaded via a namespace (and not attached):
 [1] Biobase_2.14.0     bitops_1.0-4.1     BSgenome_1.22.0    grid_2.14.0       
 [5] hwriter_1.3        RCurl_1.8-0        rtracklayer_1.14.4 tools_2.14.0      
 [9] XML_3.6-2          zlibbioc_1.0.0    
>

-----Original Message-----
From: Harris A. Jaffee [mailto:hj at jhu.edu] 
Sent: Monday, December 19, 2011 4:58 AM
To: Kunbin Qu
Cc: bioconductor at r-project.org
Subject: Re: [BioC] trim adapter with mismatch tolerance

Please give a read example and exactly what you want to happen and not happen, and, optionally, exactly you tried and what did happen, along with the output of sessionInfo() in case that may be relevant.  The spec and examples in the Biostrings::trimLRPatterns help page are not ideal.

You are correct that a single integer in max.Lmismatch should prevent matches/trimming by anything but the entire Lpattern.

I can't tell whether you are saying that you need with.Lindels=TRUE or Lfixed="pattern".

On Dec 18, 2011, at 11:38 PM, Kunbin Qu wrote:

> Hi,
>
> I have some reads with 25 bp adapter at the five prime end. Some of 
> the reads do not have perfect match with the adapter, even with some 
> indels.  I tried to use trimLRPatterns to trim them off, but I got 
> confused with the max.Lmismatch parameter, as it is doing sub- string 
> trimming too. The adapter in my reads are  the full length of 25 bp, 
> without shortening to any substring. How should I specify the 
> parameter then? Or maybe I could use some other functions from 
> ShortRead? Please help. Thanks.
>
> -Kunbin
>
> code I used, but this code cannot accommodate the mismatches.
>
> reads<-readFastq("test.fq")
> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC")
> seqs<-sread(reads)
> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, 
> max.Lmismatch =1, ranges=T)
>
> ______________________________________________________________________
> The contents of this electronic message, including any attachments, 
> are intended only for the use of the individual or entity to which 
> they are addressed and may contain confidential information. If you 
> are not the intended recipient, you are hereby notified that any use, 
> dissemination, distribution, or copying of this message or any 
> attachment is strictly prohibited. If you have received this 
> transmission in error, please send an e-mail to 
> postmaster at genomichealth.com and delete this message, along with any 
> attachments, from your computer.
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/ 
> gmane.science.biology.informatics.conductor

______________________________________________________________________
The contents of this electronic message, including any attachments, are intended only for the use of the individual or entity to which they are addressed and may contain confidential information. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this message or any attachment is strictly prohibited. If you have received this transmission in error, please send an e-mail to postmaster at genomichealth.com and delete this message, along with any attachments, from your computer.