[BioC] trim adapter with mismatch tolerance
Harris A. Jaffee
hj at jhu.edu
Tue Dec 20 22:00:44 CET 2011
Another way, below.
On Dec 19, 2011, at 4:14 PM, Harris A. Jaffee wrote:
> See below.
>
> On Dec 19, 2011, at 2:43 PM, Kunbin Qu wrote:
>> Harris, the followings are my code and intention and thanks a lot
>> for your help!
>>
>> For read 2, it will be trimmed
>> For read 3, there is a deletion at the 4th bp, CTG-CC, it was not
>> trimmed by my code, but I wanted to
>> For read 4, there are two mismatches starting from the 4th bp:
>> CTGCAC, comparing with CTGGCC. I wanted this to be trimmed too,
>> but it did not
>>
>> -Kunbin
>>
>>
>> Reads input:
>>
>> @SEA055486C_0096:7:1101:2898:2204#ACAGTG/1
>> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCCCGCAGAGGGTTGTATTGGTTCGGCACCATGCCGCT
>> CTGCAGCCGGGACAGCCACTCGCA
>> +
>> gggggiiihihiiiiiiiiiiiihiiihiiiiiiigeecW`cc
>> \`aabcdcccccccccccccccccaaccccccccccaaccccccccbcc]
>> @SEA055486C_0096:7:1101:3620:2209#ACAGTG/1
>> CTGGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTG
>> GGCTC
>> +
>> ccccc_ad`[``ed_bRP^aZbccbbc\_cccccchhhdbb`bM\`\``bbc]V^RZ^____\^\^W
>> [YY^^^^
>> @SEA055486C_0096:7:1101:3316:2429#ACAGTG/1
>> CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG
>> GCTCCAGCAGCGGAGGGGCCCTG
>> +
>> gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi
>> [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]a
>> @SEA055486C_0096:7:1101:3316:2439#ACAGTG/1
>> CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTG
>> GGCTCCAGCAGCGGAGGGGCCCTG
>> +
>> gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi
>> [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]aa
>>
>>
>> code:
>>
>> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC")
>> reads<-readFastq("t2.fq")
>> seqs<-sread(reads)
>> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs,
>> max.Lmismatch = 1, ranges=T)
>> seqsAR <- DNAStringSet(seqs, start=start(trimCoordsAR), end=end
>> (trimCoordsAR))
>> seqsAR
>> A DNAStringSet instance of length 5
>> width seq
>> [1] 93
>> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGC
>> A
>> [2] 49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC
>> [3] 92
>> CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCA...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCT
>> G
>> [4] 93
>> CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCC...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCT
>> G
>
> Basically, you just need (1) to explicitly allow indels, since
> unfortunately the
> default is not to allow them, and (2) your mismatch tolerance isn't
> large enough.
>
> To get your feet wet, start with some exploratory data analysis
> such as:
>
> > neditStartingAt(starting.at=1, pattern=adapter, subject=seqs)
> [,1] [,2] [,3] [,4]
> [1,] 16 0 15 2
>
> > neditStartingAt(starting.at=1, pattern=adapter, subject=seqs,
> with.indels=TRUE)
> [,1] [,2] [,3] [,4]
> [1,] 12 0 1 2
Or just use trimLRPatterns itself with very liberal matching, for
example
trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch = 6,
with.Lindels=TRUE)
If your reads can contain partial adaptors, set max.Lmismatch to a
"rate":
trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch = .2,
with.Lindels=TRUE)
> So then, aside from possible non-calls ("N"), of which there aren't
> any in these
> 4 reads, we can now see pretty much what we need at least for reads
> 2, 3, and 4:
>
> > trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch=2,
> with.Lindels=TRUE)
> A DNAStringSet instance of length 4
> width seq
> [1] 93
> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA
> [2] 49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC
> [3] 67
> TCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
> [4] 68
> GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
>
> Read 1 could be more subtle, or (more likely, I guess) just does
> not contain any
> adapter to trim in the first place.
>
>>> sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>> other attached packages:
>> [1] ShortRead_1.12.3 latticeExtra_0.6-19 RColorBrewer_1.0-5
>> [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0
>> [7] GenomicRanges_1.6.4 IRanges_1.12.5
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0
>> grid_2.14.0
>> [5] hwriter_1.3 RCurl_1.8-0 rtracklayer_1.14.4
>> tools_2.14.0
>> [9] XML_3.6-2 zlibbioc_1.0.0
>>>
>>
>> -----Original Message-----
>> From: Harris A. Jaffee [mailto:hj at jhu.edu]
>> Sent: Monday, December 19, 2011 4:58 AM
>> To: Kunbin Qu
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] trim adapter with mismatch tolerance
>>
>> Please give a read example and exactly what you want to happen and
>> not happen, and, optionally, exactly you tried and what did
>> happen, along with the output of sessionInfo() in case that may be
>> relevant. The spec and examples in the Biostrings::trimLRPatterns
>> help page are not ideal.
>>
>> You are correct that a single integer in max.Lmismatch should
>> prevent matches/trimming by anything but the entire Lpattern.
>>
>> I can't tell whether you are saying that you need
>> with.Lindels=TRUE or Lfixed="pattern".
>>
>> On Dec 18, 2011, at 11:38 PM, Kunbin Qu wrote:
>>
>>> Hi,
>>>
>>> I have some reads with 25 bp adapter at the five prime end. Some of
>>> the reads do not have perfect match with the adapter, even with some
>>> indels. I tried to use trimLRPatterns to trim them off, but I got
>>> confused with the max.Lmismatch parameter, as it is doing sub-
>>> string
>>> trimming too. The adapter in my reads are the full length of 25 bp,
>>> without shortening to any substring. How should I specify the
>>> parameter then? Or maybe I could use some other functions from
>>> ShortRead? Please help. Thanks.
>>>
>>> -Kunbin
>>>
>>> code I used, but this code cannot accommodate the mismatches.
>>>
>>> reads<-readFastq("test.fq")
>>> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC")
>>> seqs<-sread(reads)
>>> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs,
>>> max.Lmismatch =1, ranges=T)
>>>
>>> ____________________________________________________________________
>>> __
>>> The contents of this electronic message, including any attachments,
>>> are intended only for the use of the individual or entity to which
>>> they are addressed and may contain confidential information. If you
>>> are not the intended recipient, you are hereby notified that any
>>> use,
>>> dissemination, distribution, or copying of this message or any
>>> attachment is strictly prohibited. If you have received this
>>> transmission in error, please send an e-mail to
>>> postmaster at genomichealth.com and delete this message, along with any
>>> attachments, from your computer.
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/
>>> gmane.science.biology.informatics.conductor
>>
>>
>> _____________________________________________________________________
>> _
>> The contents of this electronic message, including any
>> attachments, are intended only for the use of the individual or
>> entity to which they are addressed and may contain confidential
>> information. If you are not the intended recipient, you are hereby
>> notified that any use, dissemination, distribution, or copying of
>> this message or any attachment is strictly prohibited. If you have
>> received this transmission in error, please send an e-mail to
>> postmaster at genomichealth.com and delete this message, along with
>> any attachments, from your computer.
>
More information about the Bioconductor
mailing list