No, the N's are just guaranteed mismatches, unless the 'Lfixed'/'Rfixed'
arguments make them wildcards.

To prohibit matching when the pattern becomes "too short" in your
opinion, give fewer than nchar(adapter) values, and the missing leading
values become -1.  Or, give nchar(adapter) values with some 0's at the
beginning, to permit the shorter matches only if they are exact.

# if 3 is too short
 > trimLRPatterns(Rpattern=adapter, subject=read3, max.Rmismatch=rep(2,nchar(adapter)-3))
[1] "TGTTGTACCTGTANGNACNA"

# if 3 is ok, but 2 is too short
 > trimLRPatterns(Rpattern=adapter, subject=read3, max.Rmismatch=rep(2,nchar(adapter)-2))
[1] "TGTTGTACCTGTANGNA"
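
As a rough illustration of the two options above, here is a base-R
sketch that builds the full-length tolerance vectors by hand.  The
helper pad_mismatch is hypothetical (it is not part of Biostrings); it
only mimics the behavior described in this thread, where a short vector
is filled with -1's at the beginning.

```r
# Hypothetical helper, not a Biostrings function: left-pad a short
# tolerance vector with -1 so it has one entry per pattern length 1..n.
pad_mismatch <- function(v, n) {
  stopifnot(length(v) <= n)
  c(rep(-1L, n - length(v)), as.integer(v))
}

adapter <- "GTAGGCACCA"
n <- nchar(adapter)

# "if 3 is too short": overlaps of length 1..3 are prohibited (-1)
too_short_3 <- pad_mismatch(rep(2, n - 3), n)
print(too_short_3)
# [1] -1 -1 -1  2  2  2  2  2  2  2

# Alternative: allow overlaps of length 1..3, but only if exact (0)
exact_short <- c(rep(0, 3), rep(2, n - 3))
print(exact_short)
# [1] 0 0 0 2 2 2 2 2 2 2
```

Position k of the vector is the tolerance for a match against the
first k letters of the adapter.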

You can fine-tune your mismatch tolerances using a rate, as in agrep
(except that it is converted internally by floor instead of ceiling,
and you can't split the allowed edits into
insertions/deletions/substitutions).  This amounts to giving some 0's
at the beginning.

 > trimLRPatterns(Rpattern=adapter, subject=read3, max.Rmismatch=.2)
[1] "TGTTGTACCTGTANGNACNA"

 > trimLRPatterns(Rpattern=adapter, subject=read3, max.Rmismatch=.3)
[1] "TGTTGTACCT"
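
To see why a rate amounts to some leading 0's, here is a small sketch
of the conversion.  The helper rate_to_mismatches is hypothetical, and
the expression floor(rate * k) for pattern length k is an assumption
based on the "floor" remark above, not a copy of the Biostrings
internals.

```r
# Hypothetical helper: tolerance for a match of length k is floor(rate * k).
rate_to_mismatches <- function(rate, n) {
  as.integer(floor(rate * seq_len(n)))
}

r20 <- rate_to_mismatches(0.2, 10)
r30 <- rate_to_mismatches(0.3, 10)
print(r20)
# [1] 0 0 0 0 1 1 1 1 1 2
print(r30)
# [1] 0 0 0 1 1 1 2 2 2 3
```

Under this assumption, a rate of .2 allows only 2 mismatches against the
full 10-letter adapter, so read3 (3 mismatches) is left alone, while .3
allows 3 and the whole adapter is trimmed, consistent with the two
results above.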

Just to be clear, your mismatch vector will usually be sorted/increasing.
The larger, and later, values correspond to matching of the larger
portions of the pattern.  This is somewhat counterintuitive, since the
matching is (now, and as one would expect) attempted from the inside
out, hence using the later mismatch elements first.  It originally was
done from the outside in, using the earlier mismatch elements first,
which explains the order of the vector.  Of course, the outside-in
order is less efficient to the degree that the adapter is present.
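
For anyone who wants to trace the order of attempts by hand, here is a
simplified base-R emulation of right-adapter trimming.  It is only a
sketch of the logic described in this thread, not the Biostrings
implementation: trim_right_sketch and count_mismatches are made-up
names, the tolerance vector must already be full length, and every
letter (including N) is compared literally.

```r
# Count positions where two equal-length strings differ (N is literal).
count_mismatches <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Simplified emulation: try the longest overlap first ("inside out"),
# so the later elements of max_mismatch are consulted first.
trim_right_sketch <- function(adapter, read, max_mismatch) {
  n <- nchar(adapter)
  stopifnot(length(max_mismatch) == n)
  len <- nchar(read)
  for (k in rev(seq_len(min(n, len)))) {
    if (max_mismatch[k] < 0) next            # -1: this length is prohibited
    prefix <- substr(adapter, 1, k)          # first k letters of the adapter
    suffix <- substr(read, len - k + 1, len) # last k letters of the read
    if (count_mismatches(prefix, suffix) <= max_mismatch[k]) {
      return(substr(read, 1, len - k))
    }
  }
  read                                       # nothing matched: no trimming
}

adapter <- "GTAGGCACCA"
read3 <- "TGTTGTACCTGTANGNACNA"

trim_right_sketch(adapter, read3, rep(2, 10))
# [1] "TGTTGTACCTGTANGNA"
trim_right_sketch(adapter, read3, c(rep(-1, 3), rep(2, 7)))
# [1] "TGTTGTACCTGTANGNACNA"
```

With rep(2, 10), the full adapter fails (3 mismatches), every overlap
down to length 4 fails, and the 3-letter overlap "GTA" vs "CNA" passes
with 2 mismatches, which is the unintended trimming discussed below.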

On Jan 27, 2010, at 9:48 PM, joseph wrote:

> Hi Harris
> I am still confused. I used lower case to illustrate the
> mismatches. Does this work only if the mismatches are N's? How do
> you avoid unintended trimming if the mismatch is not N?
> Thanks
> From: Harris A. Jaffee <hj@jhu.edu>
> To: joseph <jdsandjd@yahoo.com>
> Cc: bioc-sig-sequencing@r-project.org
> Sent: Wed, January 27, 2010 6:27:35 PM
> Subject: Re: [Bioc-sig-seq] ABout how to trim out the adaptor of  
> soleax short data!
>
> The problem is with DNAString(Set), which accepts certain lower-case
> letters but uppercases them, as you see from the result of your
> trimLRPatterns call with read3, which ends with "GAA".  Your a's
> turned into 'A', and then "CAA" fuzzy-matched "GTA", with 2 errors.
>
> > DNAString("x")
> Error in .charToSharedRaw(x, start = start, end = end, width = width,  :
>   key 120 not in lookup table
>
> > DNAString("a")
>   1-letter "DNAString" instance
> seq: A
>
> Your reads might as well have 'N' in place of your 'a', and then
> everything works fine, and you see that the subject might as well be
> of type character.  If so, it's converted to a BString(Set) for
> processing, and then the result is cast back to character type,
> unless you want 'ranges'.
>
> > adapter = "GTAGGCACCA"
>
> > read2 = "TGTTGTACCTGTAGGNACNA"
> > trimLRPatterns(Rpattern=adapter, subject=read2, max.Rmismatch=rep(2,nchar(adapter)))
> [1] "TGTTGTACCT"
>
> > read3 = "TGTTGTACCTGTANGNACNA"
> > trimLRPatterns(Rpattern=adapter, subject=read3, max.Rmismatch=rep(2,nchar(adapter)))
> [1] "TGTTGTACCTGTANGNA"
>
> > trimLRPatterns(Rpattern=adapter, subject=read3, max.Rmismatch=rep(2,nchar(adapter)),
> 	ranges=TRUE)
> IRanges of length 1
>     start end width
> [1]     1  17    17
>
> On Jan 27, 2010, at 7:57 PM, joseph wrote:
>> Hello
>> In the example below, read2 has 2 mismatches and the adapter was
>> trimmed correctly. However, read3, which has 3 mismatches (and thus
>> is not supposed to be trimmed), lost its last 3 nucleotides. Can you
>> please help me understand this issue?
>>
>> > adapter = "GTAGGCACCA"
>> > read2 <- DNAStringSet("TGTTGTACCTGTAGGaACaA")
>> > read3 <- DNAStringSet("TGTTGTACCTGTAaGaACaA")
>> >
>> > trimLRPatterns(Rpattern=adapter, subject=read2, max.Rmismatch=rep(2,nchar(adapter)))
>>   A DNAStringSet instance of length 1
>>     width seq
>> [1]    10 TGTTGTACCT
>> > trimLRPatterns(Rpattern=adapter, subject=read3, max.Rmismatch=rep(2,nchar(adapter)))
>>   A DNAStringSet instance of length 1
>>     width seq
>> [1]    17 TGTTGTACCTGTAAGAA
>> >
>>
>>
>> From: Harris A. Jaffee <hj@jhu.edu>
>> To: Hongtao Hu <hzh0005@auburn.edu>
>> Cc: bioc-sig-sequencing@r-project.org
>> Sent: Wed, January 27, 2010 3:40:39 PM
>> Subject: Re: [Bioc-sig-seq] ABout how to trim out the adaptor of  
>> soleax short data!
>>
>> Patrick's trimLRPatterns function in the Biostrings package.
>>
>> Beware that the allowed-mismatch arguments, including the defaults
>> of 0, are turned into vectors of the same length as the relevant
>> pattern.  If a single integer is specified, including the default, it
>> is turned into a vector with many -1's at the beginning, preventing
>> any partial matching of the pattern.  So, the only possible trimming
>> will be by the whole pattern, assuming that it matches well enough.
>> But the presence of the whole adaptor would be a rare event.  To
>> permit arbitrary partial matching, exact or not, you have to give a
>> vector of the same length as the relevant pattern, e.g.
>> rep(e, nchar(pattern)), for whatever non-negative e you want to
>> allow.  You can do this separately for the right and left adaptors.
>>
>> On Jan 27, 2010, at 3:45 PM, Hongtao Hu wrote:
>>
>> > Hey, dear all,
>> > The adaptor in our dataset seems variable. Usually, how should it
>> > be trimmed out, or which software would be used?
>> > The lengths of the 3' and 5' adaptors are each over 20 nt, but the
>> > total length of the reads is 39 nt.  I am wondering if anyone who
>> > has done a similar job can share their experience. Much appreciated!
>> >
>> >
>> > Bests,
>> > Hongtao Hu
>> > Department of Biological Sciences
>> > Auburn University
>> > Auburn, Al 36832
>> > cell phone: 334-524-7282
>> > Hongtao webpage: http://www.auburn.edu/%7Ehzh0005/
>> >
>> > _______________________________________________
>> > Bioc-sig-sequencing mailing list
>> > Bioc-sig-sequencing@r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>



