[Bioc-sig-seq] trimLRPatters: inconstitency re max.[LR]mismatch

Wed Feb 17 19:56:28 CET 2010

Absolutely!

On Feb 3, 2010, at 2:57 PM, Harris A. Jaffee wrote:
> <patch_for_trimLRPatterns>
> This is my solution.  It vastly simplifies things, supports indels  
> on both sides,
> and corrects the behavior of trimming based on non-flanking  
> matches.  All matching
> is now tied to the edge of the subject (i.e., starting.at=1 and  
> going forward, or
> ending.at = "the end" and extending back).  Trimming is now by the  
> length of the
> longest acceptably matching pattern rather than based on any match,  
> of which there
> can of course be many within the edit tolerance.  Setting the  
> tolerances with indels
> allowed is now clear enough from the man page that even I  
> understand it, whereas
> before I had no chance! ...
> I did NOT change the default R/L-mismatch defaults of 0, but I  
> wanted to.  I
> believe 0 should come under the realm of an "error rate" (currently  
> required to be
> >0 and <1).  So, I suggest that 0 becomes an all zero vector of the  
> length of the
> R/L-pattern, rather than a bunch of -1's plus one 0 at the end.

So, I would change the man page (and its supporting code) to  
something like

	A single numeric value in '[0, 1)' is taken as a "rate" ...

This would break backward compatibility, but I view it more as fixing  
a buggy
spec/implementation.

The indels story is a little different.  I view that as a faulty  
implementation,
which I believe I fixed, at least partially, of a good spec (trimming  
*flanking*
patterns).  "partially" gets complicated, so I won't explain here...

-Harris

On Feb 17, 2010, at 1:04 PM, Patrick Aboyoun wrote:
> Simon,
> Harris has been spending the most time thinking and working on this  
> issue. I am including him in this e-mail.
>
> Harris,
> Do you think we should change the default behavior? I know you  
> mentioned this is a previous e-mail and if users are not getting  
> the results they are expecting, perhaps now is the time to bite the  
> bullet (by breaking backwards compatibility) and change how 0 is  
> interpreted by the max.[LR]mismatch parameter.
>
>
> Patrick
>
>
> On 2/17/10 5:53 AM, Simon Anders wrote:
>> Hi Patrick
>>
>> I've just tried to trim adaptors of my Solexa reads and got a bit  
>> puzzled about trimLRPatterns's max.Rmismatch argument.
>>
>> Let's say I have two sequences as follows:
>>
>> > library(ShortRead)
>> [...]
>> > subject <- DNAString( "CGCCGGCCGAAATTTAA" )
>> > pattern <- DNAString( "AAATTTAAATTTAAATTT" )
>>
>> These have a clear overlapping alignment without mismatch:
>>
>>    CGCCGGCCGAAATTTAA <- subject
>>             AAATTTAAATTTAAATTT <- (right) pattern
>>
>>    CGCCGGCCG <- trimmed subject
>>
>> So, I would expect trimLRpattern to trim it to the given "trimmed  
>> subject" sequence. However:
>>
>> > trimLRPatterns( subject=subject, Rpattern=pattern )
>>   17-letter "DNAString" instance
>> seq: CGCCGGCCGAAATTTAA
>>
>> As you can see, the subject untrimmed.
>>
>> If I now specify allow for 10% mismatch, trimming works:
>>
>> > trimLRPatterns( subject=subject, Rpattern=pattern,  
>> max.Rmismatch=.1 )
>>   9-letter "DNAString" instance
>> seq: CGCCGGCCG
>>
>> Why do I need to allow for mismatches?
>>
>>
>> This here is even a bit stranger: COmpare the result of allowing  
>> for very small mismatches, say 10^-10, with exactly 0:
>>
>> > trimLRPatterns( subject=subject, Rpattern=pattern,
>> +    max.Rmismatch=1e-10 )
>>   9-letter "DNAString" instance
>> seq: CGCCGGCCG
>>
>> > trimLRPatterns( subject=subject, Rpattern=pattern,
>> +    max.Rmismatch=0 )
>>   17-letter "DNAString" instance
>> seq: CGCCGGCCGAAATTTAA
>>
>> This is a bit dis-continuous, isn't it? ;-)
>>
>>
>> Cheers
>>   Simon
>>
>> > sessionInfo()
>> R version 2.11.0 Under development (unstable) (--)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] ShortRead_1.5.15   lattice_0.17-26    BSgenome_1.15.4  
>> Biostrings_2.15.21
>> [5] IRanges_1.5.46
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.7.2 grid_2.11.0   hwriter_1.1   tools_2.11.0
>>
>>
>> +---
>> | Dr. Simon Anders, Dipl.-Phys.
>> | European Molecular Biology Laboratory (EMBL), Heidelberg
>> | office phone +49-6221-387-8632
>> | preferred (permanent) e-mail: sanders at fs.tum.de
>