[Bioc-sig-seq] filtering by adapters in QA report

Sun Mar 27 23:49:52 CEST 2011

Just regarding your remarks on trimLRPatterns / vmatchPattern ...

I don't know how to approach partial adaptors, but I think non-flanking whole
adaptors can be handled essentially by trimLRPatterns.  That is, a front-end
can alter your adaptor and mismatch limits for you, then call trimLRPatterns.

Here, 3 N's are prepended to the adaptor (giving "NNNTTT"), and Lfixed is set
to "subject" so that the N's in the pattern match any base in the read:

 > trimNonFlankingPatterns(Lpattern="TTT", subject=DNAStringSet(c("ATTTCG","AATTTC")))
   A DNAStringSet instance of length 2
     width seq
[1]     2 CG
[2]     1 C

Here, the max.Lmismatch vector must be enlarged; the simplest way is just
to replicate the last element (which is done here, 3 times in this case):

 > trimNonFlankingPatterns(Lpattern="TTT", subject=DNAStringSet(c("ATATCG","AATTCG")), max.Lmismatch=1)
   A DNAStringSet instance of length 2
     width seq
[1]     2 CG
[2]     1 G
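
A front-end along these lines might look roughly like the sketch below. It is
not the trimNonFlankingPatterns code used above: the function name
trimNonFlankingLeft and the n.pad argument are hypothetical, only the left
pattern is handled, and the expansion of max.Lmismatch assumes the conventions
described in the trimLRPatterns documentation (a single value in [0, 1) is a
rate; an integer vector is padded with -1 at the low end).

```r
library(Biostrings)

# Hypothetical front-end to trimLRPatterns(), left side only.
# Not the trimNonFlankingPatterns() code from this post; the name and the
# `n.pad` argument are assumptions made for illustration.
trimNonFlankingLeft <- function(Lpattern, subject, max.Lmismatch = 0,
                                n.pad = nchar(Lpattern)) {
    nLp <- nchar(Lpattern)

    # Expand the limits to one integer per suffix length of Lpattern:
    # a single value in [0, 1) is treated as a rate, anything else as a
    # vector of counts padded with -1 (no trimming) at the low end.
    if (length(max.Lmismatch) == 1L &&
        max.Lmismatch >= 0 && max.Lmismatch < 1) {
        mm <- as.integer(seq_len(nLp) * max.Lmismatch)
    } else {
        mm <- as.integer(max.Lmismatch)
        mm <- c(rep(-1L, max(0L, nLp - length(mm))), mm)
    }

    # Prepend N's so the adaptor may start up to n.pad bases into the read,
    # and replicate the last mismatch limit once per added N.
    padded <- paste0(strrep("N", n.pad), Lpattern)
    mm <- c(mm, rep(mm[length(mm)], n.pad))

    # Lfixed = "subject": only the N's in the pattern act as wildcards.
    trimLRPatterns(Lpattern = padded, subject = subject,
                   max.Lmismatch = mm, Lfixed = "subject")
}

# Same inputs as the two examples above.
reads1 <- DNAStringSet(c("ATTTCG", "AATTTC"))
trimmed1 <- trimNonFlankingLeft("TTT", reads1)

reads2 <- DNAStringSet(c("ATATCG", "AATTCG"))
trimmed2 <- trimNonFlankingLeft("TTT", reads2, max.Lmismatch = 1)
```

Because the prepended N's match any base, they never count against the
mismatch limit; replicating the last limit simply keeps the tolerance for the
whole adaptor unchanged wherever it starts within the first n.pad bases.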

On Mar 25, 2011, at 11:59 PM, Marcus Davy wrote:

> Hi Robert,
> just to add to the discussion, it was not initially obvious to me that the
> ShortRead QA report can read either from disk or from a ShortReadQ object
> within R. This at least provides the flexibility to filter a ShortReadQ
> object using trimLRPatterns/vmatchPattern/narrow etc. and then run a QA
> report on the filtered object to get more meaningful plots.
>
> I agree it would be a nice feature to be able to specify some adapter
> sequences to filter in a qa() call itself, or potentially to select parts
> of the report of interest. There will be cases that will test this proposed
> functionality, especially around partial adapter sequence and the number of
> mismatches to allow for. I recently came across a synthetic construct
> (~20 bases) in an Illumina experiment which was the first half of an adapter
> with the addition of a single random DNA base at the 5' start, so the
> partial adapter effectively started at cycle position 2 of the subjects.
> Using Biostrings trimLRPatterns may not identify this pattern and
> dynamically trim or filter (utilizing ranges coordinates) unless the random
> base is added to the start of the pattern and at least one mismatch is
> allowed, whereas using a vmatchPattern approach to filter would work.
>
> Marcus
>
>
> On Sat, Mar 26, 2011 at 5:41 AM, Robert Gentleman <rgentlem at gmail.com> wrote:
>
>> On Fri, Mar 25, 2011 at 8:59 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>> On 03/24/2011 10:56 AM, Michael Lawrence wrote:
>>>>
>>>> Hi Martin,
>>>>
>>>> It would be nice if the ShortRead QA report could somehow filter out the
>>>> adapter contamination before generating the rest of its plots, since those
>>>> plots are pretty meaningless if there are adapters present.
>>>>
>>>> Not sure how to handle this filtering in general. That is, what if someone
>>>> then wants to see plots with only the "high quality" reads after the
>>>> quality plots? It gets complicated. ShortRead has a nice filtering
>>>> mechanism, but this is more complicated, since some QA plots come from one
>>>> filter, while others come from a different stage.
>>>>
>>>> However, under the assumption that no one would ever want to align an
>>>> adapter, i.e., those reads will not be carried forward, the adapter removal
>>>> could just be treated specially, i.e., hard-coded. And then just expect more
>>>> customized solutions to leverage the internal ShortRead functions for
>>>> generating each slot in the QA object, building it up incrementally, on
>>>> different subsets. Of course, to make sense, that would require a different
>>>> report template, too.
>>>
>>> Hi Michael -- Yes it would be nice to be able to more flexibly control how
>>> different components of the report are generated, or at least to make some
>>> smarter choices along the lines you suggest for adapter contaminants. It's
>>> hard to know how to make this really general, but I have come across other
>>> situations where I'd like to cherry-pick which parts of the QA process I
>>> want to perform. I think I need some standardization on function signatures
>>> for generating each report section, tighter description of results from each
>>> section (i.e., a formal class hierarchy), and then a flexible report
>>> composition. It seems like quite a big task; I wonder if there are good
>>> models out there to follow? arrayQualityMetrics?
>>
>>   I think arrayQualityMetrics is a good starting place.  Audrey and Wolfgang
>> have done a good job of modularizing the components.  But there are still
>> hiccups - which suggests just how hard that is.  And as you suggested, it
>> was a big job.
>>
>>  I think the case Michael is bringing up might be useful to deal with,
>> without a major rewrite.  There should be some sort of file that ShortRead
>> has access to (or an input parameter) that gives some more details on the
>> samples and on the processing (e.g. what the sample labels should be, and
>> what the adapters etc. are). Then this information could be used in the
>> current paradigm.
>>
>> Mostly the issue is that if you have adapter contamination then the
>> subsequent plots (e.g. nucleotide by cycle) are not useful.  You cannot see
>> anything in them and then you have to go back and strip adapters by hand,
>> then rerun ShortRead. I agree that you may want more general filtering, as
>> an abundance of any read will affect the plots, but I think there is
>> agreement that one would never want to include the adapters (you do want
>> counts as are produced now, but given their effect on the graphics,
>> filtering would be beneficial).
>>
>>  best wishes
>>    Robert
>>>
>>> Martin
>>>
>>>>
>>>> Michael
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>> --
>>> Computational Biology
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>
>>> Location: M1-B861
>>> Telephone: 206 667-2793
>>>
>>
>>
>>
>> --
>> Robert Gentleman
>> rgentlem at gmail.com
>>