[Bioc-sig-seq] filtering by adapters in QA report
Harris A. Jaffee
hj at jhu.edu
Sun Mar 27 23:49:52 CEST 2011
Just regarding your remarks on trimLRPatterns / vmatchPattern ...
I don't know how to approach partial adaptors, but I think non-flanking
whole adaptors can be handled essentially by trimLRPatterns. That is, a
front-end can alter your adaptor and mismatch limits for you, then call
trimLRPatterns.
Here, 3 N's are prepended to the adaptor, and Lfixed set to "subject":
> trimNonFlankingPatterns(Lpattern="TTT", subject=DNAStringSet(c("ATTTCG","AATTTC")))
  A DNAStringSet instance of length 2
    width seq
[1]     2 CG
[2]     1 C
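
As a rough illustration of what the N-padded pattern does, here is a base-R
sketch. It is not the trimNonFlankingPatterns front-end shown above and it
does not use Biostrings; the name trimNonFlankingSketch and its arguments are
made up for the example. Suffixes of the padded pattern are compared with the
start of each read, longest first, N matches any base, and the read is cut
after the first suffix that fits. Suffixes shorter than the adaptor are
skipped, so only whole adaptors are trimmed.

```r
# Hypothetical base-R illustration (not Biostrings code): trim a whole
# adaptor that sits up to `pad` bases in from the 5' end of each read.
trimNonFlankingSketch <- function(adaptor, reads, pad = nchar(adaptor),
                                  max.mismatch = 0L) {
  nA <- nchar(adaptor)
  # Prepend `pad` wildcard N's to the adaptor, e.g. "TTT" -> "NNNTTT".
  padded <- strsplit(paste0(strrep("N", pad), adaptor), "")[[1]]
  nP <- length(padded)
  vapply(reads, function(read) {
    bases <- strsplit(read, "")[[1]]
    maxLen <- min(nP, length(bases))
    if (maxLen >= nA) {
      # Longest suffix first; stop at the bare adaptor (length nA).
      for (len in maxLen:nA) {
        suffix <- padded[(nP - len + 1L):nP]
        prefix <- bases[seq_len(len)]
        # N in the pattern is a wildcard; the read is taken literally.
        mism <- sum(suffix != "N" & suffix != prefix)
        if (mism <= max.mismatch) {
          return(substr(read, len + 1L, nchar(read)))
        }
      }
    }
    read  # no whole adaptor found near the start: leave the read alone
  }, character(1), USE.NAMES = FALSE)
}

trimNonFlankingSketch("TTT", c("ATTTCG", "AATTTC"))
# [1] "CG" "C"
```

With Biostrings loaded, the corresponding call would presumably be
trimLRPatterns with Lpattern="NNNTTT" and Lfixed="subject", as described
above.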
Here, the max.Lmismatch vector must be enlarged; the simplest way is just
to replicate the last element (which is done here, 3 times in this case):
> trimNonFlankingPatterns(Lpattern="TTT", subject=DNAStringSet(c("ATATCG","AATTCG")), max.Lmismatch=1)
  A DNAStringSet instance of length 2
    width seq
[1]     2 CG
[2]     1 G
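
A minimal sketch of that enlargement step, assuming that element i of
max.Lmismatch is the limit for the pattern suffix of length i and that a
negative entry disables trimming at that length (this is how the
trimLRPatterns documentation appears to describe it; check your installed
version). The helper name extendLmismatch is hypothetical and is not part of
Biostrings.

```r
# Hypothetical helper: extend a per-suffix mismatch vector for an adaptor
# that has been padded with `pad` leading N's.
extendLmismatch <- function(max.Lmismatch, pad) {
  stopifnot(length(max.Lmismatch) >= 1L, pad >= 0L)
  last <- max.Lmismatch[length(max.Lmismatch)]
  # Every longer suffix still contains the whole adaptor (the extra bases
  # are wildcards), so it gets the same limit as the full adaptor.
  c(max.Lmismatch, rep(last, pad))
}

# 3-base adaptor, whole matches only (-1 blocks the 1- and 2-base
# suffixes), at most 1 mismatch, padded with 3 N's:
extendLmismatch(c(-1L, -1L, 1L), pad = 3L)
# [1] -1 -1  1  1  1  1
```

The result would then be passed as max.Lmismatch alongside the padded
Lpattern.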
On Mar 25, 2011, at 11:59 PM, Marcus Davy wrote:
> Hi Robert,
> just to add to the discussion, it was not initially obvious to me that
> the ShortRead QA report can read either from disk or from a ShortReadQ
> object within R. This at least provides the flexibility to filter a
> ShortReadQ object using trimLRPatterns/vmatchPattern/narrow etc. and
> then run a QA report on the filtered object to get more meaningful
> plots.
>
> I agree it would be a nice feature to be able to specify some adapter
> sequences to filter in a qa() call itself, or potentially to select
> parts of the report of interest. There will be cases that will test
> this proposed functionality, especially around partial adapter
> sequence and the number of mismatches to allow for. I recently came
> across a synthetic construct (~20 bases) in an Illumina experiment
> which was the first half of an adapter with the addition of a single
> random DNA base at the 5' start, so the partial adapter effectively
> started at cycle position 2 of the subjects. Using Biostrings
> trimLRPatterns may not identify this pattern and dynamically trim or
> filter (utilizing ranges coordinates) unless the random base is added
> to the start of the pattern and at least one mismatch is allowed,
> whereas using a vmatchPattern approach to filter would work.
>
> Marcus
>
>
> On Sat, Mar 26, 2011 at 5:41 AM, Robert Gentleman
> <rgentlem at gmail.com> wrote:
>
>> On Fri, Mar 25, 2011 at 8:59 AM, Martin Morgan
>> <mtmorgan at fhcrc.org> wrote:
>>> On 03/24/2011 10:56 AM, Michael Lawrence wrote:
>>>>
>>>> Hi Martin,
>>>>
>>>> It would be nice if the ShortRead QA report could somehow filter out
>>>> the adapter contamination before generating the rest of its plots,
>>>> since those plots are pretty meaningless if there are adapters
>>>> present.
>>>>
>>>> Not sure how to handle this filtering in general. That is, what if
>>>> someone then wants to see plots with only the "high quality" reads
>>>> after the quality plots? It gets complicated. ShortRead has a nice
>>>> filtering mechanism, but this is more complicated, since some QA
>>>> plots come from one filter, while others come from a different
>>>> stage.
>>>>
>>>> However, under the assumption that no one would ever want to align an
>>>> adapter, i.e., those reads will not be carried forward, the adapter
>>>> removal could just be hard-coded as a special case. More customized
>>>> solutions could then leverage the internal ShortRead functions for
>>>> generating each slot in the QA object, building it up incrementally
>>>> on different subsets. Of course, to make sense, that would require a
>>>> different report template, too.
>>>
>>> Hi Michael -- Yes it would be nice to be able to more flexibly control
>>> how different components of the report are generated, or at least to
>>> make some smarter choices along the lines you suggest for adapter
>>> contaminants. It's hard to know how to make this really general, but I
>>> have come across other situations where I'd like to cherry-pick which
>>> parts of the QA process I want to perform. I think I need some
>>> standardization on function signatures for generating each report
>>> section, tighter description of results from each section (i.e., a
>>> formal class hierarchy), and then a flexible report composition. It
>>> seems like quite a big task; I wonder if there are good models out
>>> there to follow? arrayQualityMetrics?
>>
>> I think arrayQualityMetrics is a good starting place. Audrey and
>> Wolfgang have done a good job of modularizing the components. But there
>> are still hiccups - which suggests just how hard that is. And as you
>> suggested, it was a big job.
>>
>> I think the case Michael is bringing up might be useful to deal with,
>> without a major rewrite. There should be some sort of file that
>> ShortRead has access to (or an input parameter) that gives some more
>> details on the samples and on the processing (e.g., what the sample
>> labels should be, and what the adapters etc. are). Then this
>> information could be used in the current paradigm.
>>
>> Mostly the issue is that if you have adapter contamination then the
>> subsequent plots (e.g., nucleotide by cycle) are not useful. You cannot
>> see anything in them and then you have to go back and strip adapters
>> by hand, then rerun ShortRead. I agree that you may want more general
>> filtering, as an abundance of any read will affect the plots, but I
>> think there is agreement that one would never want to include the
>> adapters (you do want counts as are produced now, but given their
>> effect on the graphics, filtering would be beneficial).
>>
>> best wishes
>> Robert
>>>
>>> Martin
>>>
>>>>
>>>> Michael
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>> --
>>> Computational Biology
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>
>>> Location: M1-B861
>>> Telephone: 206 667-2793
>>>
>>
>>
>>
>> --
>> Robert Gentleman
>> rgentlem at gmail.com
>>