[Bioc-sig-seq] assess how many duplicated reads

Thomas Girke thomas.girke at ucr.edu
Fri Aug 12 06:59:41 CEST 2011


In addition to Martin's suggestions, you could also cluster your reads
by similarity with UCLUST or SEED. SEED was developed for removing
redundancy in NGS samples, diagnosing PCR bias in RNA-Seq experiments,
and some other applications. It has not yet been integrated into an R
package.

SEED
http://bioinformatics.oxfordjournals.org/content/early/2011/08/02/bioinformatics.btr447.abstract

UCLUST
http://bioinformatics.oxfordjournals.org/content/early/2010/08/12/bioinformatics.btq461.abstract
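
To illustrate what clustering reads by similarity means for redundancy
estimates, here is a toy sketch in Python. It is NOT the SEED or UCLUST
algorithm; it is a naive greedy clustering that assumes equal-length reads
and uses a plain mismatch count (Hamming distance). The function names and
the max_mismatches parameter are made up for this example.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(reads: List[str], max_mismatches: int = 1) -> List[List[str]]:
    """Toy greedy clustering (hypothetical helper, not SEED/UCLUST).

    Each read joins the first cluster whose seed (its first member) has the
    same length and differs by at most max_mismatches positions; otherwise
    the read starts a new cluster and becomes its seed.
    """
    clusters: List[List[str]] = []
    for read in reads:
        for cluster in clusters:
            seed = cluster[0]
            if len(seed) == len(read) and hamming(seed, read) <= max_mismatches:
                cluster.append(read)
                break
        else:
            # No existing seed was close enough: start a new cluster.
            clusters.append([read])
    return clusters


reads = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "ACGTACGT", "TTTTCCCG"]
clusters = greedy_cluster(reads, max_mismatches=1)

# Reads beyond the seed of each cluster are counted as redundant.
redundant = len(reads) - len(clusters)
print(len(clusters), "clusters,", redundant, "redundant reads")
```

The result depends on read order and on the mismatch threshold, and the
all-against-seeds comparison will not scale to a full HiSeq lane, which is
the reason to use a dedicated tool for real data.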

Thomas

On Thu, Aug 11, 2011 at 08:34:03PM -0700, Martin Morgan wrote:
> On 08/11/2011 09:50 AM, Kunbin Qu wrote:
> >Hi, I have some human single-end RNA-seq runs from a HiSeq. Can I have
> >some suggestions on how to assess how many reads in these libraries
> >are duplicates? I looked at srFilter() in ShortRead, but it is not
> >clear to me how to implement this with it. Should I use IRanges
> >as an alternative, to assess the unique start sites after
> >mapping? If so, what function do you suggest? I'd like to count reads
> >that map to the same location (even with some mismatches) as
> >duplicates. Thanks.
> 
> ShortRead::tables() could be used for exactly identical unaligned
> reads. ShortRead::occurrenceFilter() is an implementation for
> non-gapped, aligned reads. For aligned reads with gaps I think
> you're on your own, but maybe GenomicRanges::readGappedAlignments() or
> Rsamtools::scanBam() + the logic of ShortRead::occurrenceFilter() would
> be a starting point. Perhaps your aligner has already flagged
> duplicate reads, in which case the 'flag' field available in
> ScanBamParam() and scanBam() would be helpful.
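
The two ideas above (counting reads that share a start site, and reading
an aligner-set duplicate flag) can be sketched outside R as follows. This
is a minimal illustration, not the logic of ShortRead::occurrenceFilter.
It assumes ungapped reads of equal length, so that the leftmost mapped
position identifies the start site on either strand. The Alignment record
and both function names are hypothetical; the flag bits 0x10 (reverse
strand) and 0x400 (PCR or optical duplicate) are from the SAM format.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal record holding three SAM fields."""
    chrom: str
    pos: int   # 1-based leftmost mapped position (SAM POS)
    flag: int  # SAM bitwise FLAG


REVERSE = 0x10     # SAM flag bit: read mapped to the reverse strand
DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def count_position_duplicates(alignments: Iterable[Alignment]) -> int:
    """Count reads beyond the first at each (chrom, strand, pos) key."""
    keys = Counter(
        (a.chrom, "-" if a.flag & REVERSE else "+", a.pos) for a in alignments
    )
    return sum(n - 1 for n in keys.values())


def count_flagged_duplicates(alignments: Iterable[Alignment]) -> int:
    """Count reads the aligner (or a later tool) already marked as duplicates."""
    return sum(1 for a in alignments if a.flag & DUPLICATE)


alns = [
    Alignment("chr1", 100, 0),
    Alignment("chr1", 100, 0),       # same site and strand as the first read
    Alignment("chr1", 100, 16),      # same position, opposite strand
    Alignment("chr2", 100, 0),       # different chromosome
    Alignment("chr1", 100, 0x400),   # same site, also flagged as duplicate
]

print(count_position_duplicates(alns), "position duplicates")
print(count_flagged_duplicates(alns), "flagged duplicates")
```

Because the key ignores the read sequence, reads that map to the same
location with a few mismatches are counted as duplicates, which matches
the original question. For gapped alignments the key would need to be
derived from the CIGAR string rather than from POS alone.
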
> 
> Hope that is of some help.
> 
> Martin
> 
> 
> >
> >-Kunbin
> >_______________________________________________ Bioc-sig-sequencing
> >mailing list Bioc-sig-sequencing at r-project.org
> >https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> 
> -- 
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> 
> Location: M1-B861
> Telephone: 206 667-2793
> 


