[Bioc-sig-seq] short read low complexity filtering

Thu Dec 9 00:03:46 CET 2010

Hi Robert,

Sorry for the late answer. I'm cc'ing the Bioc-sig-seq mailing list
so my answer will be archived and searchable.

On 12/02/2010 12:52 PM, R. T. Sweeney wrote:
> Dear Herve,
> My name is Robert Sweeney, a Stanford post-doc, new to R, looking
> forward to attending the R/bioconductor workshop Dec 9-10.
> I have a question about the R dustmasker-like script you suggested for
> removing low complexity short reads from fastq files.
> Here is the thread you wrote on it:
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-February/000170.html
>
> I am also new to the dustmasker type scoring. I am wondering if you
> could help me understand better. Attached is the output from the script.
> Could you briefly explain what the scores mean and about where I should
> choose a cut-off value?

I'm not a dustmasker expert. In fact I've never used it. There is a
paper about the DUST implementation here:

http://www.ncbi.nlm.nih.gov/pubmed/16796549

It seems that DUST has been used within BLAST for many years. But
because it uses a sliding windows of length 64 nucleotides my
understanding is that it is better suited for masking portions of a
genome than for scoring the complexity of short reads.

Anyway, the basic idea behind DUST is simple, intuitive, and can
be easily adapted to work on short reads. The basic idea is that
regions (or short reads) with a poor tri-nucleotide content (i.e.
with a small number of *distinct* tri-nucleotides) are considered
to have a low complexity. For example those sequences have a low
complexity:

   AAAAAAAAAA     contains only 1 tri-nculeotide: AAA
   ATATATATATAT   contains 2 tri-nucleotide: ATA and TAT

In the other hands, regions (or short reads) with a rich tri-nucleotide
content (i.e. many distinct tri-nucleotides) are considered to have
a high complexity. For example, the following 36-mer

   GGGCTACATGACGGTCCTGTATTTAGCCAGAGGATC

has the highest complexity a 36-mer can have because all the
tri-nucleotides contained in it (34 in total) are distinct.

There are many ways to implement this idea. The one I chose is
to penalize reads for which some tri-nucleotides are represented
more than once. This is achieved by this part of the code:

    tnf2 <- tnf - 1L
    tnf2[tnf2 < 0] <- 0L
    scores <- rowSums(tnf2 * tnf2)

where 'tnf' is a matrix containing the tri-nucleotide counts
for each short read. The matrix has 1 row per short read and
64 columns (i.e. one column per each possible tri-nucleotide).

The final 'scores' vector is just a numeric vector of length the
number of short reads.

As for about where to choose a cut-off value, it all depends on
the kind of data you have, what you want to do with it, etc...
it's really up to you. I would try different values and for each
of them I would look at:

   reads[scores >= cut-off]

where 'reads' is the DNAStringSet object containing your reads.
Start using the highest score as the cut-off and see what you
get (you should get only poly-As, poly-Cs, poly-Gs and poly-Ts).
Then try with lower values until you are satisfied. This is of
course a pragmatic approach with no serious statistical background :-)

> Also, how would we generate the two output files:
> name_low_complexity_removed.fastq and low_complexity_reads.fastq?

You could try to use write.XStringSet(...  , format="fastq") for this.
Or use the IO facilities of the ShortRead package. It depends what you
want to do with the scores. Alternatively you can also generate FASTA
files and try to convert them to FASTQ files with an external tool.

>
> Thanks for any help you can provide and look forward to meeting you at
> the workshop.

See you there.

Cheers,
H.

>
> Thanks,
> Robert Sweeney

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319