[Bioc-sig-seq] Low-complexity read filtering/trimming

Mon Feb 23 22:24:40 CET 2009

A small correction to my previous algorithm used for computing the
scores. In my previous email I suggested this:

   tnf <- trinucleotideFrequency2(dict0)
   scores <- rowSums(tnf * tnf)

but I think it would be more accurate and closer to the DUST
approach to start penalizing a read when it contains triplets
with frequency >= 2 i.e. the first time a triplet occurs in a
read should not count. This leads to the following code:

   tnf2 <- tnf - 1L
   tnf2[tnf2 < 0] <- 0L
   scores <- rowSums(tnf2 * tnf2)

   > summary(scores)
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0.00    8.00   12.00   24.59   19.00 1024.00

Now whatever the length of the reads is, they have a chance
to obtain a score of 0, which happens when all triplets in
the read are distinct. This new score is more meaningful IMO.

H.

Hervé Pagès wrote:
> Hi Cei,
> 
> Here is a suggestion for question 1).
> 
> DUST seems to have been designed to mask regions in long DNA sequences.
> They count tri-nucleotide frequencies in a 64-nucleotide sliding windows
> so, strictly speaking, the algo doesn't work anymore with short reads.
> In addition your problem seems simpler to me, because, if I understand
> correctly, you don't want to mask regions within the reads, but just
> want to keep or drop the entire read based on its "complexity".
> The notion of "complexity" as defined by DUST could be reused though, by
> applying trinucleotideFrequency() to the DNAStringSet object containing
> the reads, replacing all coefficients in the resulting matrix by its
> square, applying rowSums() to it, and finally using the resulting score
> to filter the reads (high score meaning low complexity):
> 
>   library(ShortRead)
>   rfq <- readFastq("some_path", pattern="s_1_sequence.txt")
>   dict0 <- sread(rfq)
> 
>   tnf <- trinucleotideFrequency(dict0)
>   scores <- rowSums(tnf * tnf)
> 
>   table(scores)  # choose a cut-off value
>   clean_dict <- dict0[scores < cut_off]
> 
> There is one technical problem though: the top-level loop in
> trinucleotideFrequency() is still written in R so that won't work
> for millions of reads. The temporary workaround is to implement
> a vcountPDict-based version of trinucleotideFrequency():
> 
>   trinucleotideFrequency2 <- function(x)
>   {
>     all_triplets <- DNAStringSet(mkAllStrings(c("A", "C", "G", "T"), 3))
>     pd <- PDict(all_triplets)
>     t(vcountPDict(pd, x))
>   }
> 
> trinucleotideFrequency2() will take about 20 seconds on a
> 4-million 35-mers DNAStringSet object.
> 
> I calculated the scores for a set of 4.47 millions unaligned
> 35-mers from a single lane of a real Solexa experiment and got:
> 
>   > summary(scores)
>      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>     22.00   49.00   55.00   66.95   63.00 1089.00
> 
>   > dict0[scores >= 200]
>     A DNAStringSet instance of length 63331
>           width seq
>       [1]    35 AAAAAACCAAAACCAAACAAATAAAAACCCCAAAT
>       [2]    35 GATAGATAGATAGATAGATAGATAGGGTAGATAGG
>       [3]    35 GTGTGTGTGTGTGTGTGTGTATGTGTGTGTGCGTT
>       [4]    35 GGTAGGTAGGTAGGTCAGGTAGGTAGGTAGGTAGG
>       [5]    35 TGGTTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
>       [6]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>       [7]    35 GTGTGTCTGTGTGTCTGTGTGTCTGTGTGTGTCTG
>       [8]    35 GTATGTGTGCTTTGTGTGTGTGGTGTGTGTGTGGT
>       [9]    35 GAGAGTGTGTGTGTGAGAGAGAGTGTGTGTGTGAG
>       ...   ... ...
>   [63323]    35 AAAAAAAAAAAAAAAAAAAAACGAAGAAAAAAAAG
>   [63324]    35 AAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAACA
>   [63325]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>   [63326]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>   [63327]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>   [63328]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>   [63329]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>   [63330]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>   [63331]    35 AAAAAAAGAAAAAAAAAAAAAAAAGAGAAAAAAAA
> 
> Note that 1089 is the score obtained by the poly-As, poly-Cs,
> etc... (1089 = 33^2, this is the highest possible score for a
> 35-mer).
> 
> H.
> 
> 
> Cei Abreu-Goodger wrote:
>> Hi all,
>>
>> I've been playing around with some Solexa small-RNA reads using 
>> ShortRead and Biostrings. I've used the 'trimLRPatterns' function to 
>> remove adapter sequence, and I've been trying to remove low-complexity 
>> sequences with 'srFilter'. I would first really like to congratulate 
>> all the people involved for the great work. There are two situations 
>> in which I would be grateful for some suggestions, though:
>>
>> 1) I have many "low-complexity" reads. Some are simply polyA, polyC, 
>> etc. But some others are runs of "ATATAT" or "CACACACA", etc. 
>> Previously I would have used "dust" on the command line to filter out 
>> this kind of read in a fasta file. Any ideas on how to achieve similar 
>> functionality in the ShortRead world?
>>
>> 2) For some reads I may have a "N-rich" patch inside the read, for 
>> example:
>> AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG
>>
>> I would ideally like to trim of everything starting at the "N-rich" 
>> part. I was trying to implement something with 'vmatchPattern', but if 
>> I allow for mismatches (for a more flexible search) I will also get 
>> hits starting before the run of Ns.
>>
>> Many thanks,
>>
>> Cei
>>
>>
>>
>> sessionInfo()
>>
>> R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
>> i386-apple-darwin9.6.0
>>
>> locale:
>> C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>
>> other attached packages:
>> [1] ShortRead_1.1.39   lattice_0.17-20    BSgenome_1.11.9 
>> Biostrings_2.11.28
>> [5] IRanges_1.1.38     Biobase_2.3.10
>>
>> loaded via a namespace (and not attached):
>> [1] Matrix_0.999375-20 grid_2.9.0
>>
>>
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319