[Bioc-sig-seq] Low-complexity read filtering/trimming
Hervé Pagès
hpages at fhcrc.org
Mon Feb 23 22:24:40 CET 2009
A small correction to my previous algorithm used for computing the
scores. In my previous email I suggested this:
tnf <- trinucleotideFrequency2(dict0)
scores <- rowSums(tnf * tnf)
but I think it would be more accurate and closer to the DUST
approach to start penalizing a read when it contains triplets
with frequency >= 2 i.e. the first time a triplet occurs in a
read should not count. This leads to the following code:
tnf2 <- tnf - 1L
tnf2[tnf2 < 0] <- 0L
scores <- rowSums(tnf2 * tnf2)
> summary(scores)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 8.00 12.00 24.59 19.00 1024.00
Now whatever the length of the reads is, they have a chance
to obtain a score of 0, which happens when all triplets in
the read are distinct. This new score is more meaningful IMO.
H.
Hervé Pagès wrote:
> Hi Cei,
>
> Here is a suggestion for question 1).
>
> DUST seems to have been designed to mask regions in long DNA sequences.
> They count tri-nucleotide frequencies in a 64-nucleotide sliding windows
> so, strictly speaking, the algo doesn't work anymore with short reads.
> In addition your problem seems simpler to me, because, if I understand
> correctly, you don't want to mask regions within the reads, but just
> want to keep or drop the entire read based on its "complexity".
> The notion of "complexity" as defined by DUST could be reused though, by
> applying trinucleotideFrequency() to the DNAStringSet object containing
> the reads, replacing all coefficients in the resulting matrix by its
> square, applying rowSums() to it, and finally using the resulting score
> to filter the reads (high score meaning low complexity):
>
> library(ShortRead)
> rfq <- readFastq("some_path", pattern="s_1_sequence.txt")
> dict0 <- sread(rfq)
>
> tnf <- trinucleotideFrequency(dict0)
> scores <- rowSums(tnf * tnf)
>
> table(scores) # choose a cut-off value
> clean_dict <- dict0[scores < cut_off]
>
> There is one technical problem though: the top-level loop in
> trinucleotideFrequency() is still written in R so that won't work
> for millions of reads. The temporary workaround is to implement
> a vcountPDict-based version of trinucleotideFrequency():
>
> trinucleotideFrequency2 <- function(x)
> {
> all_triplets <- DNAStringSet(mkAllStrings(c("A", "C", "G", "T"), 3))
> pd <- PDict(all_triplets)
> t(vcountPDict(pd, x))
> }
>
> trinucleotideFrequency2() will take about 20 seconds on a
> 4-million 35-mers DNAStringSet object.
>
> I calculated the scores for a set of 4.47 millions unaligned
> 35-mers from a single lane of a real Solexa experiment and got:
>
> > summary(scores)
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 22.00 49.00 55.00 66.95 63.00 1089.00
>
> > dict0[scores >= 200]
> A DNAStringSet instance of length 63331
> width seq
> [1] 35 AAAAAACCAAAACCAAACAAATAAAAACCCCAAAT
> [2] 35 GATAGATAGATAGATAGATAGATAGGGTAGATAGG
> [3] 35 GTGTGTGTGTGTGTGTGTGTATGTGTGTGTGCGTT
> [4] 35 GGTAGGTAGGTAGGTCAGGTAGGTAGGTAGGTAGG
> [5] 35 TGGTTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
> [6] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> [7] 35 GTGTGTCTGTGTGTCTGTGTGTCTGTGTGTGTCTG
> [8] 35 GTATGTGTGCTTTGTGTGTGTGGTGTGTGTGTGGT
> [9] 35 GAGAGTGTGTGTGTGAGAGAGAGTGTGTGTGTGAG
> ... ... ...
> [63323] 35 AAAAAAAAAAAAAAAAAAAAACGAAGAAAAAAAAG
> [63324] 35 AAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAACA
> [63325] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> [63326] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> [63327] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> [63328] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> [63329] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> [63330] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> [63331] 35 AAAAAAAGAAAAAAAAAAAAAAAAGAGAAAAAAAA
>
> Note that 1089 is the score obtained by the poly-As, poly-Cs,
> etc... (1089 = 33^2, this is the highest possible score for a
> 35-mer).
>
> H.
>
>
> Cei Abreu-Goodger wrote:
>> Hi all,
>>
>> I've been playing around with some Solexa small-RNA reads using
>> ShortRead and Biostrings. I've used the 'trimLRPatterns' function to
>> remove adapter sequence, and I've been trying to remove low-complexity
>> sequences with 'srFilter'. I would first really like to congratulate
>> all the people involved for the great work. There are two situations
>> in which I would be grateful for some suggestions, though:
>>
>> 1) I have many "low-complexity" reads. Some are simply polyA, polyC,
>> etc. But some others are runs of "ATATAT" or "CACACACA", etc.
>> Previously I would have used "dust" on the command line to filter out
>> this kind of read in a fasta file. Any ideas on how to achieve similar
>> functionality in the ShortRead world?
>>
>> 2) For some reads I may have a "N-rich" patch inside the read, for
>> example:
>> AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG
>>
>> I would ideally like to trim of everything starting at the "N-rich"
>> part. I was trying to implement something with 'vmatchPattern', but if
>> I allow for mismatches (for a more flexible search) I will also get
>> hits starting before the run of Ns.
>>
>> Many thanks,
>>
>> Cei
>>
>>
>>
>> sessionInfo()
>>
>> R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
>> i386-apple-darwin9.6.0
>>
>> locale:
>> C
>>
>> attached base packages:
>> [1] stats graphics grDevices datasets utils methods base
>>
>> other attached packages:
>> [1] ShortRead_1.1.39 lattice_0.17-20 BSgenome_1.11.9
>> Biostrings_2.11.28
>> [5] IRanges_1.1.38 Biobase_2.3.10
>>
>> loaded via a namespace (and not attached):
>> [1] Matrix_0.999375-20 grid_2.9.0
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-sig-sequencing
mailing list