[Bioc-sig-seq] Low-complexity read filtering/trimming
Cei Abreu-Goodger
cei at ebi.ac.uk
Mon Feb 23 01:23:07 CET 2009
Hi all,
I've been playing around with some Solexa small-RNA reads using
ShortRead and Biostrings. I've used the 'trimLRPatterns' function to
remove adapter sequence, and I've been trying to remove low-complexity
sequences with 'srFilter'. I would first really like to congratulate all
the people involved for the great work. There are two situations in
which I would be grateful for some suggestions, though:
1) I have many "low-complexity" reads. Some are simply polyA, polyC,
etc. But some others are runs of "ATATAT" or "CACACACA", etc. Previously
I would have used "dust" on the command line to filter out this kind of
read in a fasta file. Any ideas on how to achieve similar functionality
in the ShortRead world?
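To make the request in 1) concrete, here is a minimal sketch of a dust-like filter in plain Python. It shows the logic only and is not a ShortRead or Biostrings call. The score counts how often each overlapping triplet recurs in a read, in the spirit of "dust" but not a reimplementation of it. The function names and the threshold of 2.0 are hypothetical and would need tuning on real reads.

```python
from collections import Counter


def triplet_score(seq):
    """Dust-like score: grows when a few triplets dominate the read.

    Simplified illustration, not the exact scoring used by the dust program.
    """
    seq = seq.upper()
    n_triplets = len(seq) - 2
    if n_triplets < 2:
        return 0.0  # too short to score meaningfully
    # Count every overlapping 3-mer in the read.
    counts = Counter(seq[i:i + 3] for i in range(n_triplets))
    # Each triplet seen c times contributes c*(c-1)/2 repeated pairs.
    repeats = sum(c * (c - 1) / 2 for c in counts.values())
    return repeats / (n_triplets - 1)


def is_low_complexity(seq, threshold=2.0):
    """Flag a read whose score exceeds the (hypothetical) threshold."""
    return triplet_score(seq) > threshold


reads = ["A" * 20, "AT" * 10, "CA" * 10, "AATAAAGTGCTTACAGTGCAATACCG"]
kept = [r for r in reads if not is_low_complexity(r)]
```

With this threshold the polyA, "ATAT..." and "CACA..." reads are flagged and only the last read is kept.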
2) For some reads I may have an "N-rich" patch inside the read, for example:
AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG
I would ideally like to trim off everything starting at the "N-rich"
part. I was trying to implement something with 'vmatchPattern', but if I
allow for mismatches (for a more flexible search) I also get hits
that start before the run of Ns.
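The trimming described in 2) can be sketched as a window scan, again in plain Python rather than with 'vmatchPattern'. A cut position is accepted only if it is itself an N, so a hit can never start before the run of Ns. The function name, the window width of 6 and the minimum of 4 Ns are assumptions made for illustration.

```python
def trim_at_n_rich(seq, window=6, min_n=4):
    """Cut the read at the first N that opens an N-rich window.

    window and min_n are hypothetical defaults, not recommended values.
    """
    for i, base in enumerate(seq):
        if base != "N":
            continue  # only an N may start the trimmed region
        # Count Ns in the window beginning at this N.
        if seq[i:i + window].count("N") >= min_n:
            return seq[:i]
    return seq  # no N-rich patch found, keep the read whole


read = "AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG"
trimmed = trim_at_n_rich(read)  # "AATAAAGTGCTTACAGTG"
```

A read with only isolated Ns is returned unchanged, because no window starting at an N reaches the minimum count.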
Many thanks,
Cei
sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
i386-apple-darwin9.6.0
locale:
C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ShortRead_1.1.39 lattice_0.17-20 BSgenome_1.11.9 Biostrings_2.11.28
[5] IRanges_1.1.38 Biobase_2.3.10
loaded via a namespace (and not attached):
[1] Matrix_0.999375-20 grid_2.9.0