[Bioc-sig-seq] filtering adaptors again

Cei Abreu-Goodger cei at ebi.ac.uk
Thu Mar 26 20:18:21 CET 2009


Hi Lana,

Have you tried using the "trimLRPatterns" function?

I usually start with a small part of the primer, say the first ~10 
bases, padded by Ns to make up a read of full length.
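
A minimal sketch of what I mean, assuming 35-base reads and a recent
Biostrings. The primer is the one from your post; the toy reads are
copied from your output, and the names "seed", "read_len" and "min_len"
(and the cutoff of 15) are only placeholders for illustration. In
practice the subject would be something like sread(fq4).

```r
library(Biostrings)

# Primer from the original post; the other values are illustrative.
primer <- "TCGTATGCCGTCTTCTGCTTG"
read_len <- 35L

# First 10 bases of the primer, padded with Ns to the full read length.
seed <- substr(primer, 1, 10)
Rpattern <- paste0(seed, strrep("N", read_len - nchar(seed)))

# Toy reads copied from the post (assumption: real data would come
# from sread() on the ShortReadQ object).
reads <- DNAStringSet(c(
    "ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA",
    "CAGCAATCTCGTATGCCGTCTTCTGCTTGAAAAAA",
    "NTAGTACTCTGCGTTGTGGCCGCAGCCACCTCGGT"
))

# Rfixed = FALSE lets the Ns in the pattern act as wildcards, so the
# seed can match anywhere in the read and everything from the start
# of the match to the 3' end is trimmed.
trimmed <- trimLRPatterns(Rpattern = Rpattern, subject = reads,
                          Rfixed = FALSE)

# Keep only reads with a useful amount of sequence left.
min_len <- 15L
kept <- trimmed[width(trimmed) >= min_len]
```

Two caveats, as far as I understand the behaviour: with the default of
zero mismatches, even a single trailing base that matches the start of
the primer gets trimmed, so clean reads can lose a base or two. Also,
in the code you posted the distances are computed on clean(fq4) but
then used to index fq4 itself; if clean() dropped any reads, the index
no longer lines up, which may be why primer-free reads show up in the
d<5 subset.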

Cheers,

Cei

Lana Schaffer wrote:
> Hi,
> I have read the Feb 2009 archives and have been trying to
> filter out a lot of primer reads to see which short reads
> remain.
> The small RNA primer (TCGTATGCCGTCTTCTGCTTG) attached to
> a series of A's is the main contaminant in the reads, and
> it is what I would like to filter out.
> -------------------------------------------------------
> dist1 <- srdistance(clean(fq4), "TCGTATGCCGTCTTCTGCTTGAAAAAAAAAA")
> table(dist1[[1]])
>    4    5    6    7    8    9   10   11   12   13   14   15   16   17
> 9338  789  406  121 2094  240  184   55  332   78   90   25   68   16
>   18   19   20   21   22   23   24   25   26   28   29
>   62   31  166  550  623  640  318   65    6    1    4
> 
> f <- fq4[dist1[[1]] <5]
>    [1]    35 NTAGTACTCTGCGTTGTGGCCGCAGCCACCTCGGT
>    [2]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    [3]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    [4]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    [5]    35 NCTGGACTTGGAGTCAGAAGATCTCGTATGCCGTC
>    [6]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    [7]    35 GGTATGATTCTCGCATCTCGTATGCCGTCTTCTGC
>    [8]    35 GGTATGATTCTCGCATCTCGTATGCCGTCTCCTGC
>    [9]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    ...   ... ...
> [9363]    35 TCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA
> [9364]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
> [9365]    35 TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAA
> [9366]    35 ATATAATACAACCTGCTAAGTGATCTCGTATGCCG
> [9367]    35 ATCTCGTATGCCGTCTTCTGCTTGACAAAAAAAAA
> [9368]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAACAACAA
> [9369]    35 ATCTCGTATGCCGTCTTCTGCTTGAACCACACAAA
> [9370]    35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAAAACCA
> [9371]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
> 
> f <- fq4[dist1[[1]] >28]
> [1]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
> [2]    35 CGATCATCTCGTATGCCGTCTTCTGCTTGAAAAAA
> [3]    35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAACAACC
> [4]    35 CAGCAATCTCGTATGCCGTCTTCTGCTTGAAAAAA
> ---------------------------------------------------------
> You can see that I am not doing a good filtering job.
> d<5 is showing some sequences free of primer that I would
> want to save. 
> I have tried the polyn function, but that does not work for me
> when I use a series of 10-20 A's (<35).  
> 
> Would someone be able to give me some suggestions?
> 
> 
> sessionInfo()
> R version 2.9.0 Under development (unstable) (2009-02-12 r47905) 
> i386-pc-mingw32 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] ShortRead_1.1.50   lattice_0.17-20    BSgenome_1.11.9    Biostrings_2.11.42
> [5] IRanges_1.1.54
> loaded via a namespace (and not attached):
> [1] Biobase_2.3.11     grid_2.9.0         hwriter_1.1        Matrix_0.999375-20
> 
> 
> 
> Lana Schaffer
> Biostatistics/Informatics
> The Scripps Research Institute
> DNA Array Core Facility
> La Jolla, CA 92037
> (858) 784-2263
> (858) 784-2994
> schaffer at scripps.edu 
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing




