[Bioc-sig-seq] filtering adaptors again

Martin Morgan mtmorgan at fhcrc.org
Thu Mar 26 20:27:57 CET 2009


"Lana Schaffer" <schaffer at scripps.edu> writes:

> Hi,
> I have read the Feb 2009 archives and have been trying to
> filter a lot of primer reads to see what short reads
> remain.
> The small RNA primer (TCGTATGCCGTCTTCTGCTTG) attached to
> a series of A's is the main contaminant of the reads that
> I would like to filter.
> -------------------------------------------------------
> dist1 <- srdistance(clean(fq4), "TCGTATGCCGTCTTCTGCTTGAAAAAAAAAA")
> table(dist1[[1]])
>    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
> 9338  789  406  121 2094  240  184   55  332   78   90   25   68   16   62   31
>   20   21   22   23   24   25   26   28   29
>  166  550  623  640  318   65    6    1    4
>
> f <- fq4[dist1[[1]] <5]

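clean(fq4) != fq4: dist1 is computed on the cleaned reads, but you then
index the original fq4 with it. If this is your code, you're subsetting
the wrong object. (Reads [1]-[6] below start with N, which clean() would
have removed.)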
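
To illustrate the mismatch, here is a minimal base-R sketch on toy data.
It is not ShortRead code, and every name and value in it is made up for
illustration: the character vector 'reads' stands in for fq4, dropping
reads that contain N stands in for clean(), and adist() stands in for
srdistance(). It shows that a logical index built from the cleaned reads
selects different elements when applied to the original vector, and that
base R recycles the shorter index silently instead of raising an error.

```r
# Toy stand-ins (hypothetical data): 'reads' plays the role of fq4,
# and dropping reads that contain "N" mimics what clean() does.
adapter <- "TCGTATGC"
reads   <- c("NTAGTACT", "TCGTATGC", "GGCATTAC", "TCGTATGA", "ACGTTGCA")
cleaned <- reads[!grepl("N", reads)]

# Distances are computed on the cleaned reads, so the index has
# length(cleaned) elements, not length(reads).
d    <- as.vector(adist(cleaned, adapter))
near <- d < 2

# Index built from 'cleaned' but applied to 'reads': positions are
# shifted by the dropped read, and the short index is recycled.
wrong <- reads[near]

# Index and object match: only reads close to the adapter are kept.
right <- cleaned[near]

print(wrong)
print(right)
```

In the code quoted above, the analogous fix would be to keep the cleaned
object in a variable (say cfq4 <- clean(fq4), a name chosen here only for
illustration), compute srdistance() on it, and subset cfq4 rather than
fq4.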

Martin

>    [1]    35 NTAGTACTCTGCGTTGTGGCCGCAGCCACCTCGGT
>    [2]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    [3]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    [4]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    [5]    35 NCTGGACTTGGAGTCAGAAGATCTCGTATGCCGTC
>    [6]    35 NTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    [7]    35 GGTATGATTCTCGCATCTCGTATGCCGTCTTCTGC
>    [8]    35 GGTATGATTCTCGCATCTCGTATGCCGTCTCCTGC
>    [9]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>    ...   ... ...
> [9363]    35 TCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA
> [9364]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
> [9365]    35 TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAA
> [9366]    35 ATATAATACAACCTGCTAAGTGATCTCGTATGCCG
> [9367]    35 ATCTCGTATGCCGTCTTCTGCTTGACAAAAAAAAA
> [9368]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAACAACAA
> [9369]    35 ATCTCGTATGCCGTCTTCTGCTTGAACCACACAAA
> [9370]    35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAAAACCA
> [9371]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
>
> f <- fq4[dist1[[1]] >28]
> [1]    35 ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
> [2]    35 CGATCATCTCGTATGCCGTCTTCTGCTTGAAAAAA
> [3]    35 GTATGCCGTCTTCTGCTTGAAAAAAAAAAACAACC
> [4]    35 CAGCAATCTCGTATGCCGTCTTCTGCTTGAAAAAA
> ---------------------------------------------------------
> You can see that I am not doing a good filtering job.
> d<5 is showing some sequences free of primer that I would
> want to save. 
> I have tried the polyn function, but that does not work for me
> when I use a series of 10-20 A's (<35).  
>
> Would someone be able to give me some suggestions?
>
>
> sessionInfo()
> R version 2.9.0 Under development (unstable) (2009-02-12 r47905) 
> i386-pc-mingw32 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.1.50   lattice_0.17-20    BSgenome_1.11.9    Biostrings_2.11.42
> [5] IRanges_1.1.54    
> loaded via a namespace (and not attached):
> [1] Biobase_2.3.11     grid_2.9.0         hwriter_1.1        Matrix_0.999375-20
>
>
>
> Lana Schaffer
> Biostatistics/Informatics
> The Scripps Research Institute
> DNA Array Core Facility
> La Jolla, CA 92037
> (858) 784-2263
> (858) 784-2994
> schaffer at scripps.edu 
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


