[Bioc-sig-seq] Adapter removal

Thu Jul 17 18:35:12 CEST 2008

"Krys Kelly" <kak28 at cam.ac.uk> writes:

> I have inherited a pipeline for Solexa sequence data using Perl, Bioperl,
> SSAHA and mySQL.  As an R/Bioconducter user I am interested in ShortRead and
> BiostringsCinterfaceDemo.
>
> However, in the short term I need to use the current pipeline.  The imaging
> is done by the Sequencing Facility and we get fastq files with the 3'
> adapter still attached. The adapter removal is currently done by a Perl
> script which just keeps sequences which match any number of letters in
> [ACGT] followed by the first 8 letters of the adapter.  This seems pretty
> crude (e.g. only using 8 letters, not allowing for mismatches, not allowing
> for the diminishing quality along the length of the read).

In the very latest Biostrings, and thanks to Herve's skills, you can

> library(ShortRead)
> sp <- SolexaPath("/path/to/solexa/expt")
> pat <- "s_1_0001_seq.txt" # or any regexp to read multiple files
> seq <- readFastq(analysisPath(sp), patreadXStringColumns(baseCallPath(sp), pat,
+                           list(NULL, NULL, NULL, NULL, "DNAString"))[[1]]
> seq <- clean(seq) # no reads with uncalled bases

(or readFastq on analysisPath(sp), "s_1_sequence.txt", for instance,
but these are 'filtered' and it's not really clear that that is the
right place for a careful analysis to start), and then

> tables(seq)$top[1:5]
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGAT 
                                 186                                  126 
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAT 
                                  71                                   51 
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAA 
                                  16 
to remind me what the primer sequence looks like (!) and

> pd <- PDict(seq, max.mismatch=3)
> subj <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA")
> cnts <- countPDict(pd, subj, max.mismatch=3)

cnts now contains the number of times each read in seq matched the
primer with up to 3 mismatches (could be more, within reason), of
which there are

> sum(cnts!=0)
[1] 565

seq[cnts==0] gets me the non-primer sequences (though apparently this
is not entirely satisfactory, see the second entry!)

> tables(seq[cnts==0])$top[1:5]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGGTATGCCGTCTTCTGCTTAGA 
                                  71                                   12 
GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG 
                                   5                                    5 
CACCTTTTTCAGTTTTCCTCGCCATATTTCACGTCC 
                                   4 

sessionInfo() tells me

other attached packages:
[1] ShortRead_0.1.32  lattice_0.17-10   Biostrings_2.9.42 Biobase_2.1.7    

> Google has not revealed any algorithms or code for this part of the
> pipeline.  Does anyone know what algorithms are being used or, even better,
> could anyone point me in the direction of some code?

I guess you're asking in general. In terms of ELAND, I'm not really
sure what it does about primers; I think it actually keeps them in,
relying on failure to align to weed them out. With contaminants (which
can be specified in a separate file) ELAND aligns the reads to the
contaminants and discards matches. The code is 'Solexa Open Source' or
some such, and should be available from your friendly Solexa rep; the
contaminant stuff is in Gerald/GERALD.pl of the version of the code I
have access to.

There are some obvious additional problems, e.g., those poly-A
reads. Tools in Biostrings (e.g., alphabetFrequency) can be used in
these cases, too.

All of this seems pretty ad hoc, which is not really what you were
looking for, I suspect.

Martin

> Thanks
>
> Krys
>
>
> Dr Krystyna A Kelly
> Senior Research Associate
> David Baulcombe Group
>
> Department of Plant Sciences
> University of Cambridge
> Downing Street
> Cambridge CB2 3EA
> United Kingdom
>
> Tel: +44 (0)1223 333 915
> Fax: +44 (0)1223 333953
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793