[Bioc-sig-seq] Adapter removal

Thu Jul 17 15:47:27 CEST 2008

I have inherited a pipeline for Solexa sequence data using Perl, BioPerl,
SSAHA and MySQL.  As an R/Bioconductor user I am interested in ShortRead and
BiostringsCinterfaceDemo.

However, in the short term I need to use the current pipeline.  The imaging
is done by the Sequencing Facility and we get fastq files with the 3'
adapter still attached.  Adapter removal is currently done by a Perl
script that keeps only those sequences matching any number of letters in
[ACGT] followed by the first 8 letters of the adapter.  This seems pretty
crude (e.g. it uses only 8 letters, does not allow for mismatches, and does
not allow for the diminishing quality along the length of the read).
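
To make the current rule concrete, here is a minimal sketch of it in Python
rather than the actual Perl.  Several things here are assumptions for
illustration only: the adapter string is a made-up placeholder (not the
facility's adapter), the function name trim_adapter is hypothetical, and I am
assuming the script cuts at the leftmost occurrence of the 8-letter seed and
returns the part of the read in front of it.

```python
import re

# Placeholder adapter for illustration only; NOT the real 3' adapter.
ADAPTER = "ACGTTGCAAGCTTCGA"


def trim_adapter(read, adapter=ADAPTER, seed_len=8):
    """Return the insert preceding the adapter seed, or None if no match.

    Mirrors the crude rule described above: the read must consist of
    letters in [ACGT] followed by an exact copy of the first `seed_len`
    letters of the adapter.  No mismatches are tolerated and base
    qualities are ignored.
    """
    seed = adapter[:seed_len]
    # Non-greedy prefix so the leftmost seed occurrence wins (an assumption).
    pattern = re.compile(r"^([ACGT]*?)" + re.escape(seed))
    match = pattern.match(read)
    return match.group(1) if match else None


if __name__ == "__main__":
    reads = [
        "GGGCCCAAA" + ADAPTER,        # clean read: insert is recovered
        "GGGCCCAAA" + "ACGTTGGA",     # one mismatch in the seed: dropped
        "GGNCC" + ADAPTER,            # N before the seed: dropped
    ]
    for r in reads:
        print(r, "->", trim_adapter(r))
```

The second and third reads show the limitations mentioned above: a single
sequencing error in the seed, or an N call upstream of it, is enough for the
whole read to be discarded.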

Searching Google has not turned up any algorithms or code for this part of
the pipeline.  Does anyone know what algorithms are being used or, even
better, could anyone point me in the direction of some code?

Thanks

Krys

Dr Krystyna A Kelly
Senior Research Associate
David Baulcombe Group

Department of Plant Sciences
University of Cambridge
Downing Street
Cambridge CB2 3EA
United Kingdom

Tel: +44 (0)1223 333 915
Fax: +44 (0)1223 333953