[Bioc-sig-seq] Adapter removal

Krys Kelly kak28 at cam.ac.uk
Thu Jul 17 17:18:08 CEST 2008


Hi Harris

Thanks for this.  I started doing something like this in Perl.  

Regards

Krys


-----Original Message-----
From: Harris A. Jaffee [mailto:hj at jhu.edu] 
Sent: 17 July 2008 16:00
To: Krys Kelly
Cc: bioc-sig-sequencing at r-project.org
Subject: Re: [Bioc-sig-seq] Adapter removal

On Jul 17, 2008, at 9:47 AM, Krys Kelly wrote:
> I have inherited a pipeline for Solexa sequence data ...  This  
> seems pretty
> crude

> could anyone point me in the direction of some code?


[ This may now be silly, but I will send it anyway. ]

On (quasi) UNIX, you can use something like the script below, which  
elides the
largest initial portion of our 3' adapter and then the largest  
terminal portion
of our 5' adapter, minimal length 3 in both cases, from a data file  
where the
read (possibly containing ".") was the 5th field.

To allow 1 error, (using awk, for example) you can blow each pattern  
in there
up into a regular expression as in this sed edit command (the \'s  
don't belong
but were necessary for some reason):

's,.CGTATGCCGTCTTCTGCTTG$\|T.GTATGCCGTCTTCTGCTTG$\| 
TC.TATGCCGTCTTCTGCTTG$\|TCG.ATGCCGTCTTCTGCTTG$\|TCGT.TGCCGTCTTCTGCTTG$ 
\|TCGTA.GCCGTCTTCTGCTTG$\|TCGTAT.CCGTCTTCTGCTTG$\| 
TCGTATG.CGTCTTCTGCTTG$\|TCGTATGC.GTCTTCTGCTTG$\|TCGTATGCC.TCTTCTGCTTG$ 
\|TCGTATGCCG.CTTCTGCTTG$\|TCGTATGCCGT.TTCTGCTTG$\| 
TCGTATGCCGTC.TCTGCTTG$\|TCGTATGCCGTCT.CTGCTTG$\|TCGTATGCCGTCTT.TGCTTG$ 
\|TCGTATGCCGTCTTC.GCTTG$\|TCGTATGCCGTCTTCT.CTTG$\| 
TCGTATGCCGTCTTCTG.TTG$\|TCGTATGCCGTCTTCTGC.TG$\|TCGTATGCCGTCTTCTGCT.G$ 
\|TCGTATGCCGTCTTCTGCTT.$,>,'

In our data, the whole 3' adapter may occur at the 5' end, making the  
read
worthless, or in the interior!

All of this probably translates into R, but this was quicker for me.   
-Harris

--------------
#!/bin/sh

DIR=Processed

fgrep -v . |
awk '{print $5}' |
sed \
-e 's,TCGTATGCCGTCTTCTGCTTG$,>,' \
-e 's,TCGTATGCCGTCTTCTGCTT$,>,' \
-e 's,TCGTATGCCGTCTTCTGCT$,>,' \
-e 's,TCGTATGCCGTCTTCTGC$,>,' \
-e 's,TCGTATGCCGTCTTCTG$,>,' \
-e 's,TCGTATGCCGTCTTCT$,>,' \
-e 's,TCGTATGCCGTCTTC$,>,' \
-e 's,TCGTATGCCGTCTT$,>,' \
-e 's,TCGTATGCCGTCT$,>,' \
-e 's,TCGTATGCCGTC$,>,' \
-e 's,TCGTATGCCGT$,>,' \
-e 's,TCGTATGCCG$,>,' \
-e 's,TCGTATGCC$,>,' \
-e 's,TCGTATGC$,>,' \
-e 's,TCGTATG$,>,' \
-e 's,TCGTAT$,>,' \
-e 's,TCGTA$,>,' \
-e 's,TCGT$,>,' \
-e 's,TCG$,>,' \
-e 's,^GTTCAGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^TTCAGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^TCAGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^CAGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^AGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^GAGTTCTACAGTCCGACGATC,<,' \
-e 's,^AGTTCTACAGTCCGACGATC,<,' \
-e 's,^GTTCTACAGTCCGACGATC,<,' \
-e 's,^TTCTACAGTCCGACGATC,<,' \
-e 's,^TCTACAGTCCGACGATC,<,' \
-e 's,^CTACAGTCCGACGATC,<,' \
-e 's,^TACAGTCCGACGATC,<,' \
-e 's,^ACAGTCCGACGATC,<,' \
-e 's,^CAGTCCGACGATC,<,' \
-e 's,^AGTCCGACGATC,<,' \
-e 's,^GTCCGACGATC,<,' \
-e 's,^TCCGACGATC,<,' \
-e 's,^CCGACGATC,<,' \
-e 's,^CGACGATC,<,' \
-e 's,^GACGATC,<,' \
-e 's,^ACGATC,<,' \
-e 's,^CGATC,<,' \
-e 's,^GATC,<,' \
-e 's,^ATC,<,' | tr -d '<>' |
         awk 'length>15' | sort | uniq -c |
         awk '{print >DIR "/" length($2)}' DIR=$DIR



More information about the Bioc-sig-sequencing mailing list