[Bioc-sig-seq] Adapter removal

Cei Abreu-Goodger cei at sanger.ac.uk
Thu Jul 17 18:08:59 CEST 2008


I also do something like that in Perl...

Regarding MAQ: when I tried it, it could _not_ deal with linker
sequences in any useful way, at least not for small RNA runs. It may
have improved since then, and it is _really_ fast.

Cheers,

Cei

On Thu, 17 Jul 2008, Krys Kelly wrote:

> Hi Harris
>
> Thanks for this.  I started doing something like this in Perl.
>
> Regards
>
> Krys
>
>
> -----Original Message-----
> From: Harris A. Jaffee [mailto:hj at jhu.edu]
> Sent: 17 July 2008 16:00
> To: Krys Kelly
> Cc: bioc-sig-sequencing at r-project.org
> Subject: Re: [Bioc-sig-seq] Adapter removal
>
> On Jul 17, 2008, at 9:47 AM, Krys Kelly wrote:
>> I have inherited a pipeline for Solexa sequence data ...  This seems
>> pretty crude
>
>> could anyone point me in the direction of some code?
>
>
> [ This may now be silly, but I will send it anyway. ]
>
> On a (quasi) UNIX system, you can use something like the script below. It
> takes a data file in which the read (possibly containing ".") is the 5th
> field and discards lines containing ".". It then removes the longest
> leading portion of our 3' adapter found at the end of the read, followed
> by the longest trailing portion of our 5' adapter found at the start of
> the read, with a minimum match length of 3 in both cases.
>
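> To allow 1 mismatch, you can (using awk, for example) expand each pattern
> in there into a regular expression with a "." at each position in turn.
> The backslashes before each | are needed because a bare | is a literal
> character in sed's basic regular expressions; GNU sed writes alternation
> as \|. For the full 3' adapter, the sed edit command looks like this: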
>
> 's,.CGTATGCCGTCTTCTGCTTG$\|
> T.GTATGCCGTCTTCTGCTTG$\|
> TC.TATGCCGTCTTCTGCTTG$\|
> TCG.ATGCCGTCTTCTGCTTG$\|
> TCGT.TGCCGTCTTCTGCTTG$\|
> TCGTA.GCCGTCTTCTGCTTG$\|
> TCGTAT.CCGTCTTCTGCTTG$\|
> TCGTATG.CGTCTTCTGCTTG$\|
> TCGTATGC.GTCTTCTGCTTG$\|
> TCGTATGCC.TCTTCTGCTTG$\|
> TCGTATGCCG.CTTCTGCTTG$\|
> TCGTATGCCGT.TTCTGCTTG$\|
> TCGTATGCCGTC.TCTGCTTG$\|
> TCGTATGCCGTCT.CTGCTTG$\|
> TCGTATGCCGTCTT.TGCTTG$\|
> TCGTATGCCGTCTTC.GCTTG$\|
> TCGTATGCCGTCTTCT.CTTG$\|
> TCGTATGCCGTCTTCTG.TTG$\|
> TCGTATGCCGTCTTCTGC.TG$\|
> TCGTATGCCGTCTTCTGCT.G$\|
> TCGTATGCCGTCTTCTGCTT.$,>,'
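
A minimal sketch of how the one-mismatch expression above could be generated with awk rather than typed by hand. The function name one_mismatch_pattern is hypothetical (it does not appear in the original post), and the \| alternation assumes GNU sed, since it is not part of POSIX basic regular expressions:

```shell
#!/bin/sh
# Hypothetical helper, not from the original post: print a sed basic regular
# expression that matches the adapter given as $1 at the end of a line, with
# any single base replaced by the wildcard ".".
one_mismatch_pattern() {
    awk -v a="$1" 'BEGIN {
        n = length(a)
        for (i = 1; i <= n; i++) {
            # Put "." at position i and anchor this alternative at end of line.
            alt = substr(a, 1, i - 1) "." substr(a, i + 1) "$"
            # Join the alternatives with \| (GNU sed alternation).
            pat = (i == 1) ? alt : pat "\\|" alt
        }
        print pat
    }'
}
```

For example, `sed "s,$(one_mismatch_pattern TCGTATGCCGTCTTCTGCTTG),>,"` would apply it to the full 3' adapter. Like the hand-written version, this covers substitutions only, not insertions or deletions.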
>
> In our data, the whole 3' adapter may occur at the 5' end, making the read
> worthless, or in the interior!
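
Because the patterns in the script are anchored to the ends of the read, they will not catch a complete 3' adapter at the start or in the interior of a read. A hedged sketch of one way to flag such reads follows. The function name and the output labels are assumptions, not part of the original script, and it only looks for exact, complete copies of the adapter:

```shell
#!/bin/sh
# 3' adapter, copied from the longest 3' pattern in the script.
ADAPTER3=TCGTATGCCGTCTTCTGCTTG

# Hypothetical helper: for each read on standard input (one per line), print
# a label saying where the first complete copy of the 3' adapter occurs,
# then a tab, then the read itself.
classify_adapter_position() {
    awk -v a="$ADAPTER3" '{
        p = index($0, a)    # 1-based position of the first exact match, 0 if none
        if (p == 0) {
            label = "none"
        } else if (p == 1) {
            label = "start"
        } else if (p + length(a) - 1 == length($0)) {
            label = "end"
        } else {
            label = "interior"
        }
        print label "\t" $0
    }'
}
```

Reads labelled "start" could then be dropped, and reads labelled "interior" inspected separately, for instance with `awk '{print $5}' datafile | classify_adapter_position`, where datafile is a placeholder name.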
>
> All of this probably translates into R, but this was quicker for me.
> -Harris
>
> --------------
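> #!/bin/sh
> # Reads the data file on standard input and writes the trimmed reads,
> # with their counts, to one file per read length under $DIR.
>
> DIR=Processed
> mkdir -p "$DIR"
>
> # Drop lines containing ".", then keep only the read (5th field).
> grep -F -v . |
> awk '{print $5}' |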
> sed \
> -e 's,TCGTATGCCGTCTTCTGCTTG$,>,' \
> -e 's,TCGTATGCCGTCTTCTGCTT$,>,' \
> -e 's,TCGTATGCCGTCTTCTGCT$,>,' \
> -e 's,TCGTATGCCGTCTTCTGC$,>,' \
> -e 's,TCGTATGCCGTCTTCTG$,>,' \
> -e 's,TCGTATGCCGTCTTCT$,>,' \
> -e 's,TCGTATGCCGTCTTC$,>,' \
> -e 's,TCGTATGCCGTCTT$,>,' \
> -e 's,TCGTATGCCGTCT$,>,' \
> -e 's,TCGTATGCCGTC$,>,' \
> -e 's,TCGTATGCCGT$,>,' \
> -e 's,TCGTATGCCG$,>,' \
> -e 's,TCGTATGCC$,>,' \
> -e 's,TCGTATGC$,>,' \
> -e 's,TCGTATG$,>,' \
> -e 's,TCGTAT$,>,' \
> -e 's,TCGTA$,>,' \
> -e 's,TCGT$,>,' \
> -e 's,TCG$,>,' \
> -e 's,^GTTCAGAGTTCTACAGTCCGACGATC,<,' \
> -e 's,^TTCAGAGTTCTACAGTCCGACGATC,<,' \
> -e 's,^TCAGAGTTCTACAGTCCGACGATC,<,' \
> -e 's,^CAGAGTTCTACAGTCCGACGATC,<,' \
> -e 's,^AGAGTTCTACAGTCCGACGATC,<,' \
> -e 's,^GAGTTCTACAGTCCGACGATC,<,' \
> -e 's,^AGTTCTACAGTCCGACGATC,<,' \
> -e 's,^GTTCTACAGTCCGACGATC,<,' \
> -e 's,^TTCTACAGTCCGACGATC,<,' \
> -e 's,^TCTACAGTCCGACGATC,<,' \
> -e 's,^CTACAGTCCGACGATC,<,' \
> -e 's,^TACAGTCCGACGATC,<,' \
> -e 's,^ACAGTCCGACGATC,<,' \
> -e 's,^CAGTCCGACGATC,<,' \
> -e 's,^AGTCCGACGATC,<,' \
> -e 's,^GTCCGACGATC,<,' \
> -e 's,^TCCGACGATC,<,' \
> -e 's,^CCGACGATC,<,' \
> -e 's,^CGACGATC,<,' \
> -e 's,^GACGATC,<,' \
> -e 's,^ACGATC,<,' \
> -e 's,^CGATC,<,' \
> -e 's,^GATC,<,' \
> -e 's,^ATC,<,' | tr -d '<>' |
>         awk 'length>15' | sort | uniq -c |
>         awk '{print > (DIR "/" length($2))}' DIR="$DIR"
>
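
The long list of -e expressions can also be generated in a loop. The sketch below is one possible way to do that, not the original author's script. The function names are hypothetical, and ADAPTER5 is assumed to be the longest 5' pattern that appears in the script above:

```shell
#!/bin/sh
# Adapters, copied from the longest patterns in the script.
ADAPTER3=TCGTATGCCGTCTTCTGCTTG
ADAPTER5=GTTCAGAGTTCTACAGTCCGACGATC

# Print one sed command per leading portion of adapter $1, longest first,
# down to $2 bases, each anchored at the end of the read.
prefix_edits() {
    awk -v a="$1" -v min="$2" 'BEGIN {
        for (n = length(a); n >= min; n--)
            print "s," substr(a, 1, n) "$,>,"
    }'
}

# Print one sed command per trailing portion of adapter $1, longest first,
# down to $2 bases, each anchored at the start of the read.
suffix_edits() {
    awk -v a="$1" -v min="$2" 'BEGIN {
        for (n = length(a); n >= min; n--)
            print "s,^" substr(a, length(a) - n + 1) ",<,"
    }'
}

# Trim reads on standard input (one per line); the "<" and ">" markers
# stop shorter patterns from matching after a longer one has been removed.
trim_adapters() {
    sed -e "$(prefix_edits "$ADAPTER3" 3; suffix_edits "$ADAPTER5" 3)" |
    tr -d '<>'
}
```

With a data file laid out as described above, `awk '{print $5}' datafile | trim_adapters` should give the same trimmed reads as the sed stage of the script (datafile is a placeholder name). Changing the second argument of prefix_edits or suffix_edits changes the minimum match length.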




