[Bioc-sig-seq] identifying a common motif in a set of sequences

Muino, Jose jose.muino at wur.nl
Tue Feb 9 14:07:17 CET 2010


Hi,

Perhaps you can try R's "sub" function, for example to remove the
pattern from the reads. I am not sure whether there is a more efficient
way, but it should work.
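
A minimal sketch of what this could look like in base R, assuming the
reads are available as a plain character vector (a DNAStringSet would
first need as.character(), which may be memory-hungry for 16 million
reads). The variable names (reads, motif, has_motif, trimmed) are
illustrative only, and the four reads are copied from the output quoted
below:

```r
# A few reads copied from the quoted DNAStringSet output (illustrative subset).
reads <- c(
  "GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG",
  "GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT",
  "GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC",
  "CAGTCACGGTCAAAAAATACATACTAAACACCTACT"
)
motif <- "GGCCACGCGTCGACTAGTAC"

# Flag reads that contain the motif anywhere (literal match, no regex).
has_motif <- grepl(motif, reads, fixed = TRUE)

# Remove the first occurrence of the motif from each read;
# reads without the motif are returned unchanged.
trimmed <- sub(motif, "", reads, fixed = TRUE)

print(sum(has_motif))   # number of reads carrying the motif
print(nchar(trimmed))   # read lengths after removing the motif
```

Note that this only removes a motif that is already known; it does not
discover new motifs, and it matches exactly, so reads with mismatches or
N calls inside the motif are left untouched.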

By the way, if you google the sequence (GGCCACGCGTCGACTAGTAC) you will
find it in several papers. I have the impression that sometimes it is
used as a primer for the generation of the first cDNA strand.

Dr. Jose M Muino
Plant Research International B.V.
P.O. Box 619, 6700 AP Wageningen, The Netherlands
Phone: +0317-481122.
E-mail: jose.muino at wur.nl
http://www.pri.wur.nl 
 

> -----Original Message-----
> From: bioc-sig-sequencing-bounces at r-project.org 
> [mailto:bioc-sig-sequencing-bounces at r-project.org] On Behalf 
> Of Johannes Rainer
> Sent: Tuesday, 9 February 2010 13:37
> To: bioc-sig-sequencing at r-project.org
> Subject: [Bioc-sig-seq] identifying a common motif in a set 
> of sequences
> 
> dear all,
> 
> I'm wondering if there is already a function implemented in 
> any Bioconductor package that allows one to identify a common 
> sequence pattern in a set of sequences.
> 
> I'm asking this because in my ChIP-seq data, out of the 20 million 
> reads only about 3 million can be aligned to the (human) genome 
> (using bowtie), and, by looking at the sequences that cannot 
> be aligned (see below), there seem to be certain sequence 
> patterns (like GGCCACGCGTCGACTAGTAC). Actually I have 
> absolutely no idea where these sequences could come from. 
> They are not adapter or primer sequences, since I've aligned 
> all adapter/primer sequences I've got from the provider 
> against these sequences.
> 
> Is there any way to extract common sequence patterns (like 
> the GGCCACGCGTCGACTAGTAC) in an automated manner from these sequences?
> Besides that, has anybody experienced the same problem?
> 
> bests, jo
> 
> 
>   A DNAStringSet instance of length 16196935
>            width seq
>        [1]    36 GGCCCCGCGTCGCCTAGTACTACATAAACAATGACC
>        [2]    36 GGCGATGACCTTCTTGTGACCGTTGTGCATGCCGNC
>        [3]    36 GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC
>        [4]    36 GTTTCCCAGTCACGGTCGTCCTTTTATTCTGACCTG
>        [5]    36 GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG
>        [6]    36 GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT
>        [7]    36 GGCCACGCGTCGACTAGTACAAAGGACATCACGCCG
>        [8]    36 GGCCACGCGTCGACTAGTACAGAGTAAACAACGACC
>        [9]    36 CAGTCACGGTCAAAAAATACATACTAAACACCTACT
>        ...   ... ...
> [16196927]    36 CAGTCACGGTCTGGCGGNATNNTTTTTGTACTAGTC
> [16196928]    36 TAGCCAGCCAAGCCAGCNAANNCAGCCATCCAGCCA
> [16196929]    36 GCGCCCCTGTCGCGGACNACNNGTAAGCAGCTCTCT
> [16196930]    36 ACTACACCCCTTAGCAANGANNATCTGAGCCTCCAT
> [16196931]    36 ACTACAAGCAAACAGTGNTCNNCTATGGTCCAGATC
> [16196932]    36 GCAGCCACGTCCCGATCNCCNNTTTGAGTGCGTGCG
> [16196933]    36 GGCCACGCGTCGACTAGNACNNCGAAAAATACGACC
> [16196934]    36 GGCCACGCGTCGACTAGTACNNAAAAAACAACGCCT
> [16196935]    36 AGTCACGGTCAAGTAACACANNAACAGAAAACCAAA
> 
> --
> Johannes Rainer, PhD
> Bioinformatics Group,
> Division Molecular Pathophysiology,
> Biocenter, Medical University Innsbruck, Fritz-Pregl-Str 
> 3/IV, 6020 Innsbruck, Austria and Tyrolean Cancer Research 
> Institute Innrain 66, 6020 Innsbruck, Austria
> 
> Tel.:     +43 512 570485 13
> Email:  johannes.rainer at i-med.ac.at
>            johannes.rainer at tcri.at
> URL:   http://bioinfo.i-med.ac.at