[R] analyze amino acid sequence (composition)of proteins

Mon Jun 19 20:53:33 CEST 2006

>i am a newer to R and i am doing some protein category classification based on
>the amino acid sequence.while i have some questions urgently.

I'm not sure to understand exactly what you are trying to do, but if
this is a kind of clustering from the amino-acid composition of
proteins beware that the genomic G+C content is a major confounding
factor.

>1. any packages for analysis amino acid sequence

There are some utilities in the seqinR package to retrieve protein
sequences from various protein databases such as SwissProt. The
list of available databases (including nucleic databases) is given by:

   library(seqinr)
   choosebank()
  [1] "genbank"      "embl"         "emblwgs"      "swissprot"    "ensembl"
  [6] "ensemblnew"   "emglib"       "nrsub"        "nbrf"         "hobacnucl"
[11] "hobacprot"    "hovernucl"    "hoverprot"    "hogennucl"    "hogenprot"
[16] "hoverclnu"    "hoverclpr"    "homolensprot" "homolensnucl" "HAMAPnucl"
[21] "HAMAPprot"    "hoppsigen"    "nurebnucl"    "nurebprot"    "taxobacgen"
[26] "greview"      "hogendnucl"   "hogendprot"   "refseq"

More info with the infobank argument set to TRUE:

   choosebank(infobank = TRUE)[1:5,]
        bank status 
info
1   genbank     on             GenBank Rel. 154 (15 June 2006) Last 
Updated: Jun 18, 2006
2      embl     on                                   EMBL Library 
Release 86 (March 2006)
3   emblwgs     on           EMBL Whole Genome Shotgun sequences 
Release 86  (March 2006)
4 swissprot     on UniProt Rel. 7 (SWISS-PROT 49 + TrEMBL 32): Last 
Updated: Jun  6, 2006
5   ensembl     on 
Ensembl databases rel 24

There are also some tools to compute protein indices such as the
Kyte and Doolittle hydrophaty index, or the PI (Isoelectric point
of a protein). See the examples in this document:
http://pbil.univ-lyon1.fr/software/SeqinR/seqinr_1_0-4.pdf

Most protein indices are linear forms on amino acid
relative frequencies whose coefficients are available in the
AAindex (http://www.genome.jp/aaindex/).

>2. given two sequences "AAA" and "BBB",how can i combine them into "AAABBB"

paste() is your friend:

   paste("AAA", "BBB", sep = "")
[1] "AAABBB"

Or define your own infix binary operator for convenience:

   "%+%" <- function(...) paste(..., sep = "")
   "AAA" %+% "BBB" %+% "CCC"
[1] "AAABBBCCC"

>3. based on "AAABBB",how can i get some statistics of this string such as how
>many letters,how many "A"s in the string.

  table(s2c("AAABBB"))

A B
3 3

With a real protein sequence:

  prot <- read.fasta(File = system.file("sequences/seqAA.fasta", 
package = "seqinr"), seqtype = "AA")
  table(prot)
prot
  *  A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
  1  8  6  6 18  6  8  1  9 14 29  5  7 10  9 13 16  7  6  3  1

If you want the three letter code instead:

   table(prot) -> tmp
   names(tmp) <- aaa(names(tmp))
   tmp
Stp Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser 
Thr Val Trp Tyr
   1   8   6   6  18   6   8   1   9  14  29   5   7  10   9  13  16 
7   6   3   1

Try also:

AAstat(prot[[1]])

Hope this helps,

-- 
Jean R. Lobry            (lobry at biomserv.univ-lyon1.fr)
Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - LYON I,
43 Bd 11/11/1918, F-69622 VILLEURBANNE CEDEX, FRANCE
allo  : +33 472 43 12 87     fax    : +33 472 43 13 88
http://pbil.univ-lyon1.fr/members/lobry/