[R] analyze amino acid sequence (composition)of proteins
Jean lobry
lobry at biomserv.univ-lyon1.fr
Mon Jun 19 20:53:33 CEST 2006
>i am a newer to R and i am doing some protein category classification based on
>the amino acid sequence.while i have some questions urgently.
I'm not sure to understand exactly what you are trying to do, but if
this is a kind of clustering from the amino-acid composition of
proteins beware that the genomic G+C content is a major confounding
factor.
>1. any packages for analysis amino acid sequence
There are some utilities in the seqinR package to retrieve protein
sequences from various protein databases such as SwissProt. The
list of available databases (including nucleic databases) is given by:
library(seqinr)
choosebank()
[1] "genbank" "embl" "emblwgs" "swissprot" "ensembl"
[6] "ensemblnew" "emglib" "nrsub" "nbrf" "hobacnucl"
[11] "hobacprot" "hovernucl" "hoverprot" "hogennucl" "hogenprot"
[16] "hoverclnu" "hoverclpr" "homolensprot" "homolensnucl" "HAMAPnucl"
[21] "HAMAPprot" "hoppsigen" "nurebnucl" "nurebprot" "taxobacgen"
[26] "greview" "hogendnucl" "hogendprot" "refseq"
More info with the infobank argument set to TRUE:
choosebank(infobank = TRUE)[1:5,]
bank status
info
1 genbank on GenBank Rel. 154 (15 June 2006) Last
Updated: Jun 18, 2006
2 embl on EMBL Library
Release 86 (March 2006)
3 emblwgs on EMBL Whole Genome Shotgun sequences
Release 86 (March 2006)
4 swissprot on UniProt Rel. 7 (SWISS-PROT 49 + TrEMBL 32): Last
Updated: Jun 6, 2006
5 ensembl on
Ensembl databases rel 24
There are also some tools to compute protein indices such as the
Kyte and Doolittle hydrophaty index, or the PI (Isoelectric point
of a protein). See the examples in this document:
http://pbil.univ-lyon1.fr/software/SeqinR/seqinr_1_0-4.pdf
Most protein indices are linear forms on amino acid
relative frequencies whose coefficients are available in the
AAindex (http://www.genome.jp/aaindex/).
>2. given two sequences "AAA" and "BBB",how can i combine them into "AAABBB"
paste() is your friend:
paste("AAA", "BBB", sep = "")
[1] "AAABBB"
Or define your own infix binary operator for convenience:
"%+%" <- function(...) paste(..., sep = "")
"AAA" %+% "BBB" %+% "CCC"
[1] "AAABBBCCC"
>3. based on "AAABBB",how can i get some statistics of this string such as how
>many letters,how many "A"s in the string.
table(s2c("AAABBB"))
A B
3 3
With a real protein sequence:
prot <- read.fasta(File = system.file("sequences/seqAA.fasta",
package = "seqinr"), seqtype = "AA")
table(prot)
prot
* A C D E F G H I K L M N P Q R S T V W Y
1 8 6 6 18 6 8 1 9 14 29 5 7 10 9 13 16 7 6 3 1
If you want the three letter code instead:
table(prot) -> tmp
names(tmp) <- aaa(names(tmp))
tmp
Stp Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser
Thr Val Trp Tyr
1 8 6 6 18 6 8 1 9 14 29 5 7 10 9 13 16
7 6 3 1
Try also:
AAstat(prot[[1]])
Hope this helps,
--
Jean R. Lobry (lobry at biomserv.univ-lyon1.fr)
Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - LYON I,
43 Bd 11/11/1918, F-69622 VILLEURBANNE CEDEX, FRANCE
allo : +33 472 43 12 87 fax : +33 472 43 13 88
http://pbil.univ-lyon1.fr/members/lobry/
More information about the R-help
mailing list