[BioC] finding and averaging replicate gene records
Sean Davis
sdavis2 at mail.nih.gov
Wed Mar 16 13:02:35 CET 2005
On Mar 16, 2005, at 2:33 AM, zhihua li wrote:
> Hi netter!
>
> In most microarray slides a single gene will be represented by
> multiple items. Sometimes this is unforeseeable because they have
> different GenBank accession numbers, and you will not find them until
> you get a UniGene list for all your gene items.
>
> Now I have a data frame. The rows are gene records: the 1st column is
> GenBank accession numbers, the 2nd column is UniGene IDs, and the
> columns from the 3rd on are expression values in different
> conditions. All the accession numbers are unique, but through the
> UniGene IDs I can find that some items, though they have different
> accession numbers, in fact share the same UniGene ID. I would like to
> find the gene records containing replicate UniGene IDs and merge them
> into one record by averaging the expression values within each
> condition.
>
> Could anyone give me a clue about how to write the code? Or are there
> any contributed functions that can do this?
If, after my last email, you still want to do this, look at ?aggregate.
# Set up example
> df <- data.frame(unigene=rep(c(letters[1:20]),5),matrix(rnorm(500),ncol=5))
> dim(df)
[1] 100 6
> df[1:5,]
  unigene          X1         X2         X3         X4         X5
1       a  0.30812107 -0.5310621 -0.9040957  0.7344379 -0.3356904
2       b -0.02764356  0.6196045 -1.2049073  1.3074086  1.7878118
3       c  0.79936647 -0.3430772  1.3319157 -0.1716195  1.5824703
4       d -1.52298039  0.7400511  1.6654934 -0.4796782 -1.6517931
5       e  0.20252950  0.6735963 -0.8631246 -1.2338265  0.8597014
# Aggregate the array values by "unigene" using mean.
> df.unigene <- aggregate(df[,2:6],by=list(df$unigene),mean)
> df.unigene
   Group.1          X1         X2           X3          X4          X5
1        a  0.27894974  0.3096306 -0.157369445 -0.02390716 -0.79865210
2        b -0.04005511  0.2069963  0.058276319  0.37695956  0.58892920
3        c  0.53853115 -0.7227620  0.542803169  0.72844079  0.33116364
4        d  0.04374438 -0.3302130  1.492462908 -0.19048229 -0.90463987
5        e -0.22403553  0.5079245  0.627224848 -1.30206042 -0.16849414
6        f -0.41708465 -0.9070749  0.133871146 -0.21337473 -0.20061087
7        g -0.38204229  0.6069678  0.050874510 -0.29334777 -0.11172384
8        h  0.58768574 -0.4863774  0.120376561 -0.31349966 -0.23951493
9        i -0.80005434 -0.3891139 -0.001995542 -0.17148142  0.06971404
10       j -0.35626038  0.8415595 -0.207348416  0.03932772 -0.09372701
11       k -0.30889392 -1.0870044 -0.447545956 -0.48184160 -0.10491062
12       l -0.47169100 -0.1602827  1.084106985 -0.26736429  0.08239815
13       m -0.12285248 -0.4367895  0.354743839  0.10013901  0.42580119
14       n -0.17691859 -0.8934232  0.399016113  0.73876068  0.61432185
15       o -0.08250122  0.6402547  0.029047584 -0.30060666  0.36726071
16       p -0.20336659  0.2853576 -0.272979841 -0.57747797  0.24284977
17       q  0.00947679 -0.3849657 -0.198965209 -0.38048787 -0.87557376
18       r  0.30445158  0.4110414  0.181761757 -0.21715431  0.23009438
19       s -0.30325431 -0.1010338 -0.298426526 -1.23178516 -0.37827590
20       t -0.30316005 -0.4389324 -1.050242565  0.12818715 -0.31785596
> dim(df.unigene)
[1] 20 6
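
The example above has no accession column, so here is a rough sketch of
the same idea applied to the layout described in the question (accession
numbers in column 1, UniGene IDs in column 2, conditions from column 3
on). The data frame "expr", its column names, and all of its values are
made up for illustration; only duplicated() and aggregate() from base R
are used.

```r
# Hypothetical data in the layout described in the question:
# column 1 = GenBank accession (unique), column 2 = UniGene ID,
# columns 3 onward = expression values per condition.
expr <- data.frame(
  accession = c("AA0001", "AA0002", "AA0003", "AA0004", "AA0005"),
  unigene   = c("Hs.1", "Hs.2", "Hs.1", "Hs.3", "Hs.2"),
  cond1     = c(1, 2, 3, 4, 6),
  cond2     = c(10, 20, 30, 40, 60),
  stringsAsFactors = FALSE
)

# Flag every row whose UniGene ID occurs more than once.
is.rep <- expr$unigene %in% expr$unigene[duplicated(expr$unigene)]
replicated <- expr[is.rep, ]

# Average each condition column within each UniGene ID.
# drop = FALSE keeps a data frame even if there is only one condition.
expr.unigene <- aggregate(expr[, 3:ncol(expr), drop = FALSE],
                          by = list(unigene = expr$unigene),
                          FUN = mean)
```

Naming the element in by= gives the grouping column the name "unigene"
instead of "Group.1". The accession numbers are dropped here, since a
merged record no longer corresponds to a single accession.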