[BioC] finding and averaging replicate gene records

Sean Davis sdavis2 at mail.nih.gov
Wed Mar 16 13:02:35 CET 2005


On Mar 16, 2005, at 2:33 AM, zhihua li wrote:

> Hi netter!
>
> In most microarray slides a single gene will be represented by 
> multiple items. Sometimes this is unforeseeable because they have different 
> GenBank accession numbers and you will not find them until you get a 
> UniGene list for all your gene items.
>
> Now I have a data frame. The rows are gene records (accession number, 
> UniGene ID and expression values in different conditions): the 1st 
> column is GenBank accession numbers, the 2nd column is UniGene IDs, and 
> from the 3rd column on are the different conditions. All the accession 
> numbers are unique, but through the UniGene IDs I can see that some 
> items, though they have different accession numbers, in fact share 
> the same UniGene ID. I would like to find the gene records containing 
> replicate UniGene IDs and merge them into one record by averaging 
> the expression values within each condition.
>
> Could anyone give me a clue about how to write the code? Or are there 
> any contributed functions that can do this?

If, after my last email, you still want to do this, look at ?aggregate.

# Set up an example: 20 unigene IDs, each repeated 5 times, with 5 array columns.
 > df <- data.frame(unigene=rep(letters[1:20],5), matrix(rnorm(500),ncol=5))
 > dim(df)
[1] 100   6
 > df[1:5,]
   unigene          X1         X2         X3         X4         X5
1       a  0.30812107 -0.5310621 -0.9040957  0.7344379 -0.3356904
2       b -0.02764356  0.6196045 -1.2049073  1.3074086  1.7878118
3       c  0.79936647 -0.3430772  1.3319157 -0.1716195  1.5824703
4       d -1.52298039  0.7400511  1.6654934 -0.4796782 -1.6517931
5       e  0.20252950  0.6735963 -0.8631246 -1.2338265  0.8597014
# Aggregate the array values by "unigene" using mean.
 > df.unigene <- aggregate(df[,2:6],by=list(df$unigene),mean)
 > df.unigene
    Group.1          X1         X2           X3          X4          X5
1        a  0.27894974  0.3096306 -0.157369445 -0.02390716 -0.79865210
2        b -0.04005511  0.2069963  0.058276319  0.37695956  0.58892920
3        c  0.53853115 -0.7227620  0.542803169  0.72844079  0.33116364
4        d  0.04374438 -0.3302130  1.492462908 -0.19048229 -0.90463987
5        e -0.22403553  0.5079245  0.627224848 -1.30206042 -0.16849414
6        f -0.41708465 -0.9070749  0.133871146 -0.21337473 -0.20061087
7        g -0.38204229  0.6069678  0.050874510 -0.29334777 -0.11172384
8        h  0.58768574 -0.4863774  0.120376561 -0.31349966 -0.23951493
9        i -0.80005434 -0.3891139 -0.001995542 -0.17148142  0.06971404
10       j -0.35626038  0.8415595 -0.207348416  0.03932772 -0.09372701
11       k -0.30889392 -1.0870044 -0.447545956 -0.48184160 -0.10491062
12       l -0.47169100 -0.1602827  1.084106985 -0.26736429  0.08239815
13       m -0.12285248 -0.4367895  0.354743839  0.10013901  0.42580119
14       n -0.17691859 -0.8934232  0.399016113  0.73876068  0.61432185
15       o -0.08250122  0.6402547  0.029047584 -0.30060666  0.36726071
16       p -0.20336659  0.2853576 -0.272979841 -0.57747797  0.24284977
17       q  0.00947679 -0.3849657 -0.198965209 -0.38048787 -0.87557376
18       r  0.30445158  0.4110414  0.181761757 -0.21715431  0.23009438
19       s -0.30325431 -0.1010338 -0.298426526 -1.23178516 -0.37827590
20       t -0.30316005 -0.4389324 -1.050242565  0.12818715 -0.31785596
 > dim(df.unigene)
[1] 20  6
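
For the layout described in the question (accession numbers in column 1, unigene IDs in column 2, conditions from column 3 on), the same idea might look roughly like the sketch below. The data frame `expr`, its column names, and the accession and unigene values are all made up for illustration; only `aggregate`, `duplicated` and `tapply` come from base R.

```r
# Hypothetical data in the poster's layout: accession, unigene, then conditions.
expr <- data.frame(
  accession = paste0("ACC", 1:6),   # made-up accession numbers
  unigene   = c("Hs.1", "Hs.2", "Hs.1", "Hs.3", "Hs.2", "Hs.1"),
  cond1     = c(1, 2, 3, 4, 6, 5),
  cond2     = c(10, 20, 30, 40, 60, 50),
  stringsAsFactors = FALSE
)

# Unigene IDs that occur in more than one row.
dup.ids <- unique(expr$unigene[duplicated(expr$unigene)])

# Average every condition column (everything except columns 1 and 2)
# within each unigene ID.
expr.mean <- aggregate(expr[, -(1:2)],
                       by = list(unigene = expr$unigene),
                       FUN = mean)

# Optionally record which accession numbers were merged into each row.
acc <- tapply(expr$accession, expr$unigene, paste, collapse = ";")
expr.mean$accessions <- as.vector(acc[as.character(expr.mean$unigene)])
```

Rows whose unigene ID appears only once pass through unchanged, since the mean of a single value is that value. If the expression columns contain missing values, passing `na.rm = TRUE` through to `mean` is probably what you want.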


