[BioC] finding and deleting repeated observations
Ochsner, Scott A
sochsner at bcm.tmc.edu
Tue Jun 1 17:50:59 CEST 2010
Hi Mervi,
One solution is to sort your data frame by "pvalue" with order() and then drop the repeated "GeneSymbol" rows with !duplicated().
> A<-c(12,2,4,15,11,9)
> B<-c(44,32,55,25,27,18)
> pvalue<-c(.01,.05,.2,.005,.002,.0001)
> GeneSymbol<-c(rep("ABC1",2),"AB",rep("ABCD1",3))
> tmp<-as.data.frame(cbind(A,B,pvalue))
> tmp<-cbind(GeneSymbol,tmp)
> tmp
  GeneSymbol  A  B pvalue
1       ABC1 12 44  1e-02
2       ABC1  2 32  5e-02
3         AB  4 55  2e-01
4      ABCD1 15 25  5e-03
5      ABCD1 11 27  2e-03
6      ABCD1  9 18  1e-04
## reorder your dataframe by pvalue
> tmp.ordered <- tmp[order(tmp$pvalue),]
> tmp.ordered
  GeneSymbol  A  B pvalue
6      ABCD1  9 18  1e-04
5      ABCD1 11 27  2e-03
4      ABCD1 15 25  5e-03
1       ABC1 12 44  1e-02
2       ABC1  2 32  5e-02
3         AB  4 55  2e-01
## select the first instance of a gene symbol and remove all others. Because you have ordered by pvalues you will automatically select the gene symbol with the lowest pvalue.
> tmp.sub<- tmp.ordered[!duplicated(tmp.ordered$GeneSymbol),]
> tmp.sub
  GeneSymbol  A  B pvalue
6      ABCD1  9 18  1e-04
1       ABC1 12 44  1e-02
3         AB  4 55  2e-01
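## restore the original row order using the rownames. Rownames are character strings, so convert them to numbers before ordering; otherwise "10" would sort before "2".
> tmp.sub<-tmp.sub[order(as.numeric(rownames(tmp.sub))),]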
> tmp.sub
  GeneSymbol  A  B pvalue
1       ABC1 12 44  1e-02
3         AB  4 55  2e-01
6      ABCD1  9 18  1e-04
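If you need this more than once, the three steps above can be folded into one small function. This is only a sketch: the name keep_lowest and its arguments key and score are made up for illustration, and it assumes the score column is numeric with no NA values. It works on row positions rather than rownames, so it does not depend on the rownames being numeric.

```r
# Hypothetical helper (not from any package): for each value in column `key`,
# keep the single row with the smallest value in column `score`.
keep_lowest <- function(df, key = "GeneSymbol", score = "pvalue") {
  ord  <- order(df[[score]])                # row positions, smallest score first
  keep <- ord[!duplicated(df[[key]][ord])]  # first (lowest-score) position per key
  df[sort(keep), , drop = FALSE]            # back to the original row order
}

# Same example data as above
tmp <- data.frame(
  GeneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(.01, .05, .2, .005, .002, .0001)
)

result <- keep_lowest(tmp)
print(result)
```

If two rows of the same gene share the lowest pvalue, the one that appears first in the input is kept, because order() leaves ties in their original order.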
Scott
Scott A. Ochsner, PhD
One Baylor Plaza BCM130, Houston, TX 77030
Voice: (713) 798-6227 Fax: (713) 790-1275
-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of mervi.alanne at wri.fi
Sent: Friday, May 28, 2010 12:27 PM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] finding and deleting repeated observations
Dear all,
I'm a novice with R and could use some help. How could I find repeated observations based on one column and select the one to keep based on another column?
In more detail, this is the thing I want to achieve:
- data.frame has 4 columns: GeneSymbol, A, B, pvalue
- data in column GeneSymbol may be repeated 1-6 times
- data also contains unique observations
- Of the repeated obs, keep the obs which has the lowest pvalue
- Do not discard data from cols A and B
Example input data:
GeneSymbol A B pvalue
ABC1 12 44 0.01
ABC1 2 32 0.05
AB 4 55 0.2
ABCD1 15 25 0.005
ABCD1 11 27 0.002
ABCD1 9 18 0.0001
I'd like the output to look like this:
GeneSymbol A B pvalue
ABC1 12 44 0.01
AB 4 55 0.2
ABCD1 9 18 0.0001
Any suggestions?
-Mervi
Wihuri Research Institute
_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor