[BioC] Quickest way to convert IDs in a data frame?
James W. MacDonald
jmacdon at uw.edu
Thu Jul 25 17:58:34 CEST 2013
Hi Enrico,
On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
> Hello,
>
> I often have data frames where I need to perform ID conversions on one or
> more of the columns while preserving the order of the rows, e.g.:
>
> GeneSymbol Value1 Value2
> GS1 2.5 0.1
> GS2 3 0.2
> ..
>
> And I want to obtain:
>
> GeneSymbol EntrezGeneID Value1 Value2
> GS1 EG1 2.5 0.1
> GS2 EG2 3 0.2
> ..
>
> What I've done so far was to create a function that uses org.Hs.eg.db to
> loop over the rows of the column and does the conversion:
>
> library(org.Hs.eg.db)
> alias2EG<- function(x) {
> for (i in 1:length(x)) {
> if (!is.na(x[i])) {
> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
> if (!is.null(repl)) {
> x[i]<- repl
> }
> else {
> x[i]<- NA
> }
> }
> }
> return(x)
> }
I should first note that gene symbols are not unique, so you are taking
a chance on your mappings. Is there no other annotation for your data?
In addition, note that in R it is almost always better to treat objects
as vectors and matrices than as things that need to be looped over
(i.e., R isn't Perl or C). The whole conversion can be done in a single
call to select():
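library(org.Hs.eg.db)
## map all symbols to Entrez Gene IDs in one vectorized call
first.two <- select(org.Hs.eg.db, keys = as.character(df$GeneSymbol),
columns = "ENTREZID", keytype = "SYMBOL")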
Note that there used to be a warning or an error (don't remember which)
when you did something like this, stating that gene symbols are not
unique, and that you shouldn't do this sort of thing. Apparently this
warning has been removed, but the issue remains valid.
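One way to see how much this matters for a given data set is to count
the symbols that come back with more than one Entrez Gene ID. A minimal
sketch, using a made-up table called map in place of real select()
output (the symbols and IDs are invented for illustration only):

```r
## Toy stand-in for the data frame returned by select(); the values are
## hypothetical, not real annotation.
map <- data.frame(SYMBOL   = c("GS1", "GS2", "GS2", "GS3"),
                  ENTREZID = c("EG1", "EG2", "EG7", NA),
                  stringsAsFactors = FALSE)

## Number of distinct Entrez Gene IDs per symbol (NAs ignored)
n.ids <- tapply(map$ENTREZID, map$SYMBOL,
                function(ids) length(unique(ids[!is.na(ids)])))

ambiguous <- names(n.ids)[n.ids > 1]   # symbols with several IDs
unmapped  <- names(n.ids)[n.ids == 0]  # symbols with no ID at all
```

Anything listed in ambiguous is a symbol where picking one ID is a
guess, so those rows probably deserve a look by hand.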
## check yourself that the rows still line up one-to-one
isTRUE(all.equal(as.character(df$GeneSymbol), first.two$SYMBOL))
## if TRUE, proceed
df <- data.frame(first.two, df[, -1, drop = FALSE])
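If the check fails because some symbols return several IDs, match() can
keep the original row order and take the first hit per symbol, which
mirrors the [1] in the original loop. A sketch with toy data; df and map
below are hypothetical stand-ins for the real data frame and for the
select() output:

```r
## Hypothetical input data and lookup table (invented values)
df <- data.frame(GeneSymbol = c("GS2", "GS1", "GS9"),
                 Value1     = c(3, 2.5, 1),
                 Value2     = c(0.2, 0.1, 0.3),
                 stringsAsFactors = FALSE)
map <- data.frame(SYMBOL   = c("GS1", "GS2", "GS2"),
                  ENTREZID = c("EG1", "EG2", "EG7"),
                  stringsAsFactors = FALSE)

## match() gives the position of the first hit for each symbol,
## or NA when the symbol is not in the table
idx <- match(df$GeneSymbol, map$SYMBOL)
df$EntrezGeneID <- map$ENTREZID[idx]

## put the new column right after GeneSymbol
df <- df[, c("GeneSymbol", "EntrezGeneID", "Value1", "Value2")]
```

The rows of df never move, so no re-sorting is needed afterwards, and
symbols without a mapping simply get NA.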
Best,
Jim
>
> and then call the function like this:
>
> df$EntrezGeneID<- alias2EG(df$GeneSymbol)
>
> This works well, but gets very slow when I need to do multiple conversions
> on large datasets.
>
> Is there any way I can achieve the same result but in a quicker, more
> efficient way?
>
> Thank you.
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099