[BioC] Quickest way to convert IDs in a data frame?

James W. MacDonald jmacdon at uw.edu
Thu Jul 25 17:58:34 CEST 2013


Hi Enrico,

On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
> Hello,
>
> I often have data frames where I need to perform ID conversions on one or
> more of the columns while preserving the order of the rows, e.g.:
>
> GeneSymbol    Value1    Value2
> GS1    2.5    0.1
> GS2    3    0.2
> ..
>
> And I want to obtain:
>
> GeneSymbol    EntrezGeneID    Value1    Value2
> GS1    EG1    2.5    0.1
> GS2    EG2    3    0.2
> ..
>
> What I've done so far was to create a function that uses org.Hs.eg.db to
> loop over the rows of the column and does the conversion:
>
> library(org.Hs.eg.db)
> alias2EG<- function(x) {
> for (i in 1:length(x)) {
> if (!is.na(x[i])) {
> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
> if (!is.null(repl)) {
> x[i]<- repl
> }
> else {
> x[i]<- NA
> }
> }
> }
> return(x)
> }

I should first note that gene symbols are not unique, so you are taking 
a chance on your mappings. Is there no other annotation for your data?

In addition, it is almost always better to think of objects in R as 
vectors and matrices, rather than as things that need to be looped over 
(i.e., R isn't Perl or C).

first.two <- select(org.Hs.eg.db, as.character(df$GeneSymbol), 
"ENTREZID", "SYMBOL")

Note that there used to be a warning or an error (don't remember which) 
when you did something like this, stating that gene symbols are not 
unique, and that you shouldn't do this sort of thing. Apparently this 
warning has been removed, but the issue remains valid.
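
To get a feel for how many symbols are affected, something along these 
lines might help. This is only a sketch: the mapping table is a made-up 
stand-in for whatever select() returns, and the names map, ids.per.symbol 
and ambiguous are just for illustration.

```r
# Toy mapping table standing in for the output of select();
# the symbols and IDs are invented for this example.
map <- data.frame(
  SYMBOL   = c("GS1", "GS2", "GS2", "GS3"),
  ENTREZID = c("EG1", "EG2", "EG2b", "EG3"),
  stringsAsFactors = FALSE
)

# Count the distinct Entrez Gene IDs attached to each symbol.
ids.per.symbol <- vapply(
  split(map$ENTREZID, map$SYMBOL),
  function(x) length(unique(x)),
  integer(1)
)

# Symbols that map to more than one ID are the ambiguous ones.
ambiguous <- names(ids.per.symbol)[ids.per.symbol > 1]
ambiguous
```

Any symbol that shows up in ambiguous deserves a closer look before you 
trust its mapping.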

## check yourself

all.equal(as.character(df$GeneSymbol), first.two$SYMBOL)

## if true, proceed

df <- data.frame(first.two, df[,-1])
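
If the check does not come back TRUE (for instance because a symbol 
picked up more than one Entrez Gene ID, or the rows came back in a 
different order), one possible alternative is to look the IDs up with 
match(), which keeps the rows of df exactly where they are and takes the 
first hit for each symbol. A rough sketch, using toy data in place of your 
data frame and of the select() output (both invented here):

```r
# Toy input; GS9 has no mapping and GS2 appears twice.
df <- data.frame(
  GeneSymbol = c("GS2", "GS1", "GS9", "GS2"),
  Value1     = c(3, 2.5, 1, 4),
  Value2     = c(0.2, 0.1, 0.3, 0.4),
  stringsAsFactors = FALSE
)

# Toy mapping table; GS2 maps to two IDs on purpose.
map <- data.frame(
  SYMBOL   = c("GS1", "GS2", "GS2"),
  ENTREZID = c("EG1", "EG2", "EG2b"),
  stringsAsFactors = FALSE
)

# match() returns, for each symbol in df, the position of its FIRST
# occurrence in map$SYMBOL (NA when there is none), so the result has
# one entry per row of df, in the original row order.
idx <- match(as.character(df$GeneSymbol), map$SYMBOL)
df$EntrezGeneID <- map$ENTREZID[idx]

# Put the new column next to the symbols.
df <- df[, c("GeneSymbol", "EntrezGeneID", "Value1", "Value2")]
df
```

This mirrors the "take the first mapping" behaviour of the original loop, 
so the caveat about non-unique symbols applies here as well.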

Best,

Jim


>
> and then call the function like this:
>
> df$EntrezGeneID<- alias2EG(df$GeneSymbol)
>
> This works well, but gets very slow when I need to do multiple conversions
> on large datasets.
>
> Is there any way I can achieve the same result but in a quicker, more
> efficient way?
>
> Thank you.
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


