[BioC] Quickest way to convert IDs in a data frame?

Thu Jul 25 18:56:43 CEST 2013

Dear James,

Thanks very much for your prompt reply.
I knew the problem was the for loop and the select function is indeed
a lot faster than that and works perfectly with toy data.

However, this is what happens when I try to use it with real data:

> test <- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS", cols=c("SYMBOL","ENTREZID","ENSEMBL"))
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' and duplicate query keys resulted in 1:many mapping between
keys and return rows

which is probably the warning you mentioned.

The real problem is that the number of rows is now different for the 2 objects:
> nrow(df); nrow(test)
[1] 573
[1] 201

So I obviously can't put the new data into the original df. My
impression is that when the 1 to many mapping arises, the select
function exits with that warning message. As a result, my test
object is incomplete.
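
To make the row-count mismatch concrete, here is a minimal base-R sketch on
toy data. It is only an illustration under assumptions: the `map` data frame
and its IDs are made up and merely shaped like what select() might return,
and keeping the first hit per alias is an arbitrary choice, not a
recommendation from the package.

```r
# Toy stand-in for the original data frame (values are made up)
df <- data.frame(GeneSymbol = c("GS1", "GS2", "GS1", "GS3"),
                 Value1 = c(2.5, 3, 1.2, 0.7),
                 stringsAsFactors = FALSE)

# Hypothetical mapping table shaped like a select() result:
# GS1 maps to two IDs (1:many) and GS3 has no mapping at all
map <- data.frame(ALIAS = c("GS1", "GS1", "GS2"),
                  ENTREZID = c("EG1", "EG9", "EG2"),
                  stringsAsFactors = FALSE)

# Keep only the first mapping per alias (an arbitrary tie-break)
map1 <- map[!duplicated(map$ALIAS), ]

# match() returns, for each row of df, the position of its key in map1,
# so the result has one entry per row of df, in df's original order
df$EntrezGeneID <- map1$ENTREZID[match(df$GeneSymbol, map1$ALIAS)]
```

Aligning with match() in this way keeps nrow(df) unchanged: repeated symbols
get the same ID and unmapped symbols get NA.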

On top of that, and I can't really explain this, the row positions are
messed up, e.g.

> all.equal(df[100,],test[100,])
returns FALSE.

How can I work around this?

Thanks a lot!

Best,

On 25 July 2013 16:58, James W. MacDonald <jmacdon at uw.edu> wrote:
> Hi Enrico,
>
>
> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>
>> Hello,
>>
>> I often have data frames where I need to perform ID conversions on one or
>> more of the columns while preserving the order of the rows, e.g.:
>>
>> GeneSymbol    Value1    Value2
>> GS1    2.5    0.1
>> GS2    3    0.2
>> ..
>>
>> And I want to obtain:
>>
>> GeneSymbol    EntrezGeneID    Value1    Value2
>> GS1    EG1    2.5    0.1
>> GS2    EG2    3    0.2
>> ..
>>
>> What I've done so far was to create a function that uses org.Hs.eg.db to
>> loop over the rows of the column and does the conversion:
>>
>> library(org.Hs.eg.db)
>> alias2EG<- function(x) {
>> for (i in seq_along(x)) {
>> if (!is.na(x[i])) {
>> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
>> if (!is.null(repl)) {
>> x[i]<- repl
>> }
>> else {
>> x[i]<- NA
>> }
>> }
>> }
>> return(x)
>> }
>
>
> I should first note that gene symbols are not unique, so you are taking a
> chance on your mappings. Is there no other annotation for your data?
>
> In addition, you should note that it is almost always better to think of
> objects as vectors and matrices in R, rather than as things that need to be
> looped over (e.g., R isn't Perl or C).
>
> first.two <- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID",
> "SYMBOL")
>
> Note that there used to be a warning or an error (don't remember which) when
> you did something like this, stating that gene symbols are not unique, and
> that you shouldn't do this sort of thing. Apparently this warning has been
> removed, but the issue remains valid.
>
> ## check yourself
>
> all.equal(df$GeneSymbol, first.two$SYMBOL)
>
> ## if true, proceed
>
> df <- data.frame(first.two, df[,-1])
>
> Best,
>
> Jim
>
>
>
>>
>> and then call the function like this:
>>
>> df$EntrezGeneID<- alias2EG(df$GeneSymbol)
>>
>> This works well, but gets very slow when I need to do multiple conversions
>> on large datasets.
>>
>> Is there any way I can achieve the same result but in a quicker, more
>> efficient way?
>>
>> Thank you.
>>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>

-- 
Enrico Ferrero
PhD Student
Department of Genetics
Cambridge Systems Biology Centre
University of Cambridge