[BioC] Quickest way to convert IDs in a data frame?

Thu Jul 25 23:57:35 CEST 2013

Hi Enrico and Herve,

This has to do with duplicate entries, but only when the duplicate entry 
maps to many ENTREZID:

 > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
     ALIAS ENTREZID
1 ADORA2A      135
2 ADORA2A      135
3 ADORA2A      135
4 ADORA2A      135

 > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
   ALIAS ENTREZID
1   AGT      183
2   AGT      189
Warning message:
In .generateExtraRows(tab, keys, jointype) :
   'select' and duplicate query keys resulted in 1:many mapping between
keys and return rows

 > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS")
   ALIAS ENTREZID
1   AGT      183
2   AGT      189
Warning message:
In .generateExtraRows(tab, keys, jointype) :
   'select' resulted in 1:many mapping between keys and return rows

So in the instances where a gene symbol maps to more than one ENTREZID, 
the output gets truncated, whereas if it is a one-to-one mapping, it 
does not.

Best,

Jim

On 7/25/2013 5:06 PM, Enrico Ferrero wrote:
> Hi,
>
> Hervé, that's exactly what I'm trying to say.
>
> Attached to this email is a tab delimited file with two columns of
> GeneSymbols (or Aliases), and here is some simple code to reproduce
> the unexpected behaviour:
>
> library(org.Hs.eg.db)
> mydf<- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE)
> mytest<- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS",
> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
> # check that mytest has less rows than mydf
> nrow(mydf)
> nrow(mytest)
> # pick a random row: they don't match
> mydf[250,]
> mytest[250,]
>
> Ideally, mytest should have the same number and position of rows of
> mydf so that I can then cbind them.
> If mytest has more rows because of multiple mappings that's also fine:
> I can always use merge(mydf, mytest), right?
>
> Thanks a lot to both for your help, it's very appreciated.
> Best,
>
>
> On 25 July 2013 21:32, Hervé Pagès<hpages at fhcrc.org>  wrote:
>> Hi Enrico,
>>
>>
>> On 07/25/2013 01:20 PM, James W. MacDonald wrote:
>>> Hi Enrico,
>>>
>>> Please don't take things off-list (e.g., use reply-all).
>>>
>>>
>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>>>> Hi James,
>>>>
>>>> Thanks very much for your help.
>>>> There is an issue that needs to be solved before thinking about what's
>>>> the best approach in my opinion.
>>>>
>>>> I don't understand why, but the object created with the call to select
>>>> (test in my example, first.two in yours) has a different number of
>>>> rows from the original object (df in my example). Specifically it has
>>>> *less* rows.
>>
>> I'm surprised it has less rows. It can definitely have more, when some
>> of the keys passed to select() are mapped to more than 1 row, but my
>> understanding was that select() would propagate unmapped keys to the
>> output by placing them in rows stuffed with NAs. So maybe I
>> misunderstood how select() works, or its behavior was changed, or
>> there is a bug somewhere. Could you please send the code that allows
>> us to reproduce this? Thanks.
>>
>> H.
>>
>>
>>> If all symbols were converted to all possible Entrez IDs,
>>>> I would expect it to have more rows, not less. To me, it looks like
>>>> not all rows are looked up and returned.
>>>>
>>>> Do you see what I mean?
>>>
>>> Sure. You could be using outdated gene symbols. Or perhaps you are using
>>> a mixture of symbols and aliases. Which is even cooler than just all
>>> symbols:
>>>
>>>   >  symb<- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>> Rkeys(org.Hs.egALIAS2EG)[31:45])
>>>   >  symb
>>>    [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
>>>    [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
>>> [13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
>>> [19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
>>> [25] "ABCA1"
>>>   >  select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>>>        SYMBOL ENTREZID
>>> 1      A1BG        1
>>> 2       A2M        2
>>> 3     A2MP1        3
>>> 4      NAT1        9
>>> 5      NAT2       10
>>> 6      AACP       11
>>> 7  SERPINA3       12
>>> 8     AADAC       13
>>> 9      AAMP       14
>>> 10    AANAT       15
>>> 11     AAMP       14
>>> 12    AANAT       15
>>> 13     DSPS<NA>
>>> 14     SNAT<NA>
>>> 15     AARS       16
>>> 16    CMT2N<NA>
>>> 17      AAV<NA>
>>> 18    AAVS1       17
>>> 19     ABAT       18
>>> 20  GABA-AT<NA>
>>> 21    GABAT<NA>
>>> 22   NPD009<NA>
>>> 23    ABC-1<NA>
>>> 24     ABC1<NA>
>>> 25    ABCA1       19
>>>   >  select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>>         ALIAS ENTREZID
>>> 1      A1BG        1
>>> 2       A2M        2
>>> 3     A2MP1        3
>>> 4      NAT1        9
>>> 5      NAT1     1982
>>> 6      NAT1     6530
>>> 7      NAT1    10991
>>> 8      NAT2       10
>>> 9      NAT2    81539
>>> 10     AACP       11
>>> 11 SERPINA3       12
>>> 12    AADAC       13
>>> 13     AAMP       14
>>> 14    AANAT       15
>>> 15     DSPS       15
>>> 16     SNAT       15
>>> 17     AARS       16
>>> 18    CMT2N       16
>>> 19      AAV       17
>>> 20    AAVS1       17
>>> 21     ABAT       18
>>> 22  GABA-AT       18
>>> 23    GABAT       18
>>> 24   NPD009       18
>>> 25    ABC-1       19
>>> 26     ABC1       19
>>> 27     ABC1    63897
>>> 28    ABCA1       19
>>> Warning message:
>>> In .generateExtraRows(tab, keys, jointype) :
>>>     'select' and duplicate query keys resulted in 1:many mapping between
>>> keys and return rows
>>>   >  mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>> $`1982`
>>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>>
>>> $`6530`
>>> [1] "solute carrier family 6 (neurotransmitter transporter,
>>> noradrenalin), member 2"
>>>
>>> $`10991`
>>> [1] "solute carrier family 38, member 3"
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>> On 25 July 2013 18:17, James W. MacDonald<jmacdon at uw.edu>   wrote:
>>>>> Hi Enrico,
>>>>>
>>>>>
>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>> Dear James,
>>>>>>
>>>>>> Thanks very much for your prompt reply.
>>>>>> I knew the problem was the for loop and the select function is indeed
>>>>>> a lot faster than that and works perfectly with toy data.
>>>>>>
>>>>>> However, this is what happens when I try to use it with real data:
>>>>>>
>>>>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS",
>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>> Warning message:
>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>      'select' and duplicate query keys resulted in 1:many mapping
>>>>>> between
>>>>>> keys and return rows
>>>>>>
>>>>>> which is probably the warning you mentioned.
>>>>>
>>>>> That's not the warning I mentioned, but it does point out the same
>>>>> issue,
>>>>> which is that there is a one to many mapping between symbol and
>>>>> entrez gene
>>>>> ID.
>>>>>
>>>>> So now you have to decide if you want to be naive (or stupid,
>>>>> depending on
>>>>> your perspective) or not. You could just cover your eyes and do this:
>>>>>
>>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>>>>
>>>>> which will choose for you the first symbol ->   gene ID mapping and
>>>>> nuke the
>>>>> rest. That's nice and quick, but you are making huge assumptions.
>>>>>
>>>>> Or you could decide to be a bit more sophisticated and do something like
>>>>>
>>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x)
>>>>> first.two[x,])
>>>>>
>>>>> At this point you can take a look at e.g., thelst[1:10] to see what
>>>>> we just
>>>>> did
>>>>>
>>>>> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1],
>>>>> paste(x[,2],
>>>>> collapse = "|")))
>>>>>
>>>>> and here you can look at head(thelst).
>>>>>
>>>>> Then you can check to ensure that the first column of thelst is
>>>>> identical to
>>>>> the first column of df, and proceed as before.
>>>>>
>>>>> But there is still the problem of the multiple mappings. As an example:
>>>>>
>>>>>> thelst[1:5]
>>>>> $HBD
>>>>>        SYMBOL  ENTREZID
>>>>> 2535    HBD      3045
>>>>> 2536    HBD 100187828
>>>>>
>>>>> $KIR3DL3
>>>>>          SYMBOL  ENTREZID
>>>>> 17513 KIR3DL3    115653
>>>>> 17514 KIR3DL3 100133046
>>>>>
>>>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>> $`3045`
>>>>> [1] "hemoglobin, delta"
>>>>>
>>>>> $`100187828`
>>>>> [1] "hypophosphatemic bone disease"
>>>>>
>>>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>> $`115653`
>>>>> [1] "killer cell immunoglobulin-like receptor, three domains, long
>>>>> cytoplasmic tail, 3"
>>>>>
>>>>> $`100133046`
>>>>> [1] "killer cell immunoglobulin-like receptor three domains long
>>>>> cytoplasmic
>>>>> tail 3"
>>>>>
>>>>>
>>>>> So HBD is the gene symbol for two different genes! If this gene
>>>>> symbol is in
>>>>> your data, you will now have attributed your data to two genes that
>>>>> apparently are not remotely similar. if KIR3DL3 is in your data, then it
>>>>> worked out OK for that gene.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> The real problem is that the number of rows is now different for the 2
>>>>>> objects:
>>>>>>> nrow(df); nrow(test)
>>>>>> [1] 573
>>>>>> [1] 201
>>>>>>
>>>>>> So I obviously can't put the new data into the original df. My
>>>>>> impression is that when the 1 to many mapping arises, the select
>>>>>> functions exits, with that warning message. As a result, my test
>>>>>> object is incomplete.
>>>>>>
>>>>>> On top of that, and I can't really explain this, the row positions are
>>>>>> messed up, e.g.
>>>>>>
>>>>>>> all.equal(df[100,],test[100,])
>>>>>> returns FALSE.
>>>>>>
>>>>>> How can I work around this?
>>>>>>
>>>>>> Thanks a  lot!
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at uw.edu>    wrote:
>>>>>>> Hi Enrico,
>>>>>>>
>>>>>>>
>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I often have data frames where I need to perform ID conversions on
>>>>>>>> one
>>>>>>>> or
>>>>>>>> more of the columns while preserving the order of the rows, e.g.:
>>>>>>>>
>>>>>>>> GeneSymbol    Value1    Value2
>>>>>>>> GS1    2.5    0.1
>>>>>>>> GS2    3    0.2
>>>>>>>> ..
>>>>>>>>
>>>>>>>> And I want to obtain:
>>>>>>>>
>>>>>>>> GeneSymbol    EntrezGeneID    Value1    Value2
>>>>>>>> GS1    EG1    2.5    0.1
>>>>>>>> GS2    EG2    3    0.2
>>>>>>>> ..
>>>>>>>>
>>>>>>>> What I've done so far was to create a function that uses
>>>>>>>> org.Hs.eg.db to
>>>>>>>> loop over the rows of the column and does the conversion:
>>>>>>>>
>>>>>>>> library(org.Hs.eg.db)
>>>>>>>> alias2EG<- function(x) {
>>>>>>>> for (i in 1:length(x)) {
>>>>>>>> if (!is.na(x[i])) {
>>>>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
>>>>>>>> if (!is.null(repl)) {
>>>>>>>> x[i]<- repl
>>>>>>>> }
>>>>>>>> else {
>>>>>>>> x[i]<- NA
>>>>>>>> }
>>>>>>>> }
>>>>>>>> }
>>>>>>>> return(x)
>>>>>>>> }
>>>>>>>
>>>>>>> I should first note that gene symbols are not unique, so you are
>>>>>>> taking a
>>>>>>> chance on your mappings. Is there no other annotation for your data?
>>>>>>>
>>>>>>> In addition, you should note that it is almost always better to
>>>>>>> think of
>>>>>>> objects as vectors and matrices in R, rather than as things that
>>>>>>> need to
>>>>>>> be
>>>>>>> looped over (e.g., R isn't Perl or C).
>>>>>>>
>>>>>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol),
>>>>>>> "ENTREZID",
>>>>>>> "SYMBOL")
>>>>>>>
>>>>>>> Note that there used to be a warning or an error (don't remember
>>>>>>> which)
>>>>>>> when
>>>>>>> you did something like this, stating that gene symbols are not unique,
>>>>>>> and
>>>>>>> that you shouldn't do this sort of thing. Apparently this warning has
>>>>>>> been
>>>>>>> removed, but the issue remains valid.
>>>>>>>
>>>>>>> ## check yourself
>>>>>>>
>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>>>>>>
>>>>>>> ## if true, proceed
>>>>>>>
>>>>>>> df<- data.frame(first.two, df[,-1])
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> and then call the function like this:
>>>>>>>>
>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol)
>>>>>>>>
>>>>>>>> This works well, but gets very slow when I need to do multiple
>>>>>>>> conversions
>>>>>>>> on large datasets.
>>>>>>>>
>>>>>>>> Is there any way I can achieve the same result but in a quicker, more
>>>>>>>> efficient way?
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>> --
>>>>>>> James W. MacDonald, M.S.
>>>>>>> Biostatistician
>>>>>>> University of Washington
>>>>>>> Environmental and Occupational Health Sciences
>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>> Seattle WA 98105-6099
>>>>>>>
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>>>
>>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099