[BioC] Quickest way to convert IDs in a data frame?

James W. MacDonald jmacdon at uw.edu
Thu Jul 25 22:20:14 CEST 2013


Hi Enrico,

Please don't take things off-list (e.g., use reply-all).


On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
> Hi James,
>
> Thanks very much for your help.
> There is an issue that needs to be solved before thinking about what's
> the best approach in my opinion.
>
> I don't understand why, but the object created with the call to select
> (test in my example, first.two in yours) has a different number of
> rows from the original object (df in my example). Specifically it has
> *less* rows. If all symbols were converted to all possible Entrez IDs,
> I would expect it to have more rows, not less. To me, it looks like
> not all rows are looked up and returned.
>
> Do you see what I mean?

Sure. You could be using outdated gene symbols. Or perhaps you are using 
a mixture of symbols and aliases. Which is even cooler than just all 
symbols:

 > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10], Rkeys(org.Hs.egALIAS2EG)[31:45])
 > symb
  [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
  [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
[13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
[19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
[25] "ABCA1"
 > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
      SYMBOL ENTREZID
1      A1BG        1
2       A2M        2
3     A2MP1        3
4      NAT1        9
5      NAT2       10
6      AACP       11
7  SERPINA3       12
8     AADAC       13
9      AAMP       14
10    AANAT       15
11     AAMP       14
12    AANAT       15
13     DSPS <NA>
14     SNAT <NA>
15     AARS       16
16    CMT2N <NA>
17      AAV <NA>
18    AAVS1       17
19     ABAT       18
20  GABA-AT <NA>
21    GABAT <NA>
22   NPD009 <NA>
23    ABC-1 <NA>
24     ABC1 <NA>
25    ABCA1       19
 > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
       ALIAS ENTREZID
1      A1BG        1
2       A2M        2
3     A2MP1        3
4      NAT1        9
5      NAT1     1982
6      NAT1     6530
7      NAT1    10991
8      NAT2       10
9      NAT2    81539
10     AACP       11
11 SERPINA3       12
12    AADAC       13
13     AAMP       14
14    AANAT       15
15     DSPS       15
16     SNAT       15
17     AARS       16
18    CMT2N       16
19      AAV       17
20    AAVS1       17
21     ABAT       18
22  GABA-AT       18
23    GABAT       18
24   NPD009       18
25    ABC-1       19
26     ABC1       19
27     ABC1    63897
28    ABCA1       19
Warning message:
In .generateExtraRows(tab, keys, jointype) :
   'select' and duplicate query keys resulted in 1:many mapping between
keys and return rows
 > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
$`1982`
[1] "eukaryotic translation initiation factor 4 gamma, 2"

$`6530`
[1] "solute carrier family 6 (neurotransmitter transporter, 
noradrenalin), member 2"

$`10991`
[1] "solute carrier family 38, member 3"

Best,

Jim

>
> On 25 July 2013 18:17, James W. MacDonald<jmacdon at uw.edu>  wrote:
>> Hi Enrico,
>>
>>
>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>> Dear James,
>>>
>>> Thanks very much for your prompt reply.
>>> I knew the problem was the for loop and the select function is indeed
>>> a lot faster than that and works perfectly with toy data.
>>>
>>> However, this is what happens when I try to use it with real data:
>>>
>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS",
>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>> Warning message:
>>> In .generateExtraRows(tab, keys, jointype) :
>>>     'select' and duplicate query keys resulted in 1:many mapping between
>>> keys and return rows
>>>
>>> which is probably the warning you mentioned.
>>
>> That's not the warning I mentioned, but it does point out the same issue,
>> which is that there is a one to many mapping between symbol and entrez gene
>> ID.
>>
>> So now you have to decide if you want to be naive (or stupid, depending on
>> your perspective) or not. You could just cover your eyes and do this:
>>
>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>
>> which will choose for you the first symbol ->  gene ID mapping and nuke the
>> rest. That's nice and quick, but you are making huge assumptions.
>>
>> Or you could decide to be a bit more sophisticated and do something like
>>
>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x)
>> first.two[x,])
>>
>> At this point you can take a look at e.g., thelst[1:10] to see what we just
>> did
>>
>> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1], paste(x[,2],
>> collapse = "|")))
>>
>> and here you can look at head(thelst).
>>
>> Then you can check to ensure that the first column of thelst is identical to
>> the first column of df, and proceed as before.
>>
>> But there is still the problem of the multiple mappings. As an example:
>>
>>> thelst[1:5]
>> $HBD
>>       SYMBOL  ENTREZID
>> 2535    HBD      3045
>> 2536    HBD 100187828
>>
>> $KIR3DL3
>>         SYMBOL  ENTREZID
>> 17513 KIR3DL3    115653
>> 17514 KIR3DL3 100133046
>>
>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>> $`3045`
>> [1] "hemoglobin, delta"
>>
>> $`100187828`
>> [1] "hypophosphatemic bone disease"
>>
>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>> $`115653`
>> [1] "killer cell immunoglobulin-like receptor, three domains, long
>> cytoplasmic tail, 3"
>>
>> $`100133046`
>> [1] "killer cell immunoglobulin-like receptor three domains long cytoplasmic
>> tail 3"
>>
>>
>> So HBD is the gene symbol for two different genes! If this gene symbol is in
>> your data, you will now have attributed your data to two genes that
>> apparently are not remotely similar. if KIR3DL3 is in your data, then it
>> worked out OK for that gene.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>>
>>> The real problem is that the number of rows is now different for the 2
>>> objects:
>>>> nrow(df); nrow(test)
>>> [1] 573
>>> [1] 201
>>>
>>> So I obviously can't put the new data into the original df. My
>>> impression is that when the 1 to many mapping arises, the select
>>> functions exits, with that warning message. As a result, my test
>>> object is incomplete.
>>>
>>> On top of that, and I can't really explain this, the row positions are
>>> messed up, e.g.
>>>
>>>> all.equal(df[100,],test[100,])
>>> returns FALSE.
>>>
>>> How can I work around this?
>>>
>>> Thanks a  lot!
>>>
>>> Best,
>>>
>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at uw.edu>   wrote:
>>>> Hi Enrico,
>>>>
>>>>
>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>>>> Hello,
>>>>>
>>>>> I often have data frames where I need to perform ID conversions on one
>>>>> or
>>>>> more of the columns while preserving the order of the rows, e.g.:
>>>>>
>>>>> GeneSymbol    Value1    Value2
>>>>> GS1    2.5    0.1
>>>>> GS2    3    0.2
>>>>> ..
>>>>>
>>>>> And I want to obtain:
>>>>>
>>>>> GeneSymbol    EntrezGeneID    Value1    Value2
>>>>> GS1    EG1    2.5    0.1
>>>>> GS2    EG2    3    0.2
>>>>> ..
>>>>>
>>>>> What I've done so far was to create a function that uses org.Hs.eg.db to
>>>>> loop over the rows of the column and does the conversion:
>>>>>
>>>>> library(org.Hs.eg.db)
>>>>> alias2EG<- function(x) {
>>>>> for (i in 1:length(x)) {
>>>>> if (!is.na(x[i])) {
>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
>>>>> if (!is.null(repl)) {
>>>>> x[i]<- repl
>>>>> }
>>>>> else {
>>>>> x[i]<- NA
>>>>> }
>>>>> }
>>>>> }
>>>>> return(x)
>>>>> }
>>>>
>>>> I should first note that gene symbols are not unique, so you are taking a
>>>> chance on your mappings. Is there no other annotation for your data?
>>>>
>>>> In addition, you should note that it is almost always better to think of
>>>> objects as vectors and matrices in R, rather than as things that need to
>>>> be
>>>> looped over (e.g., R isn't Perl or C).
>>>>
>>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID",
>>>> "SYMBOL")
>>>>
>>>> Note that there used to be a warning or an error (don't remember which)
>>>> when
>>>> you did something like this, stating that gene symbols are not unique,
>>>> and
>>>> that you shouldn't do this sort of thing. Apparently this warning has
>>>> been
>>>> removed, but the issue remains valid.
>>>>
>>>> ## check yourself
>>>>
>>>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>>>
>>>> ## if true, proceed
>>>>
>>>> df<- data.frame(first.two, df[,-1])
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>>> and then call the function like this:
>>>>>
>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol)
>>>>>
>>>>> This works well, but gets very slow when I need to do multiple
>>>>> conversions
>>>>> on large datasets.
>>>>>
>>>>> Is there any way I can achieve the same result but in a quicker, more
>>>>> efficient way?
>>>>>
>>>>> Thank you.
>>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list