[BioC] Quickest way to convert IDs in a data frame?

Hervé Pagès hpages at fhcrc.org
Fri Jul 26 00:18:10 CEST 2013


Hi James,

You're right.

It's actually both: NAs *and* duplicated keys that are mapped to
more than 1 row are removed from the input. I don't think this
is documented.

I wonder if select() behavior couldn't be a little bit simpler by
either preserving or removing all duplicated keys, and not just some
of them (on a somewhat arbitrary criterion).
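
In the meantime, one way to sidestep the issue is to look up each distinct key once and then expand the result back onto the original key vector with match(). The following is only a sketch under stated assumptions: `map` is a hand-made stand-in for what select() would return on the unique, non-NA keys (IDs taken from the examples quoted below), and the names `map`, `keys`, `first_hit` and `res` are hypothetical.

```r
# Toy stand-in for the result of select() on unique, non-NA keys
# ('map' is a hypothetical name; IDs are those quoted in this thread).
map <- data.frame(ALIAS    = c("ADORA2A", "AGT", "AGT"),
                  ENTREZID = c("135", "183", "189"),
                  stringsAsFactors = FALSE)

# Input keys with duplicates, an NA, and a symbol that has no mapping
keys <- c("AGT", "ADORA2A", NA, "AGT", "NOSUCHGENE", "ADORA2A")

# match() gives the position of the first hit for each key (NA if none),
# so the result has exactly one row per input key, in the input order.
first_hit <- map[match(keys, map$ALIAS), ]
res <- data.frame(ALIAS = keys, ENTREZID = first_hit$ENTREZID,
                  stringsAsFactors = FALSE)
res
```

Note that this keeps only the first ID when a key maps to several, so it carries the same caveat Jim raises further down about picking one mapping arbitrarily.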

Thanks,
H.


On 07/25/2013 02:57 PM, James W. MacDonald wrote:
> Hi Enrico and Herve,
>
> This has to do with duplicate entries, but only when the duplicate entry
> maps to many ENTREZID:
>
>  > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
>      ALIAS ENTREZID
> 1 ADORA2A      135
> 2 ADORA2A      135
> 3 ADORA2A      135
> 4 ADORA2A      135
>
>  > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>    ALIAS ENTREZID
> 1   AGT      183
> 2   AGT      189
> Warning message:
> In .generateExtraRows(tab, keys, jointype) :
>    'select' and duplicate query keys resulted in 1:many mapping between
> keys and return rows
>
>  > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS")
>    ALIAS ENTREZID
> 1   AGT      183
> 2   AGT      189
> Warning message:
> In .generateExtraRows(tab, keys, jointype) :
>    'select' resulted in 1:many mapping between keys and return rows
>
>
> So in the instances where a gene symbol maps to more than one ENTREZID,
> the output gets truncated, whereas if it is a one-to-one mapping, it
> does not.
>
> Best,
>
> Jim
>
>
>
>
> On 7/25/2013 5:06 PM, Enrico Ferrero wrote:
>> Hi,
>>
>> Hervé, that's exactly what I'm trying to say.
>>
>> Attached to this email is a tab delimited file with two columns of
>> GeneSymbols (or Aliases), and here is some simple code to reproduce
>> the unexpected behaviour:
>>
>> library(org.Hs.eg.db)
>> mydf<- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE)
>> mytest<- select(org.Hs.eg.db, keys=mydf$GeneSymbol1, keytype="ALIAS",
>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>> # check that mytest has fewer rows than mydf
>> nrow(mydf)
>> nrow(mytest)
>> # pick a random row: they don't match
>> mydf[250,]
>> mytest[250,]
>>
>> Ideally, mytest should have the same number of rows as mydf, in the same
>> order, so that I can then cbind them.
>> If mytest has more rows because of multiple mappings that's also fine:
>> I can always use merge(mydf, mytest), right?
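
merge() does cope with the one-to-many case, but by default it sorts the result on the key column and drops keys that have no match, so the original row order has to be restored explicitly. A minimal sketch with toy data, where `mytest` is a hand-made stand-in for the select() output and the values in `Value1` and the helper column `row_id` are assumptions made for illustration:

```r
# Toy input mimicking the structure described above (values are made up)
mydf <- data.frame(GeneSymbol1 = c("AGT", "ADORA2A", "NOSUCHGENE", "AGT"),
                   Value1      = c(2.5, 3.0, 1.2, 0.7),
                   stringsAsFactors = FALSE)

# Toy stand-in for the select() output on the unique keys
mytest <- data.frame(ALIAS    = c("ADORA2A", "AGT", "AGT"),
                     ENTREZID = c("135", "183", "189"),
                     stringsAsFactors = FALSE)

mydf$row_id <- seq_len(nrow(mydf))          # remember the original order
merged <- merge(mydf, mytest, by.x = "GeneSymbol1", by.y = "ALIAS",
                all.x = TRUE)               # all.x keeps unmapped symbols as NA
merged <- merged[order(merged$row_id), ]    # restore the original order
merged
```

Each input row appears once per mapped ID, so the result can be longer than mydf, and unmapped symbols survive with NA in the ID column.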
>>
>> Thanks a lot to both of you for your help, it's much appreciated.
>> Best,
>>
>>
>> On 25 July 2013 21:32, Hervé Pagès<hpages at fhcrc.org>  wrote:
>>> Hi Enrico,
>>>
>>>
>>> On 07/25/2013 01:20 PM, James W. MacDonald wrote:
>>>> Hi Enrico,
>>>>
>>>> Please don't take things off-list (e.g., use reply-all).
>>>>
>>>>
>>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>>>>> Hi James,
>>>>>
>>>>> Thanks very much for your help.
>>>>> There is an issue that needs to be solved before thinking about what's
>>>>> the best approach in my opinion.
>>>>>
>>>>> I don't understand why, but the object created with the call to select
>>>>> (test in my example, first.two in yours) has a different number of
>>>>> rows from the original object (df in my example). Specifically it has
>>>>> *less* rows.
>>>
>>> I'm surprised it has less rows. It can definitely have more, when some
>>> of the keys passed to select() are mapped to more than 1 row, but my
>>> understanding was that select() would propagate unmapped keys to the
>>> output by placing them in rows stuffed with NAs. So maybe I
>>> misunderstood how select() works, or its behavior was changed, or
>>> there is a bug somewhere. Could you please send the code that allows
>>> us to reproduce this? Thanks.
>>>
>>> H.
>>>
>>>
>>>>> If all symbols were converted to all possible Entrez IDs,
>>>>> I would expect it to have more rows, not less. To me, it looks like
>>>>> not all rows are looked up and returned.
>>>>>
>>>>> Do you see what I mean?
>>>>
>>>> Sure. You could be using outdated gene symbols. Or perhaps you are using
>>>> a mixture of symbols and aliases. Which is even cooler than just all
>>>> symbols:
>>>>
>>>>   >  symb<- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>>> Rkeys(org.Hs.egALIAS2EG)[31:45])
>>>>   >  symb
>>>>    [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
>>>>    [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
>>>> [13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
>>>> [19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
>>>> [25] "ABCA1"
>>>>   >  select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>>>>        SYMBOL ENTREZID
>>>> 1      A1BG        1
>>>> 2       A2M        2
>>>> 3     A2MP1        3
>>>> 4      NAT1        9
>>>> 5      NAT2       10
>>>> 6      AACP       11
>>>> 7  SERPINA3       12
>>>> 8     AADAC       13
>>>> 9      AAMP       14
>>>> 10    AANAT       15
>>>> 11     AAMP       14
>>>> 12    AANAT       15
>>>> 13     DSPS     <NA>
>>>> 14     SNAT     <NA>
>>>> 15     AARS       16
>>>> 16    CMT2N     <NA>
>>>> 17      AAV     <NA>
>>>> 18    AAVS1       17
>>>> 19     ABAT       18
>>>> 20  GABA-AT     <NA>
>>>> 21    GABAT     <NA>
>>>> 22   NPD009     <NA>
>>>> 23    ABC-1     <NA>
>>>> 24     ABC1     <NA>
>>>> 25    ABCA1       19
>>>>   >  select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>>>         ALIAS ENTREZID
>>>> 1      A1BG        1
>>>> 2       A2M        2
>>>> 3     A2MP1        3
>>>> 4      NAT1        9
>>>> 5      NAT1     1982
>>>> 6      NAT1     6530
>>>> 7      NAT1    10991
>>>> 8      NAT2       10
>>>> 9      NAT2    81539
>>>> 10     AACP       11
>>>> 11 SERPINA3       12
>>>> 12    AADAC       13
>>>> 13     AAMP       14
>>>> 14    AANAT       15
>>>> 15     DSPS       15
>>>> 16     SNAT       15
>>>> 17     AARS       16
>>>> 18    CMT2N       16
>>>> 19      AAV       17
>>>> 20    AAVS1       17
>>>> 21     ABAT       18
>>>> 22  GABA-AT       18
>>>> 23    GABAT       18
>>>> 24   NPD009       18
>>>> 25    ABC-1       19
>>>> 26     ABC1       19
>>>> 27     ABC1    63897
>>>> 28    ABCA1       19
>>>> Warning message:
>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>     'select' and duplicate query keys resulted in 1:many mapping between
>>>> keys and return rows
>>>>   >  mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>>> $`1982`
>>>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>>>
>>>> $`6530`
>>>> [1] "solute carrier family 6 (neurotransmitter transporter, noradrenalin), member 2"
>>>>
>>>> $`10991`
>>>> [1] "solute carrier family 38, member 3"
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>> On 25 July 2013 18:17, James W. MacDonald<jmacdon at uw.edu>   wrote:
>>>>>> Hi Enrico,
>>>>>>
>>>>>>
>>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>>> Dear James,
>>>>>>>
>>>>>>> Thanks very much for your prompt reply.
>>>>>>> I knew the problem was the for loop and the select function is indeed
>>>>>>> a lot faster than that and works perfectly with toy data.
>>>>>>>
>>>>>>> However, this is what happens when I try to use it with real data:
>>>>>>>
>>>>>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS",
>>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>> Warning message:
>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>      'select' and duplicate query keys resulted in 1:many mapping between
>>>>>>> keys and return rows
>>>>>>>
>>>>>>> which is probably the warning you mentioned.
>>>>>>
>>>>>> That's not the warning I mentioned, but it does point out the same issue,
>>>>>> which is that there is a one-to-many mapping between symbol and Entrez
>>>>>> Gene ID.
>>>>>>
>>>>>> So now you have to decide if you want to be naive (or stupid, depending on
>>>>>> your perspective) or not. You could just cover your eyes and do this:
>>>>>>
>>>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>>>>>
>>>>>> which will choose for you the first symbol -> gene ID mapping and nuke the
>>>>>> rest. That's nice and quick, but you are making huge assumptions.
>>>>>>
>>>>>> Or you could decide to be a bit more sophisticated and do something like
>>>>>>
>>>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x)
>>>>>> first.two[x,])
>>>>>>
>>>>>> At this point you can take a look at, e.g., thelst[1:10] to see what we
>>>>>> just did.
>>>>>>
>>>>>> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1],
>>>>>> paste(x[,2], collapse = "|"))))
>>>>>>
>>>>>> and here you can look at head(thelst).
>>>>>>
>>>>>> Then you can check to ensure that the first column of thelst is identical
>>>>>> to the first column of df, and proceed as before.
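
The collapsing step above can also be sketched more compactly by applying tapply() to the ID column alone and aligning the result to the original symbols by name. This is an illustration on toy data, not the code used in this thread: `first.two` here is a hand-made stand-in for the select() result (IDs as quoted in this message), and `collapsed`, the `UNKNOWN` symbol and the `EntrezGeneID` column are assumptions.

```r
# Toy stand-in for the select() result (IDs as quoted in this thread)
first.two <- data.frame(
  SYMBOL   = c("HBD", "HBD", "KIR3DL3", "KIR3DL3", "A1BG"),
  ENTREZID = c("3045", "100187828", "115653", "100133046", "1"),
  stringsAsFactors = FALSE)

# One string per symbol, with multiple IDs joined by "|"
collapsed <- tapply(first.two$ENTREZID, first.two$SYMBOL, paste, collapse = "|")

# Align with the original data frame by name; unmapped symbols become NA
df <- data.frame(GeneSymbol = c("A1BG", "HBD", "UNKNOWN", "KIR3DL3"),
                 stringsAsFactors = FALSE)
df$EntrezGeneID <- as.vector(collapsed[match(df$GeneSymbol, names(collapsed))])
df
```

Because the lookup goes through match(), df keeps its row count and order regardless of how many IDs each symbol maps to.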
>>>>>>
>>>>>> But there is still the problem of the multiple mappings. As an example:
>>>>>>
>>>>>>> thelst[1:5]
>>>>>> $HBD
>>>>>>        SYMBOL  ENTREZID
>>>>>> 2535    HBD      3045
>>>>>> 2536    HBD 100187828
>>>>>>
>>>>>> $KIR3DL3
>>>>>>          SYMBOL  ENTREZID
>>>>>> 17513 KIR3DL3    115653
>>>>>> 17514 KIR3DL3 100133046
>>>>>>
>>>>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>>> $`3045`
>>>>>> [1] "hemoglobin, delta"
>>>>>>
>>>>>> $`100187828`
>>>>>> [1] "hypophosphatemic bone disease"
>>>>>>
>>>>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>>> $`115653`
>>>>>> [1] "killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 3"
>>>>>>
>>>>>> $`100133046`
>>>>>> [1] "killer cell immunoglobulin-like receptor three domains long cytoplasmic tail 3"
>>>>>>
>>>>>>
>>>>>> So HBD is the gene symbol for two different genes! If this gene symbol is in
>>>>>> your data, you will now have attributed your data to two genes that
>>>>>> apparently are not remotely similar. If KIR3DL3 is in your data, then it
>>>>>> worked out OK for that gene.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> The real problem is that the number of rows is now different for the 2
>>>>>>> objects:
>>>>>>>> nrow(df); nrow(test)
>>>>>>> [1] 573
>>>>>>> [1] 201
>>>>>>>
>>>>>>> So I obviously can't put the new data into the original df. My
>>>>>>> impression is that when the 1 to many mapping arises, the select
>>>>>>> functions exits, with that warning message. As a result, my test
>>>>>>> object is incomplete.
>>>>>>>
>>>>>>> On top of that, and I can't really explain this, the row positions are
>>>>>>> messed up, e.g.
>>>>>>>
>>>>>>>> all.equal(df[100,],test[100,])
>>>>>>> returns FALSE.
>>>>>>>
>>>>>>> How can I work around this?
>>>>>>>
>>>>>>> Thanks a  lot!
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at uw.edu>    wrote:
>>>>>>>> Hi Enrico,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I often have data frames where I need to perform ID conversions on one or
>>>>>>>>> more of the columns while preserving the order of the rows, e.g.:
>>>>>>>>>
>>>>>>>>> GeneSymbol    Value1    Value2
>>>>>>>>> GS1    2.5    0.1
>>>>>>>>> GS2    3    0.2
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>> And I want to obtain:
>>>>>>>>>
>>>>>>>>> GeneSymbol    EntrezGeneID    Value1    Value2
>>>>>>>>> GS1    EG1    2.5    0.1
>>>>>>>>> GS2    EG2    3    0.2
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>> What I've done so far was to create a function that uses org.Hs.eg.db to
>>>>>>>>> loop over the rows of the column and does the conversion:
>>>>>>>>>
>>>>>>>>> library(org.Hs.eg.db)
>>>>>>>>> alias2EG<- function(x) {
>>>>>>>>>   for (i in seq_along(x)) {
>>>>>>>>>     if (!is.na(x[i])) {
>>>>>>>>>       repl<- org.Hs.egALIAS2EG[[x[i]]][1]
>>>>>>>>>       if (!is.null(repl)) {
>>>>>>>>>         x[i]<- repl
>>>>>>>>>       }
>>>>>>>>>       else {
>>>>>>>>>         x[i]<- NA
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>   return(x)
>>>>>>>>> }
>>>>>>>>
>>>>>>>> I should first note that gene symbols are not unique, so you are taking a
>>>>>>>> chance on your mappings. Is there no other annotation for your data?
>>>>>>>>
>>>>>>>> In addition, you should note that it is almost always better to think of
>>>>>>>> objects as vectors and matrices in R, rather than as things that need to be
>>>>>>>> looped over (e.g., R isn't Perl or C).
>>>>>>>>
>>>>>>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID",
>>>>>>>> "SYMBOL")
>>>>>>>>
>>>>>>>> Note that there used to be a warning or an error (don't remember which) when
>>>>>>>> you did something like this, stating that gene symbols are not unique, and
>>>>>>>> that you shouldn't do this sort of thing. Apparently this warning has been
>>>>>>>> removed, but the issue remains valid.
>>>>>>>>
>>>>>>>> ## check yourself
>>>>>>>>
>>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>>>>>>>
>>>>>>>> ## if true, proceed
>>>>>>>>
>>>>>>>> df<- data.frame(first.two, df[,-1])
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jim
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> and then call the function like this:
>>>>>>>>>
>>>>>>>>> df$EntrezGeneID<- alias2EG(df$GeneSymbol)
>>>>>>>>>
>>>>>>>>> This works well, but gets very slow when I need to do multiple conversions
>>>>>>>>> on large datasets.
>>>>>>>>>
>>>>>>>>> Is there any way I can achieve the same result but in a quicker, more
>>>>>>>>> efficient way?
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>> --
>>>>>>>> James W. MacDonald, M.S.
>>>>>>>> Biostatistician
>>>>>>>> University of Washington
>>>>>>>> Environmental and Occupational Health Sciences
>>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>>> Seattle WA 98105-6099
>>>>>>>>
>>>>>>
>>>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


