[BioC] Quickest way to convert IDs in a data frame?
Hervé Pagès
hpages at fhcrc.org
Thu Jul 25 23:36:07 CEST 2013
On 07/25/2013 02:06 PM, Enrico Ferrero wrote:
> Hi,
>
> Hervé, that's exactly what I'm trying to say.
>
> Attached to this email is a tab delimited file with two columns of
> GeneSymbols (or Aliases), and here is some simple code to reproduce
> the unexpected behaviour:
>
> library(org.Hs.eg.db)
> mydf <- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE)
> mytest <- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS",
> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
> # check that mytest has less rows than mydf
> nrow(mydf)
> nrow(mytest)
I see. You have a lot of NAs in the vector of keys you're passing
to select():
> table(is.na(mydf$GeneSymbol1))
FALSE TRUE
1018 64
As indicated by the warning I get (you should have gotten it too),
select() removes those NAs from the input:
> mytest <- select(org.Hs.eg.db, key=mydf$GeneSymbol1,
keytype="ALIAS", cols=c("SYMBOL","ENTREZID","ENSEMBL"))
Warning messages:
1: In .select(x, keys, cols, keytype, jointype = jointype) :
'NA' keys have been removed
2: In .generateExtraRows(tab, keys, jointype) :
'select' and duplicate query keys resulted in 1:many mapping
between keys and return rows
which explains why the output has less rows than the length of the
input. Trying this again with a vector of keys of length 3:
> select(org.Hs.eg.db, key=c("HTR1E", NA, "HTR1F"), keytype="ALIAS",
cols=c("SYMBOL","ENTREZID","ENSEMBL"))
ALIAS SYMBOL ENTREZID ENSEMBL
1 HTR1E HTR1E 3354 ENSG00000168830
2 HTR1F HTR1F 3355 ENSG00000179097
Warning message:
In .select(x, keys, cols, keytype, jointype = jointype) :
'NA' keys have been removed
> # pick a random row: they don't match
> mydf[250,]
> mytest[250,]
As a general rule, you cannot expect the rows of the output produced
by select() to match the vector of keys passed to it. Unless you know
your keys are mapped to at most 1 value but it's not the case here.
Preserving positions between the input and output of select() could have
been achieved by returning a DataFrame instead of a data.frame, and by
using list-like columns, but I think what drove the current design was
the desire to keep the returned object as simple as possible.
Cheers,
H.
>
> Ideally, mytest should have the same number and position of rows of
> mydf so that I can then cbind them.
> If mytest has more rows because of multiple mappings that's also fine:
> I can always use merge(mydf, mytest), right?
>
> Thanks a lot to both for your help, it's very appreciated.
> Best,
>
>
> On 25 July 2013 21:32, Hervé Pagès <hpages at fhcrc.org> wrote:
>> Hi Enrico,
>>
>>
>> On 07/25/2013 01:20 PM, James W. MacDonald wrote:
>>>
>>> Hi Enrico,
>>>
>>> Please don't take things off-list (e.g., use reply-all).
>>>
>>>
>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>>>>
>>>> Hi James,
>>>>
>>>> Thanks very much for your help.
>>>> There is an issue that needs to be solved before thinking about what's
>>>> the best approach in my opinion.
>>>>
>>>> I don't understand why, but the object created with the call to select
>>>> (test in my example, first.two in yours) has a different number of
>>>> rows from the original object (df in my example). Specifically it has
>>>> *less* rows.
>>
>>
>> I'm surprised it has less rows. It can definitely have more, when some
>> of the keys passed to select() are mapped to more than 1 row, but my
>> understanding was that select() would propagate unmapped keys to the
>> output by placing them in rows stuffed with NAs. So maybe I
>> misunderstood how select() works, or its behavior was changed, or
>> there is a bug somewhere. Could you please send the code that allows
>> us to reproduce this? Thanks.
>>
>> H.
>>
>>
>>> If all symbols were converted to all possible Entrez IDs,
>>>>
>>>> I would expect it to have more rows, not less. To me, it looks like
>>>> not all rows are looked up and returned.
>>>>
>>>> Do you see what I mean?
>>>
>>>
>>> Sure. You could be using outdated gene symbols. Or perhaps you are using
>>> a mixture of symbols and aliases. Which is even cooler than just all
>>> symbols:
>>>
>>> > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>> Rkeys(org.Hs.egALIAS2EG)[31:45])
>>> > symb
>>> [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "AACP"
>>> [7] "SERPINA3" "AADAC" "AAMP" "AANAT" "AAMP" "AANAT"
>>> [13] "DSPS" "SNAT" "AARS" "CMT2N" "AAV" "AAVS1"
>>> [19] "ABAT" "GABA-AT" "GABAT" "NPD009" "ABC-1" "ABC1"
>>> [25] "ABCA1"
>>> > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>>> SYMBOL ENTREZID
>>> 1 A1BG 1
>>> 2 A2M 2
>>> 3 A2MP1 3
>>> 4 NAT1 9
>>> 5 NAT2 10
>>> 6 AACP 11
>>> 7 SERPINA3 12
>>> 8 AADAC 13
>>> 9 AAMP 14
>>> 10 AANAT 15
>>> 11 AAMP 14
>>> 12 AANAT 15
>>> 13 DSPS <NA>
>>> 14 SNAT <NA>
>>> 15 AARS 16
>>> 16 CMT2N <NA>
>>> 17 AAV <NA>
>>> 18 AAVS1 17
>>> 19 ABAT 18
>>> 20 GABA-AT <NA>
>>> 21 GABAT <NA>
>>> 22 NPD009 <NA>
>>> 23 ABC-1 <NA>
>>> 24 ABC1 <NA>
>>> 25 ABCA1 19
>>> > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>> ALIAS ENTREZID
>>> 1 A1BG 1
>>> 2 A2M 2
>>> 3 A2MP1 3
>>> 4 NAT1 9
>>> 5 NAT1 1982
>>> 6 NAT1 6530
>>> 7 NAT1 10991
>>> 8 NAT2 10
>>> 9 NAT2 81539
>>> 10 AACP 11
>>> 11 SERPINA3 12
>>> 12 AADAC 13
>>> 13 AAMP 14
>>> 14 AANAT 15
>>> 15 DSPS 15
>>> 16 SNAT 15
>>> 17 AARS 16
>>> 18 CMT2N 16
>>> 19 AAV 17
>>> 20 AAVS1 17
>>> 21 ABAT 18
>>> 22 GABA-AT 18
>>> 23 GABAT 18
>>> 24 NPD009 18
>>> 25 ABC-1 19
>>> 26 ABC1 19
>>> 27 ABC1 63897
>>> 28 ABCA1 19
>>> Warning message:
>>> In .generateExtraRows(tab, keys, jointype) :
>>> 'select' and duplicate query keys resulted in 1:many mapping between
>>> keys and return rows
>>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>> $`1982`
>>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>>
>>> $`6530`
>>> [1] "solute carrier family 6 (neurotransmitter transporter,
>>> noradrenalin), member 2"
>>>
>>> $`10991`
>>> [1] "solute carrier family 38, member 3"
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>>
>>>> On 25 July 2013 18:17, James W. MacDonald<jmacdon at uw.edu> wrote:
>>>>>
>>>>> Hi Enrico,
>>>>>
>>>>>
>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>>
>>>>>> Dear James,
>>>>>>
>>>>>> Thanks very much for your prompt reply.
>>>>>> I knew the problem was the for loop and the select function is indeed
>>>>>> a lot faster than that and works perfectly with toy data.
>>>>>>
>>>>>> However, this is what happens when I try to use it with real data:
>>>>>>
>>>>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS",
>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>
>>>>>> Warning message:
>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>> 'select' and duplicate query keys resulted in 1:many mapping
>>>>>> between
>>>>>> keys and return rows
>>>>>>
>>>>>> which is probably the warning you mentioned.
>>>>>
>>>>>
>>>>> That's not the warning I mentioned, but it does point out the same
>>>>> issue,
>>>>> which is that there is a one to many mapping between symbol and
>>>>> entrez gene
>>>>> ID.
>>>>>
>>>>> So now you have to decide if you want to be naive (or stupid,
>>>>> depending on
>>>>> your perspective) or not. You could just cover your eyes and do this:
>>>>>
>>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>>>>
>>>>> which will choose for you the first symbol -> gene ID mapping and
>>>>> nuke the
>>>>> rest. That's nice and quick, but you are making huge assumptions.
>>>>>
>>>>> Or you could decide to be a bit more sophisticated and do something like
>>>>>
>>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x)
>>>>> first.two[x,])
>>>>>
>>>>> At this point you can take a look at e.g., thelst[1:10] to see what
>>>>> we just
>>>>> did
>>>>>
>>>>> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1],
>>>>> paste(x[,2],
>>>>> collapse = "|")))
>>>>>
>>>>> and here you can look at head(thelst).
>>>>>
>>>>> Then you can check to ensure that the first column of thelst is
>>>>> identical to
>>>>> the first column of df, and proceed as before.
>>>>>
>>>>> But there is still the problem of the multiple mappings. As an example:
>>>>>
>>>>>> thelst[1:5]
>>>>>
>>>>> $HBD
>>>>> SYMBOL ENTREZID
>>>>> 2535 HBD 3045
>>>>> 2536 HBD 100187828
>>>>>
>>>>> $KIR3DL3
>>>>> SYMBOL ENTREZID
>>>>> 17513 KIR3DL3 115653
>>>>> 17514 KIR3DL3 100133046
>>>>>
>>>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>>
>>>>> $`3045`
>>>>> [1] "hemoglobin, delta"
>>>>>
>>>>> $`100187828`
>>>>> [1] "hypophosphatemic bone disease"
>>>>>
>>>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>>
>>>>> $`115653`
>>>>> [1] "killer cell immunoglobulin-like receptor, three domains, long
>>>>> cytoplasmic tail, 3"
>>>>>
>>>>> $`100133046`
>>>>> [1] "killer cell immunoglobulin-like receptor three domains long
>>>>> cytoplasmic
>>>>> tail 3"
>>>>>
>>>>>
>>>>> So HBD is the gene symbol for two different genes! If this gene
>>>>> symbol is in
>>>>> your data, you will now have attributed your data to two genes that
>>>>> apparently are not remotely similar. if KIR3DL3 is in your data, then it
>>>>> worked out OK for that gene.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> The real problem is that the number of rows is now different for the 2
>>>>>> objects:
>>>>>>>
>>>>>>> nrow(df); nrow(test)
>>>>>>
>>>>>> [1] 573
>>>>>> [1] 201
>>>>>>
>>>>>> So I obviously can't put the new data into the original df. My
>>>>>> impression is that when the 1 to many mapping arises, the select
>>>>>> functions exits, with that warning message. As a result, my test
>>>>>> object is incomplete.
>>>>>>
>>>>>> On top of that, and I can't really explain this, the row positions are
>>>>>> messed up, e.g.
>>>>>>
>>>>>>> all.equal(df[100,],test[100,])
>>>>>>
>>>>>> returns FALSE.
>>>>>>
>>>>>> How can I work around this?
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at uw.edu> wrote:
>>>>>>>
>>>>>>> Hi Enrico,
>>>>>>>
>>>>>>>
>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I often have data frames where I need to perform ID conversions on
>>>>>>>> one
>>>>>>>> or
>>>>>>>> more of the columns while preserving the order of the rows, e.g.:
>>>>>>>>
>>>>>>>> GeneSymbol Value1 Value2
>>>>>>>> GS1 2.5 0.1
>>>>>>>> GS2 3 0.2
>>>>>>>> ..
>>>>>>>>
>>>>>>>> And I want to obtain:
>>>>>>>>
>>>>>>>> GeneSymbol EntrezGeneID Value1 Value2
>>>>>>>> GS1 EG1 2.5 0.1
>>>>>>>> GS2 EG2 3 0.2
>>>>>>>> ..
>>>>>>>>
>>>>>>>> What I've done so far was to create a function that uses
>>>>>>>> org.Hs.eg.db to
>>>>>>>> loop over the rows of the column and does the conversion:
>>>>>>>>
>>>>>>>> library(org.Hs.eg.db)
>>>>>>>> alias2EG<- function(x) {
>>>>>>>> for (i in 1:length(x)) {
>>>>>>>> if (!is.na(x[i])) {
>>>>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
>>>>>>>> if (!is.null(repl)) {
>>>>>>>> x[i]<- repl
>>>>>>>> }
>>>>>>>> else {
>>>>>>>> x[i]<- NA
>>>>>>>> }
>>>>>>>> }
>>>>>>>> }
>>>>>>>> return(x)
>>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> I should first note that gene symbols are not unique, so you are
>>>>>>> taking a
>>>>>>> chance on your mappings. Is there no other annotation for your data?
>>>>>>>
>>>>>>> In addition, you should note that it is almost always better to
>>>>>>> think of
>>>>>>> objects as vectors and matrices in R, rather than as things that
>>>>>>> need to
>>>>>>> be
>>>>>>> looped over (e.g., R isn't Perl or C).
>>>>>>>
>>>>>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol),
>>>>>>> "ENTREZID",
>>>>>>> "SYMBOL")
>>>>>>>
>>>>>>> Note that there used to be a warning or an error (don't remember
>>>>>>> which)
>>>>>>> when
>>>>>>> you did something like this, stating that gene symbols are not unique,
>>>>>>> and
>>>>>>> that you shouldn't do this sort of thing. Apparently this warning has
>>>>>>> been
>>>>>>> removed, but the issue remains valid.
>>>>>>>
>>>>>>> ## check yourself
>>>>>>>
>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>>>>>>
>>>>>>> ## if true, proceed
>>>>>>>
>>>>>>> df<- data.frame(first.two, df[,-1])
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> and then call the function like this:
>>>>>>>>
>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol)
>>>>>>>>
>>>>>>>> This works well, but gets very slow when I need to do multiple
>>>>>>>> conversions
>>>>>>>> on large datasets.
>>>>>>>>
>>>>>>>> Is there any way I can achieve the same result but in a quicker, more
>>>>>>>> efficient way?
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>> --
>>>>>>> James W. MacDonald, M.S.
>>>>>>> Biostatistician
>>>>>>> University of Washington
>>>>>>> Environmental and Occupational Health Sciences
>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>> Seattle WA 98105-6099
>>>>>>>
>>>>>>
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list