[BioC] Quickest way to convert IDs in a data frame?
Hervé Pagès
hpages at fhcrc.org
Tue Jul 30 00:14:09 CEST 2013
Hi Marc,
On 07/29/2013 02:29 PM, Marc Carlson wrote:
> Well when I initially implemented it, that is exactly what it did.
>
> But then others said that they had many lists of IDs and that they
> wanted to have any repeated keys respected as a valid input, so that the
> output would match up with what was initially asked for. And I said to
> these people that while I could do that, it would ONLY really work in
> cases where everything asked for mapped 1:1 with the initial set of keys
> coming in.
>
> So this special 1st case is just here for convenience. If you call
> select() and it does not throw any warnings, then you know that you can
> just cbind() the results onto your starting keys.
>
> But if you see that warning it means that your data has a different
> shape than your keys started as.
It's good to have the warning for interactive use. Thanks!
It's not so convenient when select() needs to be integrated in robust
code e.g. in a pipeline. Since the developer of the pipeline cannot
assume that the data.frame will have the same shape as the keys, then
s/he needs to be able to handle the general case, or, if s/he doesn't
want to do that, s/he needs to raise an error in case select() returns
a 1:many mapping. Having the warning doesn't make it easy/natural to
detect this situation programmatically though.
That brings me back to my original proposal of having an extra arg
e.g. 'if.multiple.mapping' that would support "all", "arbitrary", "none"
and now "error". Only "all" and "error" could be supported as a first
step. I think that would already be very useful. Just throwing some
ideas here. I know you're already very busy and I don't mean to put
more things in your plate.
Thanks!
H.
>
>
> Marc
>
>
> On 07/29/2013 11:32 AM, Enrico Ferrero wrote:
>> Hi Marc,
>>
>> Thanks for taking the time to explain this thoroughly, it is now clear
>> and my original problem can be considered solved.
>>
>> For the sake of discussion, I'd just like to add something (which
>> doesn't necessarily need a follow up): at least to me, select() seems
>> to have a rather arbitrary behaviour. I don't completely see the point
>> in returning 4 rows in the 1st scenario then. Why not just return 1
>> row and have a consistent behaviour all around (remove all NAs and all
>> duplicates in the input)?
>>
>> Best,
>>
>> On 29 July 2013 19:00, Marc Carlson <mcarlson at fhcrc.org> wrote:
>>> On 07/27/2013 06:40 AM, Enrico Ferrero wrote:
>>>> Hi everybody,
>>>>
>>>> Marc, thanks for clarifying things. The behaviour of the select()
>>>> function is absolutely sensible. Maybe it should be made explicit
>>>> somewhere in the documentation that, when working with data frames,
>>>> the user is expected to use the merge() function in conjunction with
>>>> it. I also agree with Hervé that having options to tweak and customize
>>>> the output would be an extremely positive thing and a step in the
>>>> right direction. In addition to a "select" argument, one can also
>>>> think of a "remove.na.rows" that evaluates to either TRUE or FALSE.
>>>> But then again, using merge() after select() already deals with these
>>>> issues quite well.
>>>>
>>>> What I think should be investigated more closely at the moment is the
>>>> unexpected behaviour select() exhibits when one SYMBOL or ALIAS (and
>>>> potentially other types of ID, I don't know) maps to more than one
>>>> ENTREZID. As exemplified by James' code below, this causes the output
>>>> to be truncated, and I highly doubt this is intentional:
>>>>
>>>>> select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
>>>> ALIAS ENTREZID
>>>> 1 ADORA2A 135
>>>> 2 ADORA2A 135
>>>> 3 ADORA2A 135
>>>> 4 ADORA2A 135
>>>>
>>>>> select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>>>> ALIAS ENTREZID
>>>> 1 AGT 183
>>>> 2 AGT 189
>>>>
>>>> Warning message:
>>>> In .generateExtraRows(tab, keys, jointype) :
>>>> 'select' and duplicate query keys resulted in 1:many mapping
>>>> between
>>>> keys and return rows
>>>>
>>>>
>>>> It would be great to have your views on this.
>>>> Best,
>>>
>>> Hi Enrico,
>>>
>>> My view on this is the same one that I presented above. Basically
>>> you seem
>>> to have misunderstood what select is doing in this case. So clearly I
>>> need
>>> to explain things a bit better in the documentation. But what is
>>> happening
>>> is in fact completely intentional, and it happens every time there is at
>>> least one "many to one" relationship requested by select. The
>>> presence of a
>>> many to one relationship means that select no longer has any chance of
>>> giving you a data.frame back that has the same height as the length
>>> of your
>>> keys. So instead of attempting to keep your repeated keys and
>>> matching them
>>> perfectly (which is no longer possible), select assumes that you know
>>> what
>>> you are doing and instead it just simplifies the result by removing all
>>> duplicated rows from the result. This is why your result appears
>>> "truncated". It's because there really was no point in keeping the
>>> initial
>>> pattern from the keys you passed in (as the data shape makes it
>>> impossible
>>> to do this anyways).
>>>
>>> The result you get in your 2nd case is actually the same exact
>>> information
>>> content as if we had tried to duplicate rows to match your repeated
>>> input.
>>> The only actual difference here is that there is no way for select to
>>> know
>>> how you intended to repeat the symbol "AGT" to match the two entrez
>>> gene IDs
>>> to the initial four "AGT" symbols that you passed in. For this
>>> example, did
>>> you want AGT repeated 4 times (with two repeats each of the two
>>> entrez gene
>>> IDs)? Or did you maybe want it repeated 8 times (with 4 repeats of each
>>> entrez gene ID)? And what should we have done if you had repeated the
>>> symbol "AGT" 5 times in the input instead? How are we supposed to
>>> format
>>> the output in that case? I hope you can see why in this case we have to
>>> just give you the data as it is. In this circumstance we just can't
>>> guess
>>> anymore about how you want it presented. So instead of guessing we just
>>> return all the data "as is" and give you a warning. So it's not
>>> actually
>>> true that the 2nd case you presented is "truncated". It's actually true
>>> instead that the 1st case data has just been repeated in an effort to
>>> make
>>> your life easier. But when the data is complicated by many to one
>>> relationships, we just can't know anymore what you will want to do for
>>> formatting it.
>>>
>>> We have tried to be very accommodating with select for people who
>>> request
>>> simple 1:1 relationships because we recognize that this is a common
>>> use case
>>> and we can see a straightforward way to make things easier for that
>>> common
>>> use case. But select is not really meant to be a data formatting
>>> function.
>>> It's really intended to be a data retrieval function. R already has
>>> a lot
>>> of great functions for data formatting already (like merge and the
>>> subset
>>> operators etc.), and these are already more flexible and better
>>> suited for
>>> tasks like that.
>>>
>>>
>>>
>>> Marc
>>>
>>>
>>>
>>>
>>>
>>>> On 26 July 2013 21:46, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>> Hi Marc,
>>>>>
>>>>> On 07/26/2013 12:57 PM, Marc Carlson wrote:
>>>>> ...
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> Sorry that I saw this thread so late. Basically, select() does
>>>>>> *try* to
>>>>>> keep your initial keys and map them each to an equivalent number of
>>>>>> unique values. We did actually anticipate that people would
>>>>>> *want* to
>>>>>> cbind() their results.
>>>>>>
>>>>>> But as you discovered there are many circumstances where the data
>>>>>> make
>>>>>> this kind of behavior impossible.
>>>>>>
>>>>>> So passing in NAs as keys for example can't ever find anything
>>>>>> meaningful. Those will simply have to be removed before we can
>>>>>> proceed. And, it is also impossible to maintain a 1:1 mapping if you
>>>>>> retrieve fields that have many to one relationships with your initial
>>>>>> keys (also seen here).
>>>>>>
>>>>>> For convenience, when this kind of 1:1 output is already
>>>>>> impossible (as
>>>>>> it is for most of your examples), select will also try to simplify
>>>>>> the
>>>>>> output by removing rows that are identical all the way across etc..
>>>>>>
>>>>>> My aim was that select should try to do the most reasonable thing
>>>>>> possible based on the data we have in each case. The rationale is
>>>>>> that
>>>>>> in the case where there are 1:many mappings, you should not be
>>>>>> planning
>>>>>> to bind those directly onto any other data.frames anyways (as this
>>>>>> circumstance would require you to call merge() instead). So in that
>>>>>> case, non-destructive simplification seems beneficial.
>>>>>
>>>>> Other tools in our infrastructure use an extra argument to pick-up 1
>>>>> thing in case of multiple mapping e.g. findOverlaps() has the 'select'
>>>>> argument with possible values "all", "first", "last", and "arbitrary".
>>>>> Also nearest() and family have this argument and it accepts similar
>>>>> values.
>>>>>
>>>>> Couldn't select() use a similar approach? The default should be "all"
>>>>> so the current behavior is preserved but if it's something else then
>>>>> the returned data.frame should align with the input.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> H.
>>>>>
>>>>>
>>>>>> I hope this clarifies things,
>>>>>>
>>>>>>
>>>>>> Marc
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> As I
>>>>>>>> mentioned in my first post, the for loop function works, but it's
>>>>>>>> highly inefficient.
>>>>>>>>
>>>>>>>> Any help is greatly appreciated, thank you.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 25 July 2013 23:18, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>>>>> Hi James,
>>>>>>>>>
>>>>>>>>> You're right.
>>>>>>>>>
>>>>>>>>> It's actually both: NAs *and* duplicated keys that are mapped to
>>>>>>>>> more than 1 row are removed from the input. I don't think this
>>>>>>>>> is documented.
>>>>>>>>>
>>>>>>>>> I wonder if select() behavior couldn't be a little bit simpler by
>>>>>>>>> either preserving or removing all duplicated keys, and not just
>>>>>>>>> some
>>>>>>>>> of them (on a somewhat arbitrary criteria).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> H.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 07/25/2013 02:57 PM, James W. MacDonald wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Enrico and Herve,
>>>>>>>>>>
>>>>>>>>>> This has to do with duplicate entries, but only when the
>>>>>>>>>> duplicate
>>>>>>>>>> entry
>>>>>>>>>> maps to many ENTREZID:
>>>>>>>>>>
>>>>>>>>>> > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID",
>>>>>>>>>> "ALIAS")
>>>>>>>>>> ALIAS ENTREZID
>>>>>>>>>> 1 ADORA2A 135
>>>>>>>>>> 2 ADORA2A 135
>>>>>>>>>> 3 ADORA2A 135
>>>>>>>>>> 4 ADORA2A 135
>>>>>>>>>>
>>>>>>>>>> > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>>>>>>>>>> ALIAS ENTREZID
>>>>>>>>>> 1 AGT 183
>>>>>>>>>> 2 AGT 189
>>>>>>>>>> Warning message:
>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>> 'select' and duplicate query keys resulted in 1:many
>>>>>>>>>> mapping
>>>>>>>>>> between
>>>>>>>>>> keys and return rows
>>>>>>>>>>
>>>>>>>>>> > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS")
>>>>>>>>>> ALIAS ENTREZID
>>>>>>>>>> 1 AGT 183
>>>>>>>>>> 2 AGT 189
>>>>>>>>>> Warning message:
>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>> 'select' resulted in 1:many mapping between keys and return
>>>>>>>>>> rows
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So in the instances where a gene symbol maps to more than one
>>>>>>>>>> ENTREZID,
>>>>>>>>>> the output gets truncated, whereas if it is a one-to-one
>>>>>>>>>> mapping, it
>>>>>>>>>> does not.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Jim
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/25/2013 5:06 PM, Enrico Ferrero wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Hervé, that's exactly what I'm trying to say.
>>>>>>>>>>>
>>>>>>>>>>> Attached to this email is a tab delimited file with two
>>>>>>>>>>> columns of
>>>>>>>>>>> GeneSymbols (or Aliases), and here is some simple code to
>>>>>>>>>>> reproduce
>>>>>>>>>>> the unexpected behaviour:
>>>>>>>>>>>
>>>>>>>>>>> library(org.Hs.eg.db)
>>>>>>>>>>> mydf<- read.table("testdata.txt", sep="\t", header=TRUE,
>>>>>>>>>>> as.is=TRUE)
>>>>>>>>>>> mytest<- select(org.Hs.eg.db, key=mydf$GeneSymbol1,
>>>>>>>>>>> keytype="ALIAS",
>>>>>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>>>>>> # check that mytest has less rows than mydf
>>>>>>>>>>> nrow(mydf)
>>>>>>>>>>> nrow(mytest)
>>>>>>>>>>> # pick a random row: they don't match
>>>>>>>>>>> mydf[250,]
>>>>>>>>>>> mytest[250,]
>>>>>>>>>>>
>>>>>>>>>>> Ideally, mytest should have the same number and position of
>>>>>>>>>>> rows of
>>>>>>>>>>> mydf so that I can then cbind them.
>>>>>>>>>>> If mytest has more rows because of multiple mappings that's also
>>>>>>>>>>> fine:
>>>>>>>>>>> I can always use merge(mydf, mytest), right?
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot to both for your help, it's very appreciated.
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 25 July 2013 21:32, Hervé Pagès<hpages at fhcrc.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 07/25/2013 01:20 PM, James W. MacDonald wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please don't take things off-list (e.g., use reply-all).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi James,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks very much for your help.
>>>>>>>>>>>>>> There is an issue that needs to be solved before thinking
>>>>>>>>>>>>>> about
>>>>>>>>>>>>>> what's
>>>>>>>>>>>>>> the best approach in my opinion.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't understand why, but the object created with the
>>>>>>>>>>>>>> call to
>>>>>>>>>>>>>> select
>>>>>>>>>>>>>> (test in my example, first.two in yours) has a different
>>>>>>>>>>>>>> number
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>> rows from the original object (df in my example).
>>>>>>>>>>>>>> Specifically
>>>>>>>>>>>>>> it has
>>>>>>>>>>>>>> *less* rows.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'm surprised it has less rows. It can definitely have more,
>>>>>>>>>>>> when
>>>>>>>>>>>> some
>>>>>>>>>>>> of the keys passed to select() are mapped to more than 1
>>>>>>>>>>>> row, but
>>>>>>>>>>>> my
>>>>>>>>>>>> understanding was that select() would propagate unmapped
>>>>>>>>>>>> keys to
>>>>>>>>>>>> the
>>>>>>>>>>>> output by placing them in rows stuffed with NAs. So maybe I
>>>>>>>>>>>> misunderstood how select() works, or its behavior was
>>>>>>>>>>>> changed, or
>>>>>>>>>>>> there is a bug somewhere. Could you please send the code that
>>>>>>>>>>>> allows
>>>>>>>>>>>> us to reproduce this? Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> H.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> If all symbols were converted to all possible Entrez IDs,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would expect it to have more rows, not less. To me, it
>>>>>>>>>>>>>> looks
>>>>>>>>>>>>>> like
>>>>>>>>>>>>>> not all rows are looked up and returned.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you see what I mean?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sure. You could be using outdated gene symbols. Or perhaps you
>>>>>>>>>>>>> are
>>>>>>>>>>>>> using
>>>>>>>>>>>>> a mixture of symbols and aliases. Which is even cooler than
>>>>>>>>>>>>> just
>>>>>>>>>>>>> all
>>>>>>>>>>>>> symbols:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > symb<- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>>>>>>>>>>>> Rkeys(org.Hs.egALIAS2EG)[31:45])
>>>>>>>>>>>>> > symb
>>>>>>>>>>>>> [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2"
>>>>>>>>>>>>> "AACP"
>>>>>>>>>>>>> [7] "SERPINA3" "AADAC" "AAMP" "AANAT" "AAMP"
>>>>>>>>>>>>> "AANAT"
>>>>>>>>>>>>> [13] "DSPS" "SNAT" "AARS" "CMT2N" "AAV"
>>>>>>>>>>>>> "AAVS1"
>>>>>>>>>>>>> [19] "ABAT" "GABA-AT" "GABAT" "NPD009" "ABC-1"
>>>>>>>>>>>>> "ABC1"
>>>>>>>>>>>>> [25] "ABCA1"
>>>>>>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>>>>>>>>>>>>> SYMBOL ENTREZID
>>>>>>>>>>>>> 1 A1BG 1
>>>>>>>>>>>>> 2 A2M 2
>>>>>>>>>>>>> 3 A2MP1 3
>>>>>>>>>>>>> 4 NAT1 9
>>>>>>>>>>>>> 5 NAT2 10
>>>>>>>>>>>>> 6 AACP 11
>>>>>>>>>>>>> 7 SERPINA3 12
>>>>>>>>>>>>> 8 AADAC 13
>>>>>>>>>>>>> 9 AAMP 14
>>>>>>>>>>>>> 10 AANAT 15
>>>>>>>>>>>>> 11 AAMP 14
>>>>>>>>>>>>> 12 AANAT 15
>>>>>>>>>>>>> 13 DSPS<NA>
>>>>>>>>>>>>> 14 SNAT<NA>
>>>>>>>>>>>>> 15 AARS 16
>>>>>>>>>>>>> 16 CMT2N<NA>
>>>>>>>>>>>>> 17 AAV<NA>
>>>>>>>>>>>>> 18 AAVS1 17
>>>>>>>>>>>>> 19 ABAT 18
>>>>>>>>>>>>> 20 GABA-AT<NA>
>>>>>>>>>>>>> 21 GABAT<NA>
>>>>>>>>>>>>> 22 NPD009<NA>
>>>>>>>>>>>>> 23 ABC-1<NA>
>>>>>>>>>>>>> 24 ABC1<NA>
>>>>>>>>>>>>> 25 ABCA1 19
>>>>>>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>>>>>>>>>>>> ALIAS ENTREZID
>>>>>>>>>>>>> 1 A1BG 1
>>>>>>>>>>>>> 2 A2M 2
>>>>>>>>>>>>> 3 A2MP1 3
>>>>>>>>>>>>> 4 NAT1 9
>>>>>>>>>>>>> 5 NAT1 1982
>>>>>>>>>>>>> 6 NAT1 6530
>>>>>>>>>>>>> 7 NAT1 10991
>>>>>>>>>>>>> 8 NAT2 10
>>>>>>>>>>>>> 9 NAT2 81539
>>>>>>>>>>>>> 10 AACP 11
>>>>>>>>>>>>> 11 SERPINA3 12
>>>>>>>>>>>>> 12 AADAC 13
>>>>>>>>>>>>> 13 AAMP 14
>>>>>>>>>>>>> 14 AANAT 15
>>>>>>>>>>>>> 15 DSPS 15
>>>>>>>>>>>>> 16 SNAT 15
>>>>>>>>>>>>> 17 AARS 16
>>>>>>>>>>>>> 18 CMT2N 16
>>>>>>>>>>>>> 19 AAV 17
>>>>>>>>>>>>> 20 AAVS1 17
>>>>>>>>>>>>> 21 ABAT 18
>>>>>>>>>>>>> 22 GABA-AT 18
>>>>>>>>>>>>> 23 GABAT 18
>>>>>>>>>>>>> 24 NPD009 18
>>>>>>>>>>>>> 25 ABC-1 19
>>>>>>>>>>>>> 26 ABC1 19
>>>>>>>>>>>>> 27 ABC1 63897
>>>>>>>>>>>>> 28 ABCA1 19
>>>>>>>>>>>>> Warning message:
>>>>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>>>> 'select' and duplicate query keys resulted in 1:many
>>>>>>>>>>>>> mapping
>>>>>>>>>>>>> between
>>>>>>>>>>>>> keys and return rows
>>>>>>>>>>>>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>>>>>>>>>>>> $`1982`
>>>>>>>>>>>>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>>>>>>>>>>>>
>>>>>>>>>>>>> $`6530`
>>>>>>>>>>>>> [1] "solute carrier family 6 (neurotransmitter transporter,
>>>>>>>>>>>>> noradrenalin), member 2"
>>>>>>>>>>>>>
>>>>>>>>>>>>> $`10991`
>>>>>>>>>>>>> [1] "solute carrier family 38, member 3"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 25 July 2013 18:17, James W. MacDonald<jmacdon at uw.edu>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear James,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks very much for your prompt reply.
>>>>>>>>>>>>>>>> I knew the problem was the for loop and the select
>>>>>>>>>>>>>>>> function is
>>>>>>>>>>>>>>>> indeed
>>>>>>>>>>>>>>>> a lot faster than that and works perfectly with toy data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, this is what happens when I try to use it with
>>>>>>>>>>>>>>>> real
>>>>>>>>>>>>>>>> data:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol,
>>>>>>>>>>>>>>>>> keytype="ALIAS",
>>>>>>>>>>>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Warning message:
>>>>>>>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>>>>>>> 'select' and duplicate query keys resulted in
>>>>>>>>>>>>>>>> 1:many
>>>>>>>>>>>>>>>> mapping
>>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>> keys and return rows
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> which is probably the warning you mentioned.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That's not the warning I mentioned, but it does point out
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>> issue,
>>>>>>>>>>>>>>> which is that there is a one to many mapping between
>>>>>>>>>>>>>>> symbol and
>>>>>>>>>>>>>>> entrez gene
>>>>>>>>>>>>>>> ID.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So now you have to decide if you want to be naive (or
>>>>>>>>>>>>>>> stupid,
>>>>>>>>>>>>>>> depending on
>>>>>>>>>>>>>>> your perspective) or not. You could just cover your eyes and
>>>>>>>>>>>>>>> do this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> which will choose for you the first symbol -> gene ID
>>>>>>>>>>>>>>> mapping
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> nuke the
>>>>>>>>>>>>>>> rest. That's nice and quick, but you are making huge
>>>>>>>>>>>>>>> assumptions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or you could decide to be a bit more sophisticated and do
>>>>>>>>>>>>>>> something like
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL,
>>>>>>>>>>>>>>> function(x)
>>>>>>>>>>>>>>> first.two[x,])
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At this point you can take a look at e.g., thelst[1:10]
>>>>>>>>>>>>>>> to see
>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>> we just
>>>>>>>>>>>>>>> did
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thelst<- do.call("rbind", lapply(thelst, function(x)
>>>>>>>>>>>>>>> c(x[1,1],
>>>>>>>>>>>>>>> paste(x[,2],
>>>>>>>>>>>>>>> collapse = "|")))
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and here you can look at head(thelst).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then you can check to ensure that the first column of
>>>>>>>>>>>>>>> thelst is
>>>>>>>>>>>>>>> identical to
>>>>>>>>>>>>>>> the first column of df, and proceed as before.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But there is still the problem of the multiple mappings.
>>>>>>>>>>>>>>> As an
>>>>>>>>>>>>>>> example:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> thelst[1:5]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $HBD
>>>>>>>>>>>>>>> SYMBOL ENTREZID
>>>>>>>>>>>>>>> 2535 HBD 3045
>>>>>>>>>>>>>>> 2536 HBD 100187828
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $KIR3DL3
>>>>>>>>>>>>>>> SYMBOL ENTREZID
>>>>>>>>>>>>>>> 17513 KIR3DL3 115653
>>>>>>>>>>>>>>> 17514 KIR3DL3 100133046
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $`3045`
>>>>>>>>>>>>>>> [1] "hemoglobin, delta"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $`100187828`
>>>>>>>>>>>>>>> [1] "hypophosphatemic bone disease"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $`115653`
>>>>>>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor, three
>>>>>>>>>>>>>>> domains,
>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>> cytoplasmic tail, 3"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $`100133046`
>>>>>>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor three domains
>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>> cytoplasmic
>>>>>>>>>>>>>>> tail 3"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So HBD is the gene symbol for two different genes! If
>>>>>>>>>>>>>>> this gene
>>>>>>>>>>>>>>> symbol is in
>>>>>>>>>>>>>>> your data, you will now have attributed your data to two
>>>>>>>>>>>>>>> genes
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> apparently are not remotely similar. if KIR3DL3 is in your
>>>>>>>>>>>>>>> data,
>>>>>>>>>>>>>>> then it
>>>>>>>>>>>>>>> worked out OK for that gene.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The real problem is that the number of rows is now
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> the 2
>>>>>>>>>>>>>>>> objects:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> nrow(df); nrow(test)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] 573
>>>>>>>>>>>>>>>> [1] 201
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So I obviously can't put the new data into the original
>>>>>>>>>>>>>>>> df. My
>>>>>>>>>>>>>>>> impression is that when the 1 to many mapping arises, the
>>>>>>>>>>>>>>>> select
>>>>>>>>>>>>>>>> functions exits, with that warning message. As a result, my
>>>>>>>>>>>>>>>> test
>>>>>>>>>>>>>>>> object is incomplete.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On top of that, and I can't really explain this, the row
>>>>>>>>>>>>>>>> positions are
>>>>>>>>>>>>>>>> messed up, e.g.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> all.equal(df[100,],test[100,])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> returns FALSE.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> How can I work around this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at uw.edu>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I often have data frames where I need to perform ID
>>>>>>>>>>>>>>>>>> conversions on
>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> more of the columns while preserving the order of the
>>>>>>>>>>>>>>>>>> rows,
>>>>>>>>>>>>>>>>>> e.g.:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> GeneSymbol Value1 Value2
>>>>>>>>>>>>>>>>>> GS1 2.5 0.1
>>>>>>>>>>>>>>>>>> GS2 3 0.2
>>>>>>>>>>>>>>>>>> ..
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And I want to obtain:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> GeneSymbol EntrezGeneID Value1 Value2
>>>>>>>>>>>>>>>>>> GS1 EG1 2.5 0.1
>>>>>>>>>>>>>>>>>> GS2 EG2 3 0.2
>>>>>>>>>>>>>>>>>> ..
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What I've done so far was to create a function that uses
>>>>>>>>>>>>>>>>>> org.Hs.eg.db to
>>>>>>>>>>>>>>>>>> loop over the rows of the column and does the conversion:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> library(org.Hs.eg.db)
>>>>>>>>>>>>>>>>>> alias2EG<- function(x) {
>>>>>>>>>>>>>>>>>> for (i in 1:length(x)) {
>>>>>>>>>>>>>>>>>> if (!is.na(x[i])) {
>>>>>>>>>>>>>>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1]
>>>>>>>>>>>>>>>>>> if (!is.null(repl)) {
>>>>>>>>>>>>>>>>>> x[i]<- repl
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> else {
>>>>>>>>>>>>>>>>>> x[i]<- NA
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> return(x)
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I should first note that gene symbols are not unique,
>>>>>>>>>>>>>>>>> so you
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> taking a
>>>>>>>>>>>>>>>>> chance on your mappings. Is there no other annotation for
>>>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>> data?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In addition, you should note that it is almost always
>>>>>>>>>>>>>>>>> better
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> think of
>>>>>>>>>>>>>>>>> objects as vectors and matrices in R, rather than as
>>>>>>>>>>>>>>>>> things
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> looped over (e.g., R isn't Perl or C).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> first.two<- select(org.Hs.eg.db,
>>>>>>>>>>>>>>>>> as.character(df$GeneSymbol),
>>>>>>>>>>>>>>>>> "ENTREZID",
>>>>>>>>>>>>>>>>> "SYMBOL")
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Note that there used to be a warning or an error (don't
>>>>>>>>>>>>>>>>> remember
>>>>>>>>>>>>>>>>> which)
>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>> you did something like this, stating that gene symbols are
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> unique,
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> that you shouldn't do this sort of thing. Apparently this
>>>>>>>>>>>>>>>>> warning has
>>>>>>>>>>>>>>>>> been
>>>>>>>>>>>>>>>>> removed, but the issue remains valid.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ## check yourself
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ## if true, proceed
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> df<- data.frame(first.two, df[,-1])
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> and then call the function like this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This works well, but gets very slow when I need to do
>>>>>>>>>>>>>>>>>> multiple
>>>>>>>>>>>>>>>>>> conversions
>>>>>>>>>>>>>>>>>> on large datasets.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there any way I can achieve the same result but in a
>>>>>>>>>>>>>>>>>> quicker, more
>>>>>>>>>>>>>>>>>> efficient way?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> James W. MacDonald, M.S.
>>>>>>>>>>>>>>>>> Biostatistician
>>>>>>>>>>>>>>>>> University of Washington
>>>>>>>>>>>>>>>>> Environmental and Occupational Health Sciences
>>>>>>>>>>>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>>>>>>>>>>>> Seattle WA 98105-6099
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> James W. MacDonald, M.S.
>>>>>>>>>>>>>>> Biostatistician
>>>>>>>>>>>>>>> University of Washington
>>>>>>>>>>>>>>> Environmental and Occupational Health Sciences
>>>>>>>>>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>>>>>>>>>> Seattle WA 98105-6099
>>>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Hervé Pagès
>>>>>>>>>>>>
>>>>>>>>>>>> Program in Computational Biology
>>>>>>>>>>>> Division of Public Health Sciences
>>>>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>>>>>> P.O. Box 19024
>>>>>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>>>>>
>>>>>>>>>>>> E-mail: hpages at fhcrc.org
>>>>>>>>>>>> Phone: (206) 667-5791
>>>>>>>>>>>> Fax: (206) 667-1319
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Hervé Pagès
>>>>>>>>>
>>>>>>>>> Program in Computational Biology
>>>>>>>>> Division of Public Health Sciences
>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>>> P.O. Box 19024
>>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>>
>>>>>>>>> E-mail: hpages at fhcrc.org
>>>>>>>>> Phone: (206) 667-5791
>>>>>>>>> Fax: (206) 667-1319
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone: (206) 667-5791
>>>>> Fax: (206) 667-1319
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list