[Bioc-devel] merging DFrames

Laurent Gatto <laurent.gatto using uclouvain.be>
Wed Oct 21 20:22:24 CEST 2020


Thank you both - issue has just been opened.

Merci Hervé for pointing out the direct use of the `List()` constructor.

Laurent

________________________________________
From: Michael Lawrence <lawrence.michael using gene.com>
Sent: 21 October 2020 19:13
To: Pages, Herve
Cc: Laurent Gatto; bioc-devel using r-project.org
Subject: Re: [Bioc-devel] merging DFrames

Laurent,

Thanks for bringing this up and offering to help. Yes, please raise an issue. There's an opportunity to implement faster matching than base::merge(), using stuff like matchIntegerQuads(), findMatches(), and grouping().

grouping() can be really fast for character vectors, since it takes advantage of string internalization. For example, let's say you're merging on three character vector keys. Concatenate the keys of 'y' onto the keys of 'x'. Then call grouping(k1, k2, k3) on the concatenated keys and you effectively have a matching. Should be way faster than the paste() approach used by base::merge(). Would be interesting to see.
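
The grouping() idea above could be sketched roughly as follows, using base R only. This is a minimal illustration under assumptions, not the S4Vectors implementation: match_keys() is a hypothetical helper name, it handles only first-match semantics (like match()), and it relies on the "ends" attribute of the grouping() result, which the R documentation describes as experimental.

```r
# Hypothetical helper (not part of base R or S4Vectors): for each row of 'x',
# return the index of the first row of 'y' with identical values in all keys.
# 'x_keys' and 'y_keys' are lists of equal-length key vectors, one per key.
match_keys <- function(x_keys, y_keys) {
  stopifnot(length(x_keys) == length(y_keys), length(x_keys) > 0L)
  nx <- length(x_keys[[1L]])
  ny <- length(y_keys[[1L]])

  # Concatenate the keys of 'y' onto the keys of 'x', key by key.
  stacked <- unname(Map(c, x_keys, y_keys))

  # grouping() returns a permutation that makes identical key tuples
  # contiguous; "ends" holds the last position of each group.
  g <- do.call(grouping, stacked)
  ends <- attr(g, "ends")

  # Give every stacked row the integer id of its group.
  group_id <- integer(nx + ny)
  group_id[as.integer(g)] <- rep.int(seq_along(ends), diff(c(0L, ends)))

  # Matching integer group ids replaces matching pasted strings.
  match(group_id[seq_len(nx)], group_id[nx + seq_len(ny)])
}

x_keys <- list(c("a", "b", "c", "a"), c("u", "v", "w", "z"))
y_keys <- list(c("c", "a", "b"), c("w", "u", "x"))
m <- match_keys(x_keys, y_keys)
print(m)  # 2 NA 1 NA
```

The result is an index vector into 'y', so rows could then be combined by subsetting rather than by converting to data.frame. Whether this actually beats the paste() approach would need benchmarking.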

Michael

On Wed, Oct 21, 2020 at 9:37 AM Pages, Herve <hpages using fredhutch.org> wrote:
Hi Laurent,

I think the current implementation was just an expedient to have
something that works (in most cases). I don't know if a proper
implementation that doesn't go thru data.frame is on the TODO list. Michael?

I suggest you open an issue on GitHub under S4Vectors.

Cheers,
H.

PS: Note that you can pass the list elements directly to the List()
constructor, no need to construct an ordinary list first:

   List(1, 1:2, 1:3)  # same as List(list(1, 1:2, 1:3))


On 10/21/20 08:35, Laurent Gatto wrote:
> When merging DFrame instances, the *List types are lost:
>
> The following two instances have NumericList columns (y and z):
> d1 <- DataFrame(x = letters[1:3], y = List(list(1, 1:2, 1:3)))
> d2 <- DataFrame(x = letters[1:3], z = List(list(1:3, 1:2, 1)))
>
> d1
> ## DataFrame with 3 rows and 2 columns
> ##             x             y
> ##   <character> <NumericList>
> ## 1           a             1
> ## 2           b           1,2
> ## 3           c         1,2,3
>
> These are, however, converted to list when merged:
>
> merge(d1, d2, by = "x")
> ## DataFrame with 3 rows and 3 columns
> ##             x      y      z
> ##   <character> <list> <list>
> ## 1           a      1  1,2,3
> ## 2           b    1,2    1,2
> ## 3           c  1,2,3      1
>
> Looking at merge,DataTable,DataTable (from which merge,DFrame,DFrame inherits), this makes sense: the inputs are converted to data.frames, merged with merge,data.frame,data.frame, and the result is coerced back to DFrame:
>
>> getMethod("merge", c("DataTable", "DataTable"))
> Method Definition:
>
> function (x, y, ...)
> {
>      .local <- function (x, y, by, ...)
>      {
>          if (is(by, "Hits")) {
>              return(.mergeByHits(x, y, by, ...))
>          }
>          as(merge(as(x, "data.frame"), as(y, "data.frame"), by,
>              ...), class(x))
>      }
>      .local(x, y, ...)
> }
> <bytecode: 0x556dd0032ca8>
> <environment: namespace:S4Vectors>
>
> Signatures:
>          x           y
> target  "DataTable" "DataTable"
> defined "DataTable" "DataTable"
>
> I would like not to lose the *List classes in the individual DFrames.
>
> Am I missing something? Is this something that is on the todo list, or that I could help with?
>
> Best wishes,
>
> Laurent
>
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


--
Michael Lawrence
Senior Scientist, Data Science and Statistical Computing
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
michafla using gene.com
