[Bioc-devel] merging DFrames

Michael Lawrence lawrence.michael at gene.com
Wed Oct 21 19:13:12 CEST 2020


Laurent,

Thanks for bringing this up and offering to help. Yes, please raise an
issue. There's an opportunity to implement faster matching than
base::merge(), using stuff like matchIntegerQuads(), findMatches(), and
grouping().

grouping() can be really fast for character vectors, since it takes
advantage of string interning (R's global string cache). For example, let's
say you're merging on three character vector keys. Concatenate the keys of
'y' onto the keys of 'x'. Then call grouping(k1, k2, k3) on the concatenated
keys: rows that land in the same group share a key, so you effectively have
a matching. Should be way faster than the paste() approach used by
base::merge(). Would be interesting to see.
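
A minimal sketch of that idea in base R, not benchmarked and not tied to
the S4Vectors internals. The helper name match_keys() is made up for
illustration; it takes the key columns of 'x' and 'y' as two lists of
character vectors and returns the matching row pairs for an inner join:

```r
# Hypothetical helper (not part of S4Vectors): match rows of 'y' to rows
# of 'x' on one or more key vectors using base::grouping().
match_keys <- function(xkeys, ykeys) {
  stopifnot(length(xkeys) == length(ykeys), length(xkeys) >= 1L)
  nx <- length(xkeys[[1L]])

  # Concatenate the keys of 'y' onto the keys of 'x', one vector per key.
  keys <- unname(Map(c, xkeys, ykeys))
  g <- do.call(grouping, keys)
  ord <- as.integer(g)         # permutation that puts equal keys together
  ends <- attr(g, "ends")      # last position of each group within 'ord'
  n <- length(ord)

  # Assign every row (x rows first, then y rows) an integer group id.
  sizes <- diff(c(0L, ends))
  gid <- integer(n)
  gid[ord] <- rep.int(seq_along(ends), sizes)
  gx <- gid[seq_len(nx)]
  gy <- gid[nx + seq_len(n - nx)]

  # For each y row, look up the x rows that fall in the same group.
  x_by_gid <- split(seq_len(nx), factor(gx, levels = seq_along(ends)))
  hits <- unname(x_by_gid[gy])
  data.frame(x = as.integer(unlist(hits, use.names = FALSE)),
             y = rep.int(seq_along(gy), lengths(hits)))
}

# Two keys per table; the third row of 'y' has no partner in 'x'.
pairs <- match_keys(list(c("a", "b", "c"), c("u", "v", "w")),
                    list(c("c", "a", "z"), c("w", "u", "w")))
print(pairs)
```

Each row of the result pairs a row index of 'x' with a row index of 'y'
that has identical keys, so the merged table can be built by subsetting
both inputs with those indices, without converting any column.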

Michael

On Wed, Oct 21, 2020 at 9:37 AM Pages, Herve <hpages using fredhutch.org> wrote:

> Hi Laurent,
>
> I think the current implementation was just an expedient to have
> something that works (in most cases). I don't know if a proper
> implementation that doesn't go thru data.frame is on the TODO list.
> Michael?
>
> I suggest you open an issue on GitHub under S4Vectors.
>
> Cheers,
> H.
>
> PS: Note that you can pass the list elements directly to the List()
> constructor, no need to construct an ordinary list first:
>
>    List(1, 1:2, 1:3)  # same as List(list(1, 1:2, 1:3))
>
>
> On 10/21/20 08:35, Laurent Gatto wrote:
> > When merging DFrame instances, the *List types are lost:
> >
> > The following two instances have NumericList columns (y and z)
> > d1 <- DataFrame(x = letters[1:3], y = List(list(1, 1:2, 1:3)))
> > d2 <- DataFrame(x = letters[1:3], z = List(list(1:3, 1:2, 1)))
> >
> > d1
> > ## DataFrame with 3 rows and 2 columns
> > ##             x             y
> > ##   <character> <NumericList>
> > ## 1           a             1
> > ## 2           b           1,2
> > ## 3           c         1,2,3
> >
> > These are, however, converted to list when merged:
> >
> > merge(d1, d2, by = "x")
> > ## DataFrame with 3 rows and 3 columns
> > ##             x      y      z
> > ##   <character> <list> <list>
> > ## 1           a      1  1,2,3
> > ## 2           b    1,2    1,2
> > ## 3           c  1,2,3      1
> >
> > Looking at merge,DataTable,DataTable (from which merge,DFrame,DFrame
> > inherits), this makes sense given that they are converted to data.frames,
> > merged with merge,data.frame,data.frame, and the result is coerced back to
> > DFrame:
> >
> >> getMethod("merge", c("DataTable", "DataTable"))
> > Method Definition:
> >
> > function (x, y, ...)
> > {
> >      .local <- function (x, y, by, ...)
> >      {
> >          if (is(by, "Hits")) {
> >              return(.mergeByHits(x, y, by, ...))
> >          }
> >          as(merge(as(x, "data.frame"), as(y, "data.frame"), by,
> >              ...), class(x))
> >      }
> >      .local(x, y, ...)
> > }
> > <bytecode: 0x556dd0032ca8>
> > <environment: namespace:S4Vectors>
> >
> > Signatures:
> >          x           y
> > target  "DataTable" "DataTable"
> > defined "DataTable" "DataTable"
> >
> > I would like not to lose the *List classes in the individual DFrames.
> >
> > Am I missing something? Is this something that is on the todo list, or
> > that I could help with?
> >
> > Best wishes,
> >
> > Laurent
> >
> >
> > _______________________________________________
> > Bioc-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages using fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>


-- 
Michael Lawrence
Senior Scientist, Data Science and Statistical Computing
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
michafla using gene.com
