[Rd] setequal: better readability, reduced memory footprint, and minor speedup

Michael Lawrence lawrence.michael at gene.com
Fri Jan 9 00:38:48 CET 2015


Currently unique() calls duplicated() internally and then extracts the
non-duplicated elements. One could write a countUnique that simply counts,
rather than allocating the logical return value of duplicated(). But so much
of the cost is in the hash operation that it probably won't help much,
though that may depend on the sizes involved. The more unique elements there
are, the better it would perform.
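
A rough R-level sketch of the counting idea, for illustration only. The name
count_unique is hypothetical, and this version still allocates the logical
vector returned by duplicated(), so it shows only the intended semantics. The
allocation saving discussed here would need a C-level implementation that
counts while hashing.

```r
# Hypothetical helper (not part of base R): number of distinct elements in x.
# NOTE: this still allocates the logical vector from duplicated(); it only
# illustrates what a count-only routine would return.
count_unique <- function(x) sum(!duplicated(x))

x <- c(3L, 1L, 3L, 2L, 1L)

# Same answer as extracting the unique values and taking the length.
count_unique(x) == length(unique(x))
```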


On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty <haverty.peter at gene.com>
wrote:

> How about calling unique() on them both and comparing the lengths?  It's
> less work, especially allocation.
>
>
>
> Pete
>
> ____________________
> Peter M. Haverty, Ph.D.
> Genentech, Inc.
> phaverty at gene.com
>
> On Thu, Jan 8, 2015 at 1:30 PM, peter dalgaard <pdalgd at gmail.com> wrote:
>
> > If you look at the definition of %in%, you'll find that it is implemented
> > using match, so if we did as you suggest, I give it about three days
> > before someone suggests to inline the function call... Readability of
> > source code is not usually our prime concern.
> >
> > The && idea does have some merit, though.
> >
> > Apropos, why is there no setcontains()?
> >
> > -pd
> >
> > > On 06 Jan 2015, at 22:02 , Hervé Pagès <hpages at fredhutch.org> wrote:
> > >
> > > Hi,
> > >
> > > Current implementation:
> > >
> > > setequal <- function (x, y)
> > > {
> > >  x <- as.vector(x)
> > >  y <- as.vector(y)
> > >  all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
> > > }
> > >
> > > First, what about replacing 'match(x, y, 0L) > 0L' and
> > > 'match(y, x, 0L) > 0L' with 'x %in% y' and 'y %in% x', respectively?
> > > They're strictly equivalent but the latter form is a lot more readable
> > > than the former (isn't this the "raison d'être" of %in%?):
> > >
> > > setequal <- function (x, y)
> > > {
> > >  x <- as.vector(x)
> > >  y <- as.vector(y)
> > >  all(c(x %in% y, y %in% x))
> > > }
> > >
> > > Furthermore, replacing 'all(c(x %in% y, y %in% x))' with
> > > 'all(x %in% y) && all(y %in% x)' improves readability even more and,
> > > more importantly, reduces memory footprint significantly on big vectors
> > > (e.g. by 15% on integer vectors with 15M elements):
> > >
> > > setequal <- function (x, y)
> > > {
> > >  x <- as.vector(x)
> > >  y <- as.vector(y)
> > >  all(x %in% y) && all(y %in% x)
> > > }
> > >
> > > It also seems to speed up things a little bit (not in a significant
> > > way though).
> > >
> > > Cheers,
> > > H.
> > >
> > > --
> > > Hervé Pagès
> > >
> > > Program in Computational Biology
> > > Division of Public Health Sciences
> > > Fred Hutchinson Cancer Research Center
> > > 1100 Fairview Ave. N, M1-B514
> > > P.O. Box 19024
> > > Seattle, WA 98109-1024
> > >
> > > E-mail: hpages at fredhutch.org
> > > Phone:  (206) 667-5791
> > > Fax:    (206) 667-1319
> > >
> > > ______________________________________________
> > > R-devel at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
> >
>
>
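
For reference, the variants discussed in the quoted thread can be checked
against each other with a small sketch. The function names below are
hypothetical labels, not anything in base R. The last variant is one possible
reading of the unique-and-compare-lengths suggestion: equal numbers of unique
elements are necessary but not sufficient on their own, so this sketch also
compares against the size of the combined set. That extra step is an
assumption, not something stated in the thread.

```r
# Current implementation, as quoted in the thread.
setequal_match <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
}

# Same logic written with %in%.
setequal_in <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  all(c(x %in% y, y %in% x))
}

# Variant using &&: avoids concatenating the two logical vectors and
# skips the second test when the first already fails.
setequal_and <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  all(x %in% y) && all(y %in% x)
}

# Assumed reading of the unique-lengths idea: the sets are equal when both
# have the same number of unique elements and combining them adds none.
setequal_unique <- function(x, y) {
  ux <- unique(as.vector(x))
  uy <- unique(as.vector(y))
  length(ux) == length(uy) && length(unique(c(ux, uy))) == length(ux)
}

a <- c(1L, 2L, 2L, 3L)
b <- c(3L, 1L, 2L)   # same set as a
d <- c(1L, 2L, 4L)   # same number of unique elements as a, different set

variants <- list(setequal_match, setequal_in, setequal_and, setequal_unique)
same_ab <- vapply(variants, function(f) f(a, b), logical(1))
same_ad <- vapply(variants, function(f) f(a, d), logical(1))
```

Here same_ab is TRUE for every variant and same_ad is FALSE for every
variant, even though length(unique(a)) equals length(unique(d)).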
