[Rd] setequal: better readability, reduced memory footprint, and minor speedup

Peter Haverty haverty.peter at gene.com
Fri Jan 9 00:50:24 CET 2015


I was thinking something like:

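setequal <- function(x, y) {
    # Deduplicate first so that the length comparison is meaningful
    xu <- unique(x)
    yu <- unique(y)
    # Different numbers of unique values: fail early, no match() needed
    if (length(xu) != length(yu)) return(FALSE)
    # Equal counts, so containment in one direction implies equality
    all(match(xu, yu, 0L) > 0L)
}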

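This lets you fail early and cheaply: when the unique counts differ, you
skip match() and the logical vector allocated by "> 0L" entirely.  When the
counts agree, a single match() is enough, because two sets of the same size
with one contained in the other must be equal.  How fast this is depends a
lot on how many duplicates x and y contain and on whether you want to
optimize for the TRUE or the FALSE case.  You'd do much better to build real
hash tables in C and compare the keys, but it's probably not worth the
complexity.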
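
For anyone who wants to try the idea, here is a minimal sketch that checks
the unique-then-compare-lengths version against base::setequal on a few
inputs.  The name setequal_unique, the as.vector() calls (copied from the
current implementation) and the test inputs are my own choices for
illustration, not anything proposed for base R, and the timings at the end
will vary by machine.

```r
# Hypothetical name, chosen so that base::setequal is not masked
setequal_unique <- function(x, y) {
    # as.vector() mirrors the current base implementation (an assumption here)
    xu <- unique(as.vector(x))
    yu <- unique(as.vector(y))
    # Different numbers of unique values: cannot be equal as sets
    if (length(xu) != length(yu)) return(FALSE)
    # Same number of unique values: one-directional containment suffices
    all(match(xu, yu, 0L) > 0L)
}

# Arbitrary illustrative inputs
cases <- list(
    list(c(1, 2, 2, 3), c(3, 1, 2)),   # same set; duplicates and order differ
    list(1:3, 1:4),                    # different number of unique values
    list(c("a", "b"), c("a", "c")),    # same count, different members
    list(c(1, NA), c(NA, 1, 1)),       # match() treats NA as matching NA
    list(integer(0), integer(0))       # two empty sets
)

# Both versions should give the same answer on every pair
agree <- vapply(cases, function(p) {
    identical(setequal_unique(p[[1]], p[[2]]),
              base::setequal(p[[1]], p[[2]]))
}, logical(1))
stopifnot(all(agree))

# Rough timing of the early-exit (FALSE) case
x <- sample.int(1e6)
y <- c(x, length(x) + 1L)              # one extra unique value
print(system.time(setequal_unique(x, y)))
print(system.time(base::setequal(x, y)))
```

The early exit only helps when the unique counts differ; when they agree,
the cost is two unique() calls plus one match().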




Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty <phaverty at gene.com> wrote:

> How about unique them both and compare the lengths?  It's less work,
> especially allocation.
>
>
>
> Pete
>
> On Thu, Jan 8, 2015 at 1:30 PM, peter dalgaard <pdalgd at gmail.com> wrote:
>
>> If you look at the definition of %in%, you'll find that it is implemented
>> using match, so if we did as you suggest, I give it about three days before
>> someone suggests to inline the function call... Readability of source code
>> is not usually our prime concern.
>>
>> The && idea does have some merit, though.
>>
>> Apropos, why is there no setcontains()?
>>
>> -pd
>>
>> > On 06 Jan 2015, at 22:02 , Hervé Pagès <hpages at fredhutch.org> wrote:
>> >
>> > Hi,
>> >
>> > Current implementation:
>> >
>> > setequal <- function (x, y)
>> > {
>> >  x <- as.vector(x)
>> >  y <- as.vector(y)
>> >  all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
>> > }
>> >
>> > First, what about replacing 'match(x, y, 0L) > 0L' and
>> > 'match(y, x, 0L) > 0L'
>> > with 'x %in% y' and 'y %in% x', respectively. They're strictly
>> > equivalent but the latter form is a lot more readable than the former
>> > (isn't this the "raison d'être" of %in%?):
>> >
>> > setequal <- function (x, y)
>> > {
>> >  x <- as.vector(x)
>> >  y <- as.vector(y)
>> >  all(c(x %in% y, y %in% x))
>> > }
>> >
>> > Furthermore, replacing 'all(c(x %in% y, y %in% x))' with
>> > 'all(x %in% y) && all(y %in% x)' improves readability even more and,
>> > more importantly, reduces memory footprint significantly on big vectors
>> > (e.g. by 15% on integer vectors with 15M elements):
>> >
>> > setequal <- function (x, y)
>> > {
>> >  x <- as.vector(x)
>> >  y <- as.vector(y)
>> >  all(x %in% y) && all(y %in% x)
>> > }
>> >
>> > It also seems to speed up things a little bit (not in a significant
>> > way though).
>> >
>> > Cheers,
>> > H.
>> >
>> > --
>> > Hervé Pagès
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpages at fredhutch.org
>> > Phone:  (206) 667-5791
>> > Fax:    (206) 667-1319
>> >
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> --
>> Peter Dalgaard, Professor,
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>
>
>
