[R] union of two sets are smaller than one set?

Avi Gross
Sun Jan 31 22:51:41 CET 2021


Martin,

You did not say your two starting objects were already sets. You said they
were vectors of strings, and those strings may well include duplicates. For
example, if I read in lots of text with a blank line between paragraphs, I
would have lots of empty, identical elements. Just converting that into a
set would shrink it.
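
To illustrate the shrinkage, here is a minimal sketch. The vector txt is
made up for the example and is not your data; it only shows that unique()
and union() both collapse repeated strings.

```r
# Hypothetical text, one element per line, with a blank line after each paragraph
txt <- c("First paragraph.", "", "Second paragraph.", "", "Third paragraph.", "")

n_all      <- length(txt)                        # elements, duplicates included
n_distinct <- length(unique(txt))                # distinct strings only
n_union    <- length(union(txt, character(0)))   # union() drops duplicates too

n_all        # 6
n_distinct   # 4: the three paragraphs plus a single ""
n_union      # 4
```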

You have not said how you created or processed your initial two vectors. It
is also possible that some entries were effectively DELETED, for instance by
being replaced with NA or an empty string, which would leave the length of
the vector larger than the number of useful entries.
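
A rough way to look for such placeholder entries is sketched below. The
vector v is invented for illustration, and treating NA and "" as the
"deleted" markers is only my assumption about what might have happened.

```r
# Hypothetical vector in which two entries became NA and one became ""
v <- c("a.dk", NA, "b.dk", NA, "")

n_total <- length(v)                    # NA and "" still count toward the length
n_na    <- sum(is.na(v))                # missing entries
n_empty <- sum(v == "", na.rm = TRUE)   # empty strings, ignoring the NAs
n_uniq  <- length(unique(v))            # repeated NAs collapse to a single NA

n_total   # 5
n_na      # 2
n_empty   # 1
n_uniq    # 4
```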

Your strings look like they may be filenames. Are they unique, especially
if they are files in different folders/directories?

There are many ways to check, but using your method, try this:

length(base::union(s1, s1))
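
If that number is smaller than length(s1), then s1 contains duplicates. The
sketch below shows the same check, plus two other base R functions, on toy
vectors. The names s1_toy and s2_toy and their contents are hypothetical
stand-ins for your data.

```r
# Hypothetical stand-ins for the real vectors
s1_toy <- c("0.dk", "a.dk", "a.dk", "b.dk", "b.dk", "b.dk")
s2_toy <- c("a.dk", "c.dk")

n_s1      <- length(s1_toy)                       # 6 elements in total
n_self    <- length(base::union(s1_toy, s1_toy))  # 3 distinct values
n_dups    <- sum(duplicated(s1_toy))              # 3 elements repeat an earlier one
first_dup <- anyDuplicated(s1_toy)                # index of first duplicate, 0 if none
n_union   <- length(base::union(s1_toy, s2_toy))  # 4, which is less than length(s1_toy)

c(n_s1 = n_s1, n_self = n_self, n_dups = n_dups,
  first_dup = first_dup, n_union = n_union)
```

Here the union of the two toy vectors has 4 elements even though s1_toy
alone has 6, which is the same pattern you describe.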

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Martin Møller
Skarbiniks Pedersen
Sent: Sunday, January 31, 2021 3:57 PM
To: R mailing list <r-help using r-project.org>
Subject: [R] union of two sets are smaller than one set?

This is really puzzling me, and when I try to make a small example everything
works as expected.

The problem:

I have these two large vectors of strings.

> str(s1)
 chr [1:766608] "0.dk" ...
> str(s2)
 chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...

And I need to create the union-set of s1 and s2.
I expect the size of the union-set to be between 766608 and 766608+59387.
However, it is 681193, which is less than the number of elements in s1!

> length(base::union(s1, s2))
[1] 681193

Any hints?

Regards
Martin
