[R] How to filter data using sets generated by flattening with dcast, when I can't store those sets in a data frame

Thu Mar 12 15:53:03 CET 2015

In base R you can do what I think you want with aggregate() and Filter().
E.g.,
  > a <- aggregate(df["Day"], df["ID"], function(x)x)
  > str(a)
  'data.frame':   3 obs. of  2 variables:
   $ ID : num  1 2 3
   $ Day:List of 3
    ..$ 1: num  1 2 4 7
    ..$ 5: num  2 3
    ..$ 7: num  1 3 4 8
  > i14 <- Filter(function(i){all(c(1,4) %in% a$Day[[i]])},
  +               seq_len(nrow(a)))
  > a[i14,]
    ID        Day
  1  1 1, 2, 4, 7
  3  3 1, 3, 4, 8
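
The same idea can be sketched without aggregate(), here with the required
set passed in as a parameter so that {1,4,9} from the question can be tried
too. This is only a minimal sketch: split() and vapply() are base R, but the
names days_by_id and has_days are made up for illustration.

```r
# Sample data from the question (Pain column omitted).
df <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# One list element per patient, each holding that patient's vector of days.
days_by_id <- split(df$Day, df$ID)

# One TRUE/FALSE per patient: does the patient have every required day?
has_days <- function(days, required) {
  vapply(days, function(d) all(required %in% d), logical(1))
}

names(which(has_days(days_by_id, c(1, 4))))     # "1" "3"
names(which(has_days(days_by_id, c(1, 4, 9))))  # no patient has day 9
```

With the sample data no patient was observed on day 9, so the second call
selects nobody.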

Note that 'reshape2' is not 'R', it is a user-contributed package that runs
in R.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Mar 12, 2015 at 12:55 AM, Jocelyn Ireson-Paine <popx at j-paine.org>
wrote:

> This is a fairly long question. It's about a problem that's easy to
> specify in terms of sets, but that I found hard to solve in R by using
> them, because of the strange design of R data structures. In explaining it,
> I'm going to touch on the reshape2 library, dcast, sets, and the
> non-orthogonality of R.
>
> My problem stems from some drug-trial data that I've been analysing for
> the Oxford Pain Research Unit. Here's an example. Imagine a data frame
> representing patients in a trial of pain-relief drugs. The trial lasts for
> ten days. Each patient's pain is measured once a day, and the values are
> recorded in a data frame, one row per patient per day. Like this:
>
>   ID  Day  Pain
>    1    1  10
>    1    2   9
>    1    4   7
>    1    7   2
>    2    2   8
>    2    3   7
>    3    1  10
>    3    3   6
>    3    4   6
>    3    8   2
>
> Unfortunately, many patients have measurements missing. Thus, in the
> example above, patient 1 was only observed on days 1, 2, 4, and 7, rather
> than on the full ten days. But a patient's measurements are only useful to
> us if that patient has a certain minimum set of days, so I need to check
> for patients who lack those days. Let's assume that these days are numbers
> 1, 4, and 9.
>
> Such a question is trivial to state in terms of sets. Let D(i) denote the
> set of days on which patient i was measured: then I want to find out which
> patients p, or how many patients p, have a D(p) that contains the set
> {1,4,9}.
>
> The obvious way to solve this is to write a function that tells me whether
> one set is a superset of another. Then flatten my data frame so that it
> looks like this:
>
>   ID  Days
>    1  {1,2,4,7}
>    2  {2,3}
>    3  {1,3,4,8}
>
> And finally, filter it by some R translation of
>
>   flattened[ includes( flattened$Days, {1,4,9} ), ]
>
> I started with the built-in functions that operate on sets represented as
> vectors. These are described in
>  https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html ,
> "Set Operations". For example:
>
>   > union( c(1,2,3), c(2,4,6) )
>   [1] 1 2 3 4 6
>   > intersect( c(1,2,3), c(2,4,6) )
>   [1] 2
>
> So I first wrote a set-inclusion function:
>
>   # True if vector a is a superset of vector b.
>   #
>   includes <- function( a, b )
>   {
>     return( setequal( union( a, b ), a ) )
>   }
>
> Here are some sample calls:
>
>   > includes( c(1), c() )
>   [1] TRUE
>   > includes( c(1), c(1) )
>   [1] TRUE
>   > includes( c(1), c(1,2) )
>   [1] FALSE
>   > includes( c(2,1), c(1,2) )
>   [1] TRUE
>   > includes( c(2,1,3), c(1,2) )
>   [1] TRUE
>   > includes( c(2,1,3), c(4,1,2) )
>   [1] FALSE
>
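
As a side note, the same superset test can be sketched with %in% alone,
without going through union() and setequal(). The name includes_in is
hypothetical and does not appear in the original post.

```r
# TRUE if every element of b also occurs in a, i.e. a is a superset of b.
includes_in <- function(a, b) {
  all(b %in% a)
}

includes_in(c(1), c())               # the empty set is included in any set
includes_in(c(2, 1, 3), c(1, 2))     # both 1 and 2 are present
includes_in(c(2, 1, 3), c(4, 1, 2))  # 4 is missing
```

It agrees with includes() on the six sample calls above.
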
> I then made myself a variable holding my sample data frame:
>
>   df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
>                   , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
>                   )
>
> And I tried flattening it, using dcast and an aggregator function as
> described in (amongst many other places) "An Introduction to reshape2" by
> Sean C. Anderson, http://seananderson.ca/2013/10/19/reshape.html .
>
> The idea behind this is that (for my data) dcast will call the aggregator
> function once per patient ID, passing it all the Day values for the
> patient. The aggregator must combine them in some way, and dcast puts its
> results into a new column. For example, here's an aggregator that merely
> sums its arguments:
>
>   aggregator_making_sum <- function( ... )
>   {
>     return( sum( ... ) )
>   }
>
> If I call it, I get this:
>
>   >  dcast( df, ID~. , fun.aggregate=aggregator_making_sum )
>   Using Day as value column: use value.var to override.
>     ID  .
>   1  1 14
>   2  2  5
>   3  3 16
>
> And here's an aggregator that converts the argument list to a string:
>
>   aggregator_making_string <- function( ... )
>   {
>     return( toString( ... ) )
>   }
>
> Calling it gives this:
>
>   >  dcast( df, ID~. , fun.aggregate=aggregator_making_string )
>   Using Day as value column: use value.var to override.
>     ID          .
>   1  1 1, 2, 4, 7
>   2  2       2, 3
>   3  3 1, 3, 4, 8
>
> In both of these, the three dots denote all arguments to the aggregator,
> as explained in Burns Statistics's
> http://www.burns-stat.com/the-three-dots-construct-in-r/ . My first
> aggregator sums them; my second converts them to a string. Both uses of
> dcast generate a data frame with a column named "." , which contains the
> aggregates. In the second data frame, that may not be so clear: the first
> column of numbers holds the row numbers; the second column of numbers holds
> the IDs; and the remaining columns form the strings, belonging to "." .
>
> But what I want is neither a sum nor a string but a set. Specifically, a
> set that's compatible with the R set operations I called in my 'includes'
> function. Since these sets are vectors, my aggregator should just pack its
> arguments into a vector:
>
>   aggregator_making_set <- function( ... )
>   {
>     return( c( ... ) )
>   }
>
> But when I tried it, I got an error:
>
>   > dcast( df, ID~. , fun.aggregate=aggregator_making_set )
>   Using Day as value column: use value.var to override.
>   Error in vapply(indices, fun, .default) : values must be length 0,
>    but FUN(X[[1]]) result is length 4
>
> It's not an informative error message, because it expects me to know how
> dcast is coded. And I'm surprised that values need to be length 0: length 1
> would seem more appropriate. But perhaps it's trying to say that 'c'
> doesn't work on three-dots argument lists. Let's test that hypothesis:
>
>   test_c_on_three_dots <- function( ... )
>   {
>     return( c( ... ) )
>   }
>
>   >   test_c_on_three_dots( 1 )
>   [1] 1
>   >   test_c_on_three_dots( 1, 2 )
>   [1] 1 2
>   >   test_c_on_three_dots( 1, 2, 3 )
>   [1] 1 2 3
>
> So 'c' does indeed work on three-dots argument lists. The error must have
> been caused by something else. Let's try making a set and putting it into a
> data frame directly:
>
>   > df <- data.frame( col1=c(1,2), col2=c(3,4) )
>   > df
>     col1 col2
>   1    1    3
>   2    2    4
>   > set <- union( c(5,6), c(6,7) )
>   > set
>   [1] 5 6 7
>   > df[ 1, ]$col1 <- set
>   Error in `$<-.data.frame`(`*tmp*`, "col1", value = c(5, 6, 7)) :
>     replacement has 3 rows, data has 1
>
> So that's the problem. Already in 1968, there was a language named Algol68
> which had arrays and, in order to make things easy for its programmers,
> allowed you to create arrays of every data type the language provided. You
> could have arrays of Booleans, arrays of integers, arrays of records,
> arrays of discriminated unions, arrays of procedures, arrays of I/O
> formats, arrays of pointers, and arrays of arrays. The idea was
> "orthogonality" (see for example
> http://stackoverflow.com/questions/1527393/what-is-orthogonality ):
> that the programmer does not
> have to think about unexpected interactions between the concept of array
> and the concept of the element type, because there are none. If you have a
> data type, you can make arrays of that type. Pop-2 (1970), Snobol4 (1966),
> and Lisp (1958) were similarly generous. But R (1993) isn't. It wants to
> make life hard by forcing me to use different kinds of container for
> different kinds of element. And by providing a nice implementation of sets
> and then not letting me store them.
>
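
For what it is worth, a data frame column may itself be a list, and a list
column can hold one vector of any length per row. The following is a
minimal sketch under that assumption; df2 and the column name sets are
invented for illustration.

```r
df2 <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
set <- union(c(5, 6), c(6, 7))

# Assign a list with one element per row; each element is a whole vector.
df2$sets <- list(set, c(1, 4))

# Individual "cells" are read and replaced with [[ ]].
df2$sets[[1]]
df2$sets[[2]] <- c(1, 4, 9)

# Keep the rows whose set contains both 5 and 7.
keep <- vapply(df2$sets, function(s) all(c(5, 7) %in% s), logical(1))
df2[keep, ]
```

The assignment that failed above tried to put a length-3 vector into a
single cell of an ordinary numeric column; a list column avoids that.
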
> So I thought about the kinds of data that I _can_ store in a data frame
> and generate by flattening. Strings! So I decided to use my
> aggregator_making_string function to make a string representation of the
> set of days, and to write a set-inclusion function that compared these sets
> against sets represented as vectors:
>
>   includes2 <- function( a_as_string, b )
>   {
>     a <- as.numeric( unlist( strsplit( a_as_string, split="," ) ) )
>     return( setequal( union( a, b ), a ) )
>   }
>
> Here are some example calls:
>
>   > includes2( '1,2,3', c(1) )
>   [1] TRUE
>   > includes2( '1,2,3', c(1,2) )
>   [1] TRUE
>   > includes2( '1,2,3', c(1,2,4) )
>   [1] FALSE
>   > includes2( '1,2,3', c(3) )
>   [1] TRUE
>   > includes2( '1,2,3', c(0,3) )
>   [1] FALSE
>   >
>
> I then tried using it:
>
>   df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
>                   , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
>                   )
>
>   aggregator_making_string <- function( ... )
>   {
>     return( toString( ... ) )
>   }
>
>   flattened <- dcast( df, ID~. , fun.aggregate=aggregator_making_string )
>
>   # Which patients have a day 1?
>   flattened[ includes2( flattened$. , c(1) ), ]
>
> Unfortunately, that didn't work. The final statement selected every row of
> 'flattened'. I eventually realised that I had to vectorise 'includes2':
>
>   includes3 <- Vectorize( includes2, "a_as_string" )
>
> And that did work:
>
>   >   flattened[ includes3( flattened$. , c(1) ), ]
>     ID          .
>   1  1 1, 2, 4, 7
>   3  3 1, 3, 4, 8
>   >   flattened[ includes3( flattened$. , c(1,2) ), ]
>     ID          .
>   1  1 1, 2, 4, 7
>   >   flattened[ includes3( flattened$. , c(1,3) ), ]
>     ID          .
>   3  3 1, 3, 4, 8
>   >   flattened[ includes3( flattened$. , c(2) ), ]
>     ID          .
>   1  1 1, 2, 4, 7
>   2  2       2, 3
>
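
The reason the unvectorised call selected every row can be sketched as
follows: strsplit() on a character vector returns a list with one element
per string, and unlist() then pools all of them into a single set. The
variable names below are made up for illustration, and the strings are
typed in by hand to match the dcast output shown earlier.

```r
flat <- c("1, 2, 4, 7", "2, 3", "1, 3, 4, 8")

# One character vector per input string.
parts <- strsplit(flat, split = ",")

# unlist() merges them, so the check sees the days of all patients at once
# and returns a single TRUE, which is recycled over every row.
pooled <- as.numeric(unlist(parts))
all(c(1) %in% pooled)

# Checking each element separately gives one logical value per row.
per_row <- vapply(parts,
                  function(p) all(c(1) %in% as.numeric(p)),
                  logical(1))
per_row
```

Vectorize() achieves the same per-row behaviour by calling the function
once for each element of its first argument.
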
> The moral of this email tale is that sets are really useful for filtering
> data, and dcast ought to be really useful for generating sets, but R
> refuses to let me store them in the data frame that dcast generates. I can
> fudge it by representing the sets as strings, but is there a cleaner way to
> solve the problem?
>
> Cheers,
>
> Jocelyn Ireson-Paine
> 07768 534 091
> http://www.jocelyns-cartoons.uk
> http://www.j-paine.org
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
