[External] Re: Strange behavior when sampling rows of a data frame
Fri Jun 19 20:40:32 CEST 2020
The behavior has been there much longer than that in R and it's been a
known issue with complex assignment for a long time (not the only
one). You're in a better position than I to know how Splus handles this.
The complex assignment expression
    df[<index>, ]$treated <- TRUE
is basically evaluated as
    tmp <- df[<index>, ]
    tmp$treated <- TRUE
    df[<index>, ] <- tmp
So the <index> argument is evaluated twice. This is always a little
inefficient, and it is probably not what you want if the index argument
has side effects. So the main take-away is:
    Don't use index arguments with side effects in complex assignments.
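To make the double evaluation concrete, here is a minimal sketch. The
helper idx() and the counter calls are hypothetical names introduced only
for illustration; idx() returns fixed indices so that only the number of
evaluations differs between the two variants.

```r
# Hypothetical helper: returns fixed row indices and counts its own calls.
calls <- 0
idx <- function() {
  calls <<- calls + 1   # side effect: record each evaluation
  c(2L, 5L, 7L)
}

# Complex assignment: the index expression is evaluated once to extract
# the rows and once more to write them back.
df <- data.frame(unit = 1:10, treated = FALSE)
df[idx(), ]$treated <- TRUE
calls_complex <- calls

# Safer: evaluate the index once, store it, and reuse the stored value.
calls <- 0
df2 <- data.frame(unit = 1:10, treated = FALSE)
i <- idx()
df2[i, ]$treated <- TRUE
calls_stored <- calls
```

Under the expansion described above, calls_complex should be 2 and
calls_stored should be 1. With an index such as sample(nrow(df), 3), the
two evaluations would generally pick different rows, which is what
scrambles the data frame.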
It is in principle possible, when standard evaluation is in use, to
capture the value of <index> from the first evaluation and re-use for
the second. But, for better or worse, assignment methods can and do
use non-standard evaluation for the index arguments, and it would be
very hard for authors of such methods to avoid this. So changing to
avoid multiple index evaluation would always have to come with an
asterisk.
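As a rough illustration of why re-using a captured index value is not
always safe, the sketch below defines a made-up replacement function,
`pick<-` (not part of R or any package), that looks at the index
expression as written rather than only at its value. The variable seen is
likewise just for illustration.

```r
# Hypothetical replacement function using non-standard evaluation:
# it records the index expression exactly as the caller wrote it.
seen <- NULL
`pick<-` <- function(x, i, value) {
  seen <<- deparse(substitute(i))  # unevaluated index expression as text
  x[i] <- value                    # then use the evaluated index as usual
  x
}

v <- 1:5
pick(v, 2 + 1) <- 0L   # the method sees the expression 2 + 1
```

If the evaluator substituted the value from a first evaluation for the
index expression, a method like this would see 3 rather than 2 + 1, so
its behavior could change.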
There are other issues with complex assignment as implemented
currently that have higher priority but are also quite tricky to
address. Possibly this one can be addressed at the same time.
On Fri, 19 Jun 2020, William Dunlap via R-help wrote:
> It is a bug that has been present in R since at least R-2.14.0 (the oldest
> that I have installed on my laptop).
>
>
>
> On Fri, Jun 19, 2020 at 10:37 AM Rui Barradas wrote:
>
>> Hello,
>>
>>
>> Thanks, I hadn't thought of that.
>>
>> But, why? Is it evaluated once before assignment and a second time when
>> the assignment occurs?
>>
>> Tracing both sample and `[<-` shows 2 calls to sample.
>>
>>
>> trace(sample)
>> trace(`[<-`)
>> df[sample(nrow(df), 3),]$treated <- TRUE
>> trace: sample(nrow(df), 3)
>> trace: `[<-`(`*tmp*`, sample(nrow(df), 3), , value = list(unit = c(7L,
>> 6L, 8L), treated = c(TRUE, TRUE, TRUE)))
>> trace: sample(nrow(df), 3)
>>
>>
>>
>>
>> At 17:20 on 19/06/2020, William Dunlap wrote:
>>> The first subscript argument is getting evaluated twice.
>>>> trace(sample)
>>>> set.seed(2020); df[i<-sample(10,3), ]$Treated <- TRUE
>>> trace: sample(10, 3)
>>> trace: sample(10, 3)
>>>> i
>>> [1] 1 10 4
>>>> set.seed(2020); sample(10,3)
>>> trace: sample(10, 3)
>>> [1] 7 6 8
>>>> sample(10,3)
>>> trace: sample(10, 3)
>>> [1] 1 10 4
>>>
>>>
>>>
>>> On Fri, Jun 19, 2020 at 8:46 AM Rui Barradas
>>> <ruipbarradas using sapo.pt> wrote:
>>>
>>> Hello,
>>>
>>> I don't have an answer as to why this happens, but it seems like
>>> a bug. Where?
>>>
>>> In which of `[<-.data.frame` or `[<-.default`?
>>>
>>> A solution is to subset and assign the vector:
>>>
>>>
>>> set.seed(2020)
>>> df2 <- data.frame(unit = 1:10)
>>> df2$treated <- FALSE
>>>
>>> df2$treated[sample(nrow(df2), 3)] <- TRUE
>>> df2
>>> #   unit treated
>>> #1     1   FALSE
>>> #2     2   FALSE
>>> #3     3   FALSE
>>> #4     4   FALSE
>>> #5     5   FALSE
>>> #6     6    TRUE
>>> #7     7    TRUE
>>> #8     8    TRUE
>>> #9     9   FALSE
>>> #10   10   FALSE
>>>
>>>
>>> Or
>>>
>>>
>>> set.seed(2020)
>>> df3 <- data.frame(unit = 1:10)
>>> df3$treated <- FALSE
>>>
>>> df3[sample(nrow(df3), 3), "treated"] <- TRUE
>>> df3
>>> # result as expected
>>>
>>>
>>>
>>>
>>>
>>> At 13:49 on 19/06/2020, Sébastien Lahaie wrote:
>>> > I ran into some strange behavior in R when trying to assign a
>>> > treatment to rows in a data frame. I'm wondering whether any R
>>> > experts can explain what's going on.
>>> >
>>> > First, let's assign a treatment to 3 out of 10 rows as follows.
>>> >
>>> >> df <- data.frame(unit = 1:10)
>>> >> df$treated <- FALSE
>>> >> s <- sample(nrow(df), 3)
>>> >> df[s,]$treated <- TRUE
>>> >> df
>>> >    unit treated
>>> > 1     1   FALSE
>>> > 2     2    TRUE
>>> > 3     3   FALSE
>>> > 4     4   FALSE
>>> > 5     5    TRUE
>>> > 6     6   FALSE
>>> > 7     7    TRUE
>>> > 8     8   FALSE
>>> > 9     9   FALSE
>>> > 10   10   FALSE
>>> >
>>> > This is as expected. Now we'll just skip the intermediate step of
>>> > saving the sampled indices, and apply the treatment directly as
>>> > follows.
>>> >
>>> >> df <- data.frame(unit = 1:10)
>>> >> df$treated <- FALSE
>>> >> df[sample(nrow(df), 3),]$treated <- TRUE
>>> >> df
>>> >    unit treated
>>> > 1     6    TRUE
>>> > 2     2   FALSE
>>> > 3     3   FALSE
>>> > 4     9    TRUE
>>> > 5     5   FALSE
>>> > 6     6   FALSE
>>> > 7     7   FALSE
>>> > 8     5    TRUE
>>> > 9     9   FALSE
>>> > 10   10   FALSE
>>> >
>>> > Now the data frame still has 10 rows with 3 assigned to the
>>> > treatment. But the units are garbled. Units 1 and 4 have
>>> > disappeared, for instance, and there are duplicates for 6 and 9,
>>> > one assigned to treatment and the other to control. Why would this
>>> > happen?
>>> >
>>> >
>>> >
>>>
>>
>