[Rd] Subsetting the "ROW"s of an object

Hadley Wickham h.wickham using gmail.com
Fri Jun 8 22:49:23 CEST 2018


Hmmm, yes, there must be some special case in the C code to avoid
recycling a length-1 logical vector:

dims <- c(4, 4, 4, 1e5)

arr <- array(rnorm(prod(dims)), dims)
dim(arr)
#> [1]      4      4      4 100000
i <- c(1, 3)

bench::mark(
  arr[i, TRUE, TRUE, TRUE],
  arr[i, , , ]
)[c("expression", "min", "mean", "max")]
#> # A tibble: 2 x 4
#>   expression                    min     mean      max
#>   <chr>                    <bch:tm> <bch:tm> <bch:tm>
#> 1 arr[i, TRUE, TRUE, TRUE]   41.8ms   43.6ms   46.5ms
#> 2 arr[i, , , ]               41.7ms   43.1ms   46.3ms
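Before reading much into the timings, it is worth confirming that the two spellings return the same object. A minimal base-R sketch, assuming a smaller array than the one benchmarked above (the sizes and the variable names with_true, with_missing and with_full are my own choices for illustration):

```r
# Assumption: a smaller array than in the benchmark above, so this runs quickly.
dims <- c(4, 4, 4, 1000)
arr <- array(rnorm(prod(dims)), dims)
i <- c(1, 3)

with_true    <- arr[i, TRUE, TRUE, TRUE]  # length-1 logical in each trailing slot
with_missing <- arr[i, , , ]              # missing subscripts
# Full-length logical subscripts, i.e. what explicit recycling would produce
with_full    <- arr[i, rep(TRUE, 4), rep(TRUE, 4), rep(TRUE, 1000)]

same_result <- identical(with_true, with_missing) &&
  identical(with_true, with_full)
result_dim <- dim(with_true)
```

This only checks that the results agree; it says nothing about how the C code gets there.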


On Fri, Jun 8, 2018 at 12:31 PM, Berry, Charles <ccberry using ucsd.edu> wrote:
>
>
>> On Jun 8, 2018, at 11:52 AM, Hadley Wickham <h.wickham using gmail.com> wrote:
>>
>> On Fri, Jun 8, 2018 at 11:38 AM, Berry, Charles <ccberry using ucsd.edu> wrote:
>>>
>>>
>>>> On Jun 8, 2018, at 10:37 AM, Hervé Pagès <hpages using fredhutch.org> wrote:
>>>>
>>>> Also the TRUEs cause problems if some dimensions are 0:
>>>>
>>>>> matrix(raw(0), nrow=5, ncol=0)[1:3 , TRUE]
>>>> Error in matrix(raw(0), nrow = 5, ncol = 0)[1:3, TRUE] :
>>>>   (subscript) logical subscript too long
>>>
>>> OK. But this is easy enough to handle.
>>>
>>>>
>>>> H.
>>>>
>>>> On 06/08/2018 10:29 AM, Hadley Wickham wrote:
>>>>> I suspect this will have suboptimal performance since the TRUEs will
>>>>> get recycled. (Maybe there is, or could be, ALTREP, support for
>>>>> recycling)
>>>>> Hadley
>>>
>>>
>>> AFAICS, it is not an issue. Taking
>>>
>>> arr <- array(rnorm(2^22),c(2^10,4,4,4))
>>>
>>> as a test case
>>>
>>> and using a function that will either use the literal code `x[i,,,,drop=FALSE]' or `eval(mc)':
>>>
>>> subset_ROW4 <-
>>>     function(x, i, useLiteral=FALSE)
>>> {
>>>    literal <- quote(x[i,,,,drop=FALSE])
>>>    mc <- quote(x[i])
>>>    nd <- max(1L, length(dim(x)))
>>>    mc[seq(4,length=nd-1L)] <- rep(TRUE, nd-1L)
>>>    mc[["drop"]] <- FALSE
>>>    if (useLiteral)
>>>        eval(literal)
>>>    else
>>>        eval(mc)
>>> }
>>>
>>> I get identical times with
>>>
>>> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),TRUE))
>>>
>>> and with
>>>
>>> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),FALSE))
>>
>> I think that's because you used a relatively low precision timing
>> mechanism, and included the index generation in the timing. I see:
>>
>> arr <- array(rnorm(2^22),c(2^10,4,4,4))
>> i <- seq(1,length = 10, by = 100)
>>
>> bench::mark(
>>  arr[i, TRUE, TRUE, TRUE],
>>  arr[i, , , ]
>> )
>> #> # A tibble: 2 x 1
>> #>   expression        min    mean   median      max  n_gc
>> #>   <chr>         <bch:t> <bch:t> <bch:tm> <bch:tm> <dbl>
>> #> 1 arr[i, TRUE,…   7.4µs  10.9µs  10.66µs   1.22ms     2
>> #> 2 arr[i, , , ]   7.06µs   8.8µs   7.85µs 538.09µs     2
>>
>> So not a huge difference, but it's there.
>
>
> Funny. I get similar results to yours above albeit with smaller differences. Usually < 5 percent.
>
> But with subset_ROW4 I see no consistent difference.
>
> In this example, it runs faster on average using `eval(mc)' to return the result:
>
>> arr <- array(rnorm(2^22),c(2^10,4,4,4))
>> i <- seq(1,length=10,by=100)
>> bench::mark(subset_ROW4(arr,i,FALSE), subset_ROW4(arr,i,TRUE))[,1:8]
> # A tibble: 2 x 8
>   expression                      min     mean   median      max `itr/sec` mem_alloc  n_gc
>   <chr>                      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
> 1 subset_ROW4(arr, i, FALSE)   28.9µs   34.9µs   32.1µs   1.36ms    28686.    5.05KB     5
> 2 subset_ROW4(arr, i, TRUE)    28.9µs     35µs   32.4µs 875.11µs    28572.    5.05KB     5
>>
>
> And on subsequent reps the lead switches back and forth.
>
>
> Chuck
>
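On the zero-extent case Hervé raised: one way to sidestep it is to fill the trailing slots with missing arguments instead of TRUE, so there is no logical subscript to be "too long". A rough sketch, not a proposal for the actual implementation; subset_ROW_missing is a hypothetical name, and it assumes x is either a plain vector or an array with a dim attribute:

```r
# Hypothetical helper: subset along the first dimension, leaving every
# other dimension as a missing subscript rather than TRUE.
subset_ROW_missing <- function(x, i) {
  nd <- length(dim(x))
  if (nd < 2L) return(x[i])  # no dims, or a 1-d array: plain subsetting
  # quote(expr = ) yields the empty symbol, which `[` treats as a
  # missing subscript once it is spliced into the call.
  empty <- rep(list(quote(expr = )), nd - 1L)
  do.call(`[`, c(list(x, i), empty, list(drop = FALSE)))
}

# The zero-column matrix that fails with a TRUE subscript
m <- matrix(raw(0), nrow = 5, ncol = 0)
res_dim <- dim(subset_ROW_missing(m, 1:3))

# Agreement with the literal form on an ordinary 3-d array
arr3 <- array(seq_len(2 * 3 * 4), c(2, 3, 4))
same <- identical(subset_ROW_missing(arr3, 2), arr3[2, , , drop = FALSE])
```

Note that do.call() embeds x itself in the constructed call, so error messages and tracebacks will be noisier than with an eval()-based version like subset_ROW4.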



-- 
http://hadley.nz


