[R] Efficiency Question - Nested lapply or nested for loop

Gabor Grothendieck ggrothendieck at gmail.com
Sat Oct 9 00:28:44 CEST 2010


On Fri, Oct 8, 2010 at 12:47 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Fri, Oct 8, 2010 at 11:35 AM, epowell <EPowell1 at med.miami.edu> wrote:
>>
>> My data looks like this:
>>
>>> data
>>  name G_hat_0_0 G_hat_1_0 G_hat_2_0 G_0 G_hat_0_1 G_hat_1_1 G_hat_2_1 G_1
>> 1  rs0  0.488000  0.448625  0.063375   1  0.480875  0.454500  0.064625   1
>> 2  rs1  0.002375  0.955375  0.042250   1  0.000000  0.062875  0.937125   2
>> 3  rs2  0.050375  0.835875  0.113750   1  0.877250  0.115875  0.006875   0
>> 4  rs3  0.000000  0.074750  0.925250   2  0.897750  0.102000  0.000250   0
>> 5  rs4  0.000125  0.052375  0.947500   2  0.261500  0.724125  0.014375   1
>> 6  rs5  0.003750  0.092125  0.904125   2  0.023000  0.738125  0.238875   1
>>
>> And my task is:
>> For each individual (X) on each row, to find the index corresponding to the
>> max of G_hat_0_X, G_hat_1_X, G_hat_2_X and then increment the cell of the
>> confusion matrix with the row corresponding to that index and the column
>> corresponding to G_X.
>>
>> For example, in the first row and the first individual, the index with the
>> max value (0.488000) is 0 and the G_0 value is 1, so I would increment the
>> matrix cell in the first row and second column. (Note that the indices in
>> the data start at 0 while the matrix rows and columns start at 1, so they
>> are off by one.  That is accounted for in the code.)
>>
>> In reality the data will be much bigger, containing 10000 rows and a
>> variable number of columns (inds) between 10 and 500.
>>
>> The correct result is:
>>
>>> cmat
>>        tru_rr tru_rv tru_vv
>> call_rr      2      2      0
>> call_rv      0      4      0
>> call_vv      0      0      4
>>
>
> If we reshape data into a 3d array, arr, it can be vectorized like this
> where the two args of table correspond to Gmax and Gtru:
>
> arr <- array(t(data[-1]), c(4, 2, 6))
> table(apply(arr[-4,,], 2:3, which.max), arr[4,,] + 1)

Two further improvements: we can replace the array, arr, with a
matrix, mat, and we can add dimension names in the table() call:

mat <- matrix(t(data[-1]), 4)
table(Gmax = apply(mat[-4,], 2, which.max), Gtru = mat[4,] + 1)
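
For anyone who wants to run this end to end, here is a minimal
self-contained sketch. It rebuilds the six example rows from the
original post and applies the matrix approach above. The factor()
wrappers and the rr/rv/vv labels (taken from the cmat printout in the
question) are an assumption added here, not part of the code above:
they keep the table 3 x 3 even when one of the classes never occurs.

```r
# Example data copied from the original post (6 SNPs, 2 individuals).
data <- data.frame(
  name      = paste0("rs", 0:5),
  G_hat_0_0 = c(0.488000, 0.002375, 0.050375, 0.000000, 0.000125, 0.003750),
  G_hat_1_0 = c(0.448625, 0.955375, 0.835875, 0.074750, 0.052375, 0.092125),
  G_hat_2_0 = c(0.063375, 0.042250, 0.113750, 0.925250, 0.947500, 0.904125),
  G_0       = c(1, 1, 1, 2, 2, 2),
  G_hat_0_1 = c(0.480875, 0.000000, 0.877250, 0.897750, 0.261500, 0.023000),
  G_hat_1_1 = c(0.454500, 0.062875, 0.115875, 0.102000, 0.724125, 0.738125),
  G_hat_2_1 = c(0.064625, 0.937125, 0.006875, 0.000250, 0.014375, 0.238875),
  G_1       = c(1, 2, 0, 0, 1, 1)
)

# Each column of mat is one (SNP, individual) pair:
# rows 1-3 hold the three G_hat values, row 4 holds the true genotype G.
mat <- matrix(t(data[-1]), nrow = 4)

# Assumed labels, taken from the cmat printout in the question.
labs <- c("rr", "rv", "vv")

# Fixed factor levels keep empty classes in the table (an addition here).
Gmax <- factor(apply(mat[-4, ], 2, which.max),
               levels = 1:3, labels = paste0("call_", labs))
Gtru <- factor(mat[4, ] + 1,
               levels = 1:3, labels = paste0("tru_", labs))

cmat <- table(Gmax, Gtru)
print(cmat)
```

On this example data the counts should match the cmat shown in the
original question. The hard-coded 4 assumes every individual
contributes exactly four consecutive columns (three G_hat values
followed by G), as in the data above.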



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


