[R] How can I eliminate a loop over a data.table?
Matteo Richiardi
matteo.richiardi at unito.it
Tue Mar 19 01:23:58 CET 2013
I have two data.tables, as shown below:
***
library(data.table)
N = 10
A.DT <- data.table(a1 = c(rnorm(N,0,1)), a2 = NA)
B.DT <- data.table(b1 = c(rnorm(N,0,1)), b2 = 1:N)
setkey(A.DT,a1)
setkey(B.DT,b1)
***
I tried to change my previous data.frame implementation to a
data.table implementation by changing the for-loop as shown below:
***
for (i in 1:nrow(B.DT)) {
  for (j in nrow(A.DT):1) {
    if (B.DT[i,b2] <= N/2
        && B.DT[i,b1] < A.DT[j,a1]) {
      A.DT[j,]$a2 <- B.DT[i,]$b1
      break
    }
  }
}
***
I get the following error message:
***
Error in `[<-.data.table`(`*tmp*`, j, a2, value = -0.391987468746123) :
object "a2" not found
***
I think the way I access the data.table is not quite right, as I am
new to it. I guess there is a quicker way of doing this than cycling
up and down the two data.tables.
I'd like to know if the loop shown above could be simplified/vectorised.
The data.table data for copy/paste reads:
***
# A.DT
a1 a2
1 -1.4917779 NA
2 -1.0731161 NA
3 -0.7533091 NA
4 -0.3673273 NA
5 -0.159569 NA
6 -0.1551948 NA
7 -0.0430574 NA
8 0.1783496 NA
9 0.4276034 NA
10 1.0697412 NA
# B.DT
b1 b2
1 0.64229018 1
2 1.00527902 2
3 0.24746294 3
4 -0.50288835 4
5 0.34447791 5
6 -0.22205129 6
7 0.60099079 7
8 -0.70242284 8
9 0.6298599 9
10 0.08917988 10
***
The output I expect is:
***
# OUTPUT
a1 a2
1 -1.4917779 NA
2 -1.0731161 NA
3 -0.7533091 NA
4 -0.3673273 NA
5 -0.159569 NA
6 -0.1551948 NA
7 -0.0430574 NA
8 0.1783496 -0.50288835
9 0.4276034 0.24746294
10 1.0697412 0.64229018
***
The algorithm goes down one table and, for each row, goes up the other
table, checks some conditions and modifies values accordingly. More
specifically, it goes down B.DT and, for each row of B.DT, goes up A.DT
from the last row and assigns b1 to a2 in the first row it meets where
b1 is smaller than a1. An additional condition is checked before
assignment (b2 being less than or equal to 5, i.e. N/2, in this example).
0.64229018 is the first value in B.DT, and it is assigned to the last
unit of A.DT.
1.00527902 is the second value in B.DT, but it is left unassigned
because it is bigger than all other values in A.DT.
0.24746294 is the third value in B.DT, and it is assigned to the
second last unit in A.DT.
-0.50288835 is the fourth value in B.DT, and it is assigned to unit #8
in A.DT.
0.34447791 is the fifth value in B.DT, and it is left unassigned
because it is too big.
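To make the intended behaviour concrete, here is a minimal sketch in
base R on plain vectors (not data.table) that reproduces the expected
output for the data above. It rests on two assumptions that the loop I
posted does not implement: a row of A.DT that already holds a value is
skipped, and the rows of B.DT are visited in the order printed above
rather than sorted by b1.

```r
# Data copied from the A.DT and B.DT tables in this post.
a1 <- c(-1.4917779, -1.0731161, -0.7533091, -0.3673273, -0.159569,
        -0.1551948, -0.0430574, 0.1783496, 0.4276034, 1.0697412)
b1 <- c(0.64229018, 1.00527902, 0.24746294, -0.50288835, 0.34447791,
        -0.22205129, 0.60099079, -0.70242284, 0.6298599, 0.08917988)
b2 <- 1:10
N  <- length(a1)

# a2 starts as numeric NA so that doubles can be stored in it.
a2 <- rep(NA_real_, N)

for (i in seq_along(b1)) {
  if (b2[i] > N / 2) next            # additional condition on b2
  for (j in N:1) {                   # go up A from the last row
    # Assumption: a row of A that already holds a value is skipped.
    if (is.na(a2[j]) && b1[i] < a1[j]) {
      a2[j] <- b1[i]
      break
    }
  }
}

result <- data.frame(a1 = a1, a2 = a2)
print(result)
```

With this data, only rows 8, 9 and 10 of a2 are filled, as in the
expected output; this is still the double loop I would like to get rid
of.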
This is of course a simplified problem (and therefore may not make
much sense). Thanks for your time and input.
Matteo