[R] Big data and column correspondence problem

Tue Jul 26 09:42:23 CEST 2011

For question (a), do:

which(AA%in%BB)
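
To make explicit what that call returns, here is a small sketch using the
vectors from your message. The names in_BB, idx and dummy are mine, chosen
only for illustration, and the 0/1 column covers part (a) only:

```r
AA <- c(4, 4, 4, 2, 2, 6, 8, 9)
BB <- c(2, 2, 4, 6, 6)

in_BB <- AA %in% BB         # logical vector, one entry per element of AA
idx   <- which(in_BB)       # positions in AA whose value also occurs in BB
dummy <- as.integer(in_BB)  # 0/1 indicator for part (a) only

print(idx)
print(dummy)
```

Because %in% is vectorized, this step needs no loop, which should matter
with a few million rows.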

Question (b) is very ambiguous to me. It makes little sense for your example
because all values of BB are in AA. Therefore I am wondering whether you
meant in question (a) that you want to find all values in BB that are in AA.
That's not the same thing. I am also not sure what exactly you mean by
"within the lines of B that correspond to the values of AA". If you mean "all
the lines of B for which AA is in BB", then you get that by:

B[which(AA%in%BB) , ]

However, this gives an error because which(AA%in%BB) returns positions in AA,
and more values of AA are found in BB (six) than B has rows (five). This leads
me to believe that you might want

which(BB%in%AA) 

for question (a). In this case you would get the lines of B by

B[which(BB%in%AA) , ]

which in this example are all rows of B.
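
The difference between the two calls can be checked with the example data.
This is only a sketch; idx_a, idx_b and res are names I made up, and try()
is there just so the script keeps running past the error:

```r
AA <- c(4, 4, 4, 2, 2, 6, 8, 9)
BB <- c(2, 2, 4, 6, 6)
B1 <- c(5, 11, 7, 13, NA)
B2 <- c(3, 12, 11, NA, NA)
B3 <- c(12, 13, NA, NA, NA)
B  <- cbind(BB, B1, B2, B3)

idx_a <- which(AA %in% BB)  # positions in AA, not row numbers of B
res   <- try(B[idx_a, ], silent = TRUE)  # fails: subscript out of bounds
print(inherits(res, "try-error"))

idx_b <- which(BB %in% AA)  # positions in BB, i.e. valid row numbers of B
print(B[idx_b, ])
```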

Again, part (b) is very opaque to me. It would help if you described, step
by step, what it should do and what the outcome of every step along the way
should be. Just from the final result that it should produce and your
description, I cannot make sense of it. But maybe another helper can.
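
To show where I get stuck, here is a sketch of one literal reading of (b):
for each row of A, pool the non-missing B1, B2 and B3 values from all rows of
B whose BB equals that row's AA, and flag the row if both A1 and A2 are in
that pool. This is only my guess at the intent, and the names rows, pool, key
and dummy are mine. I also assume the first column of A was meant to be AA.

```r
AA <- c(4, 4, 4, 2, 2, 6, 8, 9)
A1 <- c(3, 3, 5, 5, 5, 7, 11, 12)
A2 <- c(3, 3, 5, 5, 5, 7, 11, 12)
A  <- cbind(AA, A1, A2)  # assumption: AA, not A, was intended here

BB <- c(2, 2, 4, 6, 6)
B1 <- c(5, 11, 7, 13, NA)
B2 <- c(3, 12, 11, NA, NA)
B3 <- c(12, 13, NA, NA, NA)
B  <- cbind(BB, B1, B2, B3)

# Row numbers of B grouped by BB value; list names are "2", "4", "6"
rows <- split(seq_len(nrow(B)), B[, "BB"])

# For each BB value, all non-missing B1..B3 entries from its rows
pool <- lapply(rows, function(i) {
  v <- as.vector(B[i, c("B1", "B2", "B3"), drop = FALSE])
  v[!is.na(v)]
})

# Flag a row of A if AA occurs in BB and both A1 and A2 are in that pool
key   <- as.character(A[, "AA"])
dummy <- as.integer(mapply(function(k, a1, a2) {
  k %in% names(pool) && all(c(a1, a2) %in% pool[[k]])
}, key, A[, "A1"], A[, "A2"]))

print(dummy)
```

For the example this gives 0 0 0 1 1 0 0 0, which differs from your expected
result in the third position, so this reading is probably not what you meant.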

Daniel

murilofm wrote:
> 
> Greetings,
> 
> I've been struggling for some time with a problem concerning a big
> database that I have to deal with.
> I'll try to illustrate my problem with an example, since the database is
> really big.
> Suppose I have the following data:
> 
> AA = c(4,4,4,2,2,6,8,9)
> A1 = c(3,3,5,5,5,7,11,12)
> A2 = c(3,3,5,5,5,7,11,12)
> A = cbind(AA, A1, A2)
> 
> BB = c(2,2,4,6,6)
> B1 =c(5,11,7,13,NA)
> B2 =c(3,12,11,NA,NA)
> B3 =c(12,13,NA,NA,NA)
> 
> B=cbind(BB,B1,B2,B3)
> 
> I have to do the following:
> 
> 1. Create a dummy (binary) variable in a new column of A that indicates
> if, for each row:
> a) the value from the column AA can be found in BB
> b) within the lines of B that correspond to the value of AA, I can find
> both A1 and A2 in B1, B2 or B3.
> 
> In this example I would have
> [0,0,1,1,1,0,0,0]
> 
> I have been able to do it with some loops; the problem is that, since in
> the original data A has 2.936.044 lines and B has 14.965, it's taking
> forever to finish (probably because I am doing it the wrong way).
> 
> I would really appreciate any help or advice on how to deal with this.
> Thanks!
> 

--
View this message in context: http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3695065.html
Sent from the R help mailing list archive at Nabble.com.