[R] data manipulation in R
Patrick Ball
lookout_20005 at yahoo.com
Tue Apr 17 00:27:03 CEST 2001
Wow! great answers from Thomas and Reid. Very helpful
stuff.
To continue the discussion, Thomas rightly pointed out
(1) that the matrix I described is complete, not
lower-triangular as I'd erroneously written; and (2)
you have to iterate over either events or
observations. My solution went over obersvations.
His description below is exactly what we see, events
[A] by observers [B].
> Suppose A is the matrix of zeros & ones, in your
> case
> > A
> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
> [1,] 1 1 1 0 0 0 0
> [2,] 0 1 0 1 1 0 0
> [3,] 0 0 0 1 0 0 0
> [4,] 0 0 0 1 0 1 1
and as soon as I'm back to my home machine I'll test
the solution he suggests:
> We can make an incidence matrix B for the graph
> nevent<-NCOL(A)
> nobs<-NROW(A)
>
> B<-matrix(rowsum(apply(A,2,function(x)
> outer(x,x)),rep(1,nobs*nobs)),ncol=nobs)
>
>
>
> And now find the connected components by powering up
> B
>
> D<-(B>0)
> for(i in 1:ceiling(log(nobs,2))){
> Dnew<-(D%*%D)>0
> if (all(Dnew==D)) break
> D<-Dnew
>
> }
>
> I think we now have D[i,j]==TRUE if i,j are in the
> same circuit.
>
Reid suggested that
>
> Assumming I've correctly understood the problem, the
> algorithm you've
> sketched seems not to take proper account of
> observers (nodes) separated by
> more than one other observer. After the first two
> steps, you have for each i
> the set N(i) of its neighbors (nodes connected to i
> by an edge) and from
> this data it is essentially the same problem as
> before to decide which of
> these sets overlap etc. In other words you need to
> iterate these steps as
> many times as the length of the longest distance in
> the graph... This could
> be made to work but isn't efficient.
but actually this is part of the benefit of SQL. The
data look like this:
dataset pairs
ev. obs.
A B
a 1
a 1
a 1
b 2
b 4
b 5
c 4
d 4
d 6
d 7
# in SQL pseudocode
# get list of all observations
sele dist pairs.A order by 1 into out1
# iterate over observations
for each i in (out1.A)
# get all the events observed by i
sele dist pairs.B where pairs.A=i into tmp1
# get all the observations which observe
# one or more events in tmp1; this is
# all the observations which link to i
sele dist pairs.A where A in (sele * from tmp1) _
into tmp2
#count [i_m]
sele cnt(*) from tmp1 into cnt1
# assign events [i_m] to i's circuit if i_m > i
# do some basic counter management before this
# step: i.circuit has either been set in an
# earlier iteration or is blank and initialized;
# if it was set earlier, we are extending it
# beyond the first hops in the graph
update out1 where out1.A>tmp2.A set _
out1.circuit=i.circuit
next i
The SQL calls are fast b/c there are usually
relatively few (<20) event-observation pairs for any
event or observation. So the solution scales
relatively linearly with the number of events or
observations X the number of links. In my data, ~10^3
- 10^4 events & observations, 1/3 as many pairs, and
~1.25 links/observation. So it's not computationally
horrible.
I will be interested to test the SQL against Thomas' R
solution (and will post results if there is interest),
and I'd love to look at C or C++ code as well. Thanks
again to Reid and Thomas - PB.
__________________________________________________
Do You Yahoo!?
Get email at your own domain with Yahoo! Mail.
http://personal.mail.yahoo.com/
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
More information about the R-help
mailing list