[R] Transforming relational data

Tue Feb 15 20:15:21 CET 2011

Hello. One (of many) solutions might be:

require(data.table)
DT = data.table(read.table(textConnection("    A  B  C
1 1  a  1999
2 1  b  1999
3 1  c  1999
4 1  d  1999
5 2  c  2001
6 2  d  2001"),header=TRUE,stringsAsFactors=FALSE))

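# All ordered pairs of team members per year, each weighted by 1/team size;
# self-pairs are dropped. (by=C works here as each year holds one project.)
firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by=C][Var1!=Var2]
setkey(firststep,Var1,Var2)
# Members of project 3, as listed in the original post
grp3 = c("a","c","d")
# Look up every pair within project 3 and sum the weights
firststep[J(expand.grid(grp3,grp3)),nomatch=0][,sum(v)]
# 2.5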

If I guess the bigger picture correctly, this can be extended
to make a time series of prior familiarity by including
the year in the key.
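
As a rough sketch of that extension (an assumption on my part, not
tested against v1.5.3): it assumes a current data.table release that
keeps character columns, uses a plain filter on the year rather than
a keyed join, and the helper name team_familiarity is made up for
illustration. It computes the prior familiarity of every project in
the full example from the original post:

```r
library(data.table)

# Full example data from the original question (projects 1 to 3).
DT <- data.table(
  A = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  B = c("a", "b", "c", "d", "c", "d", "a", "c", "d"),
  C = c(1999, 1999, 1999, 1999, 2001, 2001, 2004, 2004, 2004)
)

# Edge list: every ordered pair within a project, weighted by 1/team size.
# Grouping by project and year keeps the year alongside each pair.
edges <- DT[, {
  g <- expand.grid(Var1 = B, Var2 = B, stringsAsFactors = FALSE)
  g$v <- 1 / length(B)
  g
}, by = list(A, C)][Var1 != Var2]

# Hypothetical helper: familiarity of project p, counting only pairs
# that co-appeared in projects from strictly earlier years.
team_familiarity <- function(p) {
  members <- DT[A == p, B]
  year    <- DT[A == p, C][1]
  edges[C < year & Var1 %in% members & Var2 %in% members, sum(v)]
}

# One value per project, in the order of unique(DT$A).
fam <- sapply(unique(DT$A), team_familiarity)
print(fam)
```

For the example data this gives 0 for project 1 (no history), 0.5
for project 2 and 2.5 for project 3.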

If you decide to try this, please make sure to grab the latest
version of data.table from CRAN (v1.5.3). I suggest that you
run it first to confirm it does return 2.5, then break it down
and run it step by step to see how each part works. You will
need some time to read the vignettes and ?data.table (which has
recently been improved), but I hope you think it is worth it.
Support is available at maintainer("data.table").

HTH
Matthew

On Mon, 14 Feb 2011 09:22:12 -0800, mathijsdevaan wrote:
> Hi,
> 
> I have a large dataset with info on individuals (B) that have been
> involved in projects (A) during multiple years (C). The dataset contains
> three columns: A, B, C. Example:
>    
>    A  B  C
> 1 1  a  1999
> 2 1  b  1999
> 3 1  c  1999
> 4 1  d  1999
> 5 2  c  2001
> 6 2  d  2001
> 7 3  a  2004
> 8 3  c  2004
> 9 3  d  2004
> 
> I am interested in how well all the individuals in a project know each
> other. To calculate this team familiarity measure I want to sum the
> familiarity between all individual pairs in a team. The familiarity
> between each individual pair in a team is calculated as the summation of
> each pair's prior co-appearance in a project divided by the total number
> of team members. So the team familiarity in project 3 = (1/4+1/4) +
> (1/4+1/4+1/2) + (1/4+1/4+1/2) = 2,5 or a has been in project 1 (of size
> 4) with c and d > 1/4+1/4 and c has been in project 1 (of size 4) with a
> and d > 1/4+1/4 and c has been in project 2 (of size 2) with d > 1/2.
> 
> I think that the best way to do it is to transform the data into an
> edgelist (each pair in one row/two columns) and then creating two
> additional columns for the strength of the familiarity and the year of
> the project in which the pair was active. The problem is that I am stuck
> already in the first step. So the question is: how do I go from the
> current data structure to a list of projects and the familiarity of its
> team members?
> 
> Your help is very much appreciated. Thanks!