[R] Link prediction in social network with R

Tue Dec 28 14:23:58 CET 2010

Dear Eu,

On Wed, Dec 22, 2010 at 12:00 AM, EU JIN LOK <ejlok1 at hotmail.com> wrote:
>
> Dear R users
>
> I'm a novice user of R and have absolutely no prior knowledge of social network analysis, so apologies if my question is trivial. I've spent alot of time trying to solve this on my own but I really can't so hope someone here can help me out. Cheers!
>
> The dataset:
> I'm trying to predict the existance of links (True or False) in a test set using a training set. Both data sets are in an "edgelist" format, where User IDs represents nodes in both columns with the 1st column directing to the 2nd column (see figure 1 below). Using the AUC to evaluate the performance, I am looking for the best algorithm to predict the existance of links in the test data (50% are true and rest are false).
>
> Figure 1:
>> training
> Vertices: 1133143
> Edges: 999
> Directed: TRUE
> Edges:
>
> [0]       105 ->  850956
> [1]       105 -> 1073420
> [2]       105 -> 1102667
> [3]       165 ->  888346
> [4]       165 ->  579649
> [5]       165 ->  136665
> etc..
>
> I'm having problems obtaining the probability scores for the links / edges as most of the scores are for the nodes. An example of this is the graph.knn and page.rank module in igraph.
>
> So my questions are:
> 1) What do I need to do to obtain the scores for the links instead of the nodes (I presume it must be a data preparation step that I must be missing out)?

In general, most people are interested in the nodes of the network, so
most network indices are node level. If you want edge-level indices,
you can create another graph from yours, by transforming the edges
into vertices and vice-versa. Two vertices are connected in the new
graph, if the corresponding two edges in the old graph share an
incident vertex. However, I am sure that there are some vertex
measures that don't make sense for edges at all, so you need to be
careful with this, especially with the interpretation of the results.

Another possibility is to use the few edge-level indices, e.g. edge
betweenness, or just define analog edge measures for the existing
vertex measures.

> 2) Which R package would be the best for running the various techniques - Jackard index, Adamic-Adar, common neightbours, PropFlow, etc

The first three are implemented in igraph if I remember well.

> 3) How to implement a supervised learning method such as random forest (I am guessing I need to obtain a feature list but again, how can I get the scores for the edges)?

I am not an expert on this, but there are are several R packages for
supervised methods, random forests as well, look around on CRAN.

I hope this helps, Best,
Gabor

> Hope I've explain my questions well but do let me know if more clarification is need.
>
> Thanks in advance
> Eu Jin
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Gabor Csardi <Gabor.Csardi at unil.ch>     UNIL DGM