[R] Data transformation & cleaning

Wed Sep 28 12:29:34 CEST 2011

On 09/28/2011 01:13 PM, pip56789 wrote:
> Hi,
>
> I have a few methodological and implementation questions for y'all. Thank
> you in advance for your help. I have a dataset that reflects people's
> preference choices. I want to see if there's any kind of clustering effect
> among certain preference choices (e.g. do people who pick choice A also pick
> choice D).
>
> I have a data set that has one record per user ID, per preference choice.
> It's a "long" form of a data set that looks like this:
>
> ID | Page
> 123 | Choice A
> 123 | Choice B
> 456 | Choice A
> 456 | Choice B
> ...
>
> I thought that I should do the following
>
> 1. Make the data set "wide", counting the observations so the data looks
> like this:
> ID | Count of Preference A | Count of Preference B
> 123 | 1 | 1
> ...
>
> Using
> table1<- dcast(data,ID ~ Page,fun.aggregate=length,value.var='Page' )
>
> 2. Create a correlation matrix of preferences
> cor(table1[,-1])
>
> How would I restrict my correlation to show preferences that met a minimum
> sample threshold? Can you confirm if the two following commands do the same
> thing? What would I do from here (or am I taking the wrong approach)?
> table1<- dcast(data,Page ~ Page,fun.aggregate=length,value.var='Page' )
> table2<- with(data, table(Page,Page))
>
>
Hi Peter,
An easy way to visualize set intersections is the intersectDiagram 
function in the plotrix package. This will display the counts or 
percentages of each type of intersection. Your data could be passed like 
this:

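# simulated long-format data: 50 ID/choice pairs, IDs drawn from 1:20
# first column: IDs, second column: choices
choices<-data.frame(IDs=sample(1:20,50,TRUE),
  Page=sample(LETTERS[1:4],50,TRUE))
library(plotrix)
intersectDiagram(choices)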

This example is a bit messy, as it will generate quite a few repeated 
choices that will be ignored by intersectDiagram, but it should give you 
the idea.
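
If you also want numbers rather than a picture, here is a minimal base R sketch of the wide table, the threshold and the correlation you describe. It is only one way to do it. The toy data, the object names (prefs, picked, min_n and so on) and the threshold value are all made up for illustration, and I am reading "minimum sample threshold" as "picked by at least min_n different IDs", which may not be what you meant. Note that crossing Page with itself, as in table(Page,Page), only counts each choice against itself, so the table has to be ID by Page to say anything about co-occurrence.

```r
# Hypothetical toy data in the "long" layout from the question
# (one row per ID per choice); the values are invented for illustration.
prefs <- data.frame(
  ID   = c(123, 123, 456, 456, 789, 789, 789, 321),
  Page = c("Choice A", "Choice B", "Choice A", "Choice B",
           "Choice A", "Choice C", "Choice C", "Choice D")
)

# Wide ID-by-choice matrix of counts, using base R only
wide <- unclass(table(prefs$ID, prefs$Page))

# 0/1 indicator: did this ID pick the choice at least once?
# (this also removes the effect of repeated ID/choice rows)
picked <- (wide > 0) * 1

# Keep only choices picked by at least min_n distinct IDs
# (min_n is an arbitrary threshold chosen for this sketch)
min_n <- 2
keep <- colSums(picked) >= min_n
picked_kept <- picked[, keep, drop = FALSE]

# Pairwise co-occurrence counts: entry [i, j] is the number of IDs
# that picked both choice i and choice j
co_occur <- crossprod(picked_kept)

# Correlation between the 0/1 indicators of the remaining choices
pref_cor <- cor(picked_kept)

print(co_occur)
print(round(pref_cor, 2))
```

The diagonal of co_occur holds the number of IDs per choice, so the same matrix can be used to check the threshold. With only 0/1 indicators the correlation is a fairly blunt summary, so it is worth looking at the raw co-occurrence counts alongside it.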

Jim