[R] pairwise cross tabulation tables

AndyZon andyzon at gmail.com
Thu Jan 10 22:15:04 CET 2008


Thank you so much, Chuck!

This is brilliant, I just tried some dichotomous variables, it was really
fast.

Most categorical variables I am interested in are 3 levels, they are
actually SNPs, I want to look at their interactions. My question is: after
generating 0-1 codings, like 00, 01, 10, how should I use "crossprod()"?
Should I just apply this function on these 2*n columns (originally I have n
variables), and then operate on the generated cell counts?  I am confused
about this.

Your input will be greatly appreciated.

Andy



Charles C. Berry wrote:
> 
> On Wed, 9 Jan 2008, AndyZon wrote:
> 
>>
>> Hi,
>>
>> I have a huge number of categorical variables, say at least 10000, and I
>> put
>> them into a matrix, each column is one variable. The question is: how can
>> I
>> make all of the pairwise cross tabulation tables efficiently? The
>> straightforward solution is to use for-loops by looping two indexes on
>> the
>> table() function, but it was just too slow. Is there a more efficient way
>> to
>> do that? Any guidance will be greatly appreciated.
> 
> The totals are merely the crossproducts of a suitably constructed binary 
> (zero-one) matrix is used to encode the categories. See '?contr.treatment' 
> if you cannot grok 'suitably constructed'.
> 
> If the categories are all dichotomies coded as 0:1, you can use
> 
>  	res <- crossprod( dat )
> 
> to find the totals for the (1,1) cells
> 
> If you need the full tables, you can get them from the marginal totals 
> using
> 
>  	diag( res )
> 
> to get the number in each '1' margin and
> 
>  	nrow(dat)
> 
> to get the table total from which the numbers in each '0' margin by 
> subtracting the corresponding '1' margin.
> 
> With dichotomous variables, dat has 10000 columns and you will only need 
> 10000^2 integers or about 0.75 Gigabytes to store the 'res'. And it takes 
> about 20 seconds to run 1000 rows on my MacBook. Of course, 'res' has a 
> redundant triangle
> 
> This approach generalizes to any number of categories:
> 
> To extend this to more than two categories, you will need to do for each 
> such column what model.matrix(~factor( dat[,i] ) ) does by default
> ( using 'contr.treatment' ) - construct zero-one codes for all but one 
> (reference) category.
> 
> Note that with 10000 trichotomies, you will have a result with
> 
>  	10000^2 * ( 3-1 )^2
> 
> integers needing about 3 Gigabytes, and so on.
> 
> HTH,
> 
> Chuck
> 
> p.s. Why on Earth are you doing this????
> 
> 
>>
>> Andy
>> -- 
>> View this message in context:
>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14723520.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> Charles C. Berry                            (858) 534-2098
>                                              Dept of Family/Preventive
> Medicine
> E mailto:cberry at tajo.ucsd.edu	            UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 

-- 
View this message in context: http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14744086.html
Sent from the R help mailing list archive at Nabble.com.




More information about the R-help mailing list