[R-sig-hpc] Handling data with thousands of variables

Han De Vries handevries at gmail.com
Fri Jul 1 10:02:47 CEST 2011


Perhaps you want to store your data in a big 10 million rows x 20,000
columns matrix, where each cell is 1 when the corresponding keyword
applies to a record and 0 otherwise. Because such a matrix will
contain very many zeroes, it can be stored efficiently as a sparse
matrix (using the Matrix package). The Matrix package itself offers
various tools to quickly summarize by rows or columns, and supports
many other kinds of estimation as long as they can be expressed as
matrix operations (like linear regression). Some other packages, such
as glmnet, accept these sparse matrices directly for more specific
analyses. If you have sufficient memory (you want to keep the entire
sparse matrix in memory), handling the data can be really fast.

Because you're asking about personal experiences: I have been using
this approach with (sparse) matrices up to a few million rows
(records) and 20K columns (variables).

Kind regards,
Han
