[R-sig-hpc] Handling data with thousands of variables

Håvard Wahl Kongsgård haavard.kongsgaard at gmail.com
Sat Jul 2 19:24:26 CEST 2011


Ok, but what about memory usage? For now I have implemented my
analysis in Python with NumPy arrays, with only 100,000 cases and
10,000 keywords, and the memory required for a dense array or matrix
of that size is already massive. In R one possibility is the bigmemory
package, but it is slow and, if I remember correctly, its big.matrix
objects are not supported by most other R packages.

-Håvard


On Fri, Jul 1, 2011 at 10:02 AM, Han De Vries <handevries at gmail.com> wrote:
> Perhaps you want to store your data in a big 10 million rows x 20,000
> columns matrix, where each cell is 1 when the corresponding keyword
> applies to a record and 0 otherwise. Because you will end up with very
> many zeroes, such a matrix can be stored efficiently as a sparse
> matrix (using the Matrix package). The Matrix package itself offers
> various tools to quickly summarize by rows or columns, and supports
> many other kinds of estimation as long as they can be expressed as
> matrix operations (like linear regression). Some other packages, such
> as glmnet, accept these sparse matrices directly for more specific
> analyses. If you have sufficient memory (you want to keep the entire
> sparse matrix in memory), handling the data can be really fast.
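>
> A minimal sketch of that workflow (the rec/kw index vectors and the
> outcome y below are made-up placeholders, just to show the shape of
> the calls):
>
> library(Matrix)
> # rec[k] = record index, kw[k] = keyword index of the k-th
> # (record, keyword) pair extracted from the raw data
> rec <- c(1, 1, 2, 5)
> kw  <- c(3, 7, 3, 1)
> X <- sparseMatrix(i = rec, j = kw, x = 1,
>                   dims = c(20, 8))   # toy size: 20 records x 8 keywords
>
> colSums(X)   # how often each keyword occurs
> rowSums(X)   # how many keywords each record has
>
> # glmnet accepts the sparse matrix directly, so no dense copy is needed
> library(glmnet)
> y <- rep(c(0, 1), 10)                # placeholder outcome, one per record
> fit <- glmnet(X, y, family = "binomial")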
>
> Because you're asking about personal experiences: I have been using
> this approach with (sparse) matrices up to a few million rows
> (records) and 20K columns (variables).
>
> Kind regards,
> Han
>


