[R-sig-hpc] Handling data with thousands of variables

Sat Jul 2 20:37:26 CEST 2011

Indeed, if you stored this as a regular 100M x 20K matrix you
would run into memory problems. However, the sparse matrix storage
format might reduce its size by a factor of 10, 100 or more. The exact
amount will depend on the sparseness of your data, i.e. the average
number of keywords per record. So, if you have 20K unique
keywords, with an average of 10 keywords per record, each row of the
matrix would contain 19990 zeroes on average. These zeroes don't use
memory in a sparse matrix format, because only the non-zero cells and
their row/column coordinates are stored.
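
To put rough numbers on this, here is a back-of-envelope sketch in plain
Python. It is only an estimate under stated assumptions: 8-byte doubles
for values, 4-byte integers for indices, and a compressed-column layout
(one value and one row index per non-zero cell, plus one pointer per
column). The function names are made up for illustration, and real
objects carry some extra overhead.

```python
# Back-of-envelope memory estimate: dense vs. compressed sparse column storage.
# Assumed sizes (not measured): 8-byte doubles for values, 4-byte ints for indices.

def dense_bytes(n_rows, n_cols, value_bytes=8):
    """Bytes needed to store every cell, zeroes included."""
    return n_rows * n_cols * value_bytes

def sparse_csc_bytes(n_cols, nnz, value_bytes=8, index_bytes=4):
    """Bytes for a compressed-column layout: one value and one row index
    per non-zero cell, plus one column pointer per column (and one extra)."""
    return nnz * (value_bytes + index_bytes) + (n_cols + 1) * index_bytes

n_rows = 100_000_000       # 100M records
n_cols = 20_000            # 20K unique keywords
keywords_per_record = 10   # average number of non-zero cells per row

nnz = n_rows * keywords_per_record   # total number of non-zero cells
dense = dense_bytes(n_rows, n_cols)
sparse = sparse_csc_bytes(n_cols, nnz)

print(f"zeroes per row: {n_cols - keywords_per_record}")
print(f"dense:  {dense / 1e12:.1f} TB")
print(f"sparse: {sparse / 1e9:.1f} GB")
print(f"ratio:  {dense / sparse:.0f}x")
```

Under these assumptions the dense matrix needs about 16 TB and the
sparse one about 12 GB. If the cells only record presence or absence (a
pattern matrix with no stored values), passing value_bytes=0 shrinks the
estimate further, to roughly 4 GB.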

2011/7/2 Håvard Wahl Kongsgård <haavard.kongsgaard at gmail.com>:
> Ok, but what about memory usage? For now I have implemented my
> analysis in python with numpy arrays with only 100 000 cases and 10
> 000 keywords.
> But the memory required for large arrays and matrices is massive. In R
> one possibility is the bigmemory package,
> but it's slow and if I remember correctly the bigmemory matrix is not
> supported by other R packages.
>
> -Håvard
>
>
> On Fri, Jul 1, 2011 at 10:02 AM, Han De Vries <handevries at gmail.com> wrote:
>> Perhaps you want to store your data in a big 10 million rows x 20000
>> columns matrix, where each cell is 1 when the corresponding keyword
>> applies to a record, zero otherwise. Because you will end up with very
>> many zeroes, such a matrix can be stored as a sparse matrix in an
>> efficient way (using the Matrix package). The Matrix package itself
>> offers various analytical tools to quickly summarize by rows or
>> columns, and many other types of estimations as long as they can be
>> translated to matrix operations (like linear regression). Some other
>> packages, such as glmnet, can read these matrices directly for more
>> specific analyses. If you have sufficient memory (you want to keep the
>> entire sparse matrix in memory), handling the data can be really fast.
>>
>> Because you're asking about personal experiences: I have been using
>> this approach with (sparse) matrices up to a few million rows
>> (records) and 20K columns (variables).
>>
>> Kind regards,
>> Han