[R-sig-hpc] Handling data with thousands of variables
Håvard Wahl Kongsgård
haavard.kongsgaard at gmail.com
Sat Jul 2 21:31:46 CEST 2011
OK, thanks, I will try using a sparse matrix or a boolean array.
Hopefully I will be able to convert to 1 and -1 on the fly (I think
most SVM implementations use that format for binary data).
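
A minimal sketch of the on-the-fly conversion, assuming each record is stored only as the set of keyword indices that apply to it. This is a pure-Python stand-in for a sparse boolean matrix; in practice scipy.sparse or R's Matrix package would probably take that role. The names `records`, `n_keywords`, `dense_pm1_row` and `iter_pm1_rows` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List


def dense_pm1_row(keyword_ids: Iterable[int], n_keywords: int) -> List[int]:
    """Expand one sparsely stored record into a dense row of +1 / -1 values."""
    row = [-1] * n_keywords      # keywords that do not apply become -1
    for j in keyword_ids:
        row[j] = 1               # keywords that apply become +1
    return row


def iter_pm1_rows(records: Iterable[Iterable[int]],
                  n_keywords: int) -> Iterator[List[int]]:
    """Yield dense +1/-1 rows one at a time, so only one row is dense at once."""
    for keyword_ids in records:
        yield dense_pm1_row(keyword_ids, n_keywords)


# Hypothetical toy data: 3 records over 6 keywords, stored sparsely.
records = [{0, 4}, set(), {1, 2, 5}]
n_keywords = 6

for row in iter_pm1_rows(records, n_keywords):
    print(row)
```

Because the rows are generated lazily, the full dense +1/-1 matrix never has to exist in memory; whether a given SVM implementation can consume rows this way is a separate question.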
2011/7/2 Han De Vries <handevries at gmail.com>:
> Indeed, if you stored this as a regular 100M x 20K matrix you
> would run into memory problems. However, the sparse matrix storage
> format might reduce its size by a factor of 10, 100 or more. The exact
> amount will depend on the sparseness of your data, i.e. the average
> number of keywords per record. So, if you have 20K unique
> keywords, with an average of 10 keywords per record, each row of the
> matrix would contain 19990 zeroes on average. These zeroes don't use
> memory in a sparse matrix format, because only the non-zero cells and
> their row/column coordinates are stored.
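
As a rough check of these figures, the sketch below compares dense and sparse storage for the 100M x 20K example with 10 keywords per record. It assumes 8-byte double values and 4-byte integer indices in a compressed sparse column layout (one value and one row index per non-zero cell, plus one pointer per column). That layout is an assumption about the storage format, not something stated in the thread, and the function names are hypothetical.

```python
def dense_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Memory for a dense matrix: every cell is stored."""
    return n_rows * n_cols * bytes_per_value


def sparse_csc_bytes(n_rows: int, n_cols: int, nnz_per_row: int,
                     bytes_per_value: int = 8, bytes_per_index: int = 4) -> int:
    """Memory for an assumed compressed-sparse-column layout."""
    nnz = n_rows * nnz_per_row                       # total non-zero cells
    cells = nnz * (bytes_per_value + bytes_per_index)  # value + row index each
    col_pointers = (n_cols + 1) * bytes_per_index      # one pointer per column
    return cells + col_pointers


n_rows, n_cols, nnz_per_row = 100_000_000, 20_000, 10

dense = dense_bytes(n_rows, n_cols)
sparse = sparse_csc_bytes(n_rows, n_cols, nnz_per_row)

print(f"dense:  {dense / 1e12:.1f} TB")
print(f"sparse: {sparse / 1e9:.1f} GB")
print(f"ratio:  {dense / sparse:.0f}x")
```

Under these assumptions the dense matrix needs about 16 TB and the sparse one about 12 GB, a reduction of roughly 1300x. A denser data set (more keywords per record) would shrink that ratio proportionally.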
> 2011/7/2 Håvard Wahl Kongsgård <haavard.kongsgaard at gmail.com>:
>> OK, but what about memory usage? For now I have implemented my
>> analysis in Python with NumPy arrays, with only 100,000 cases and
>> 10,000 keywords.
>> But the memory required for large arrays and matrices is massive. In R
>> one possibility is the bigmemory package,
>> but it's slow and, if I remember correctly, bigmemory matrices are not
>> supported by other R packages.
>> On Fri, Jul 1, 2011 at 10:02 AM, Han De Vries <handevries at gmail.com> wrote:
>>> Perhaps you want to store your data in a big 10 million rows x 20000
>>> columns matrix, where each cell is 1 when the corresponding keyword
>>> applies to a record, zero otherwise. Because you will end up with very
>>> many zeroes, such a matrix can be stored as a sparse matrix in an
>>> efficient way (using the Matrix package). The Matrix package itself
>>> offers various analytical tools to quickly summarize by rows or
>>> columns, and many other types of estimations as long as they can be
>>> translated to matrix operations (like linear regression). Some other
>>> packages, such as glmnet, can read these matrices directly for more
>>> specific analyses. If you have sufficient memory (you want to keep the
>>> entire sparse matrix in memory), handling the data can be really fast.
>>> Because you're asking about personal experiences: I have been using
>>> this approach with (sparse) matrices up to a few million rows
>>> (records) and 20K columns (variables).
>>> Kind regards,