[R] bigmemory - extracting submatrix from big.matrix object

Wed Jun 3 15:11:37 CEST 2009

Utkarsh,

Thanks again for the feedback and suggestions on bigmemory.

A follow-up on counting NAs: we have exposed a new function, colna(),
to the user in the upcoming release 3.7.  Of course, mwhich() can still be
helpful.

As for the last topic, applying an arbitrary function to the columns of a
big.matrix object: once you peel away the shell, a big.matrix column
is identical to an R matrix column (or vector).  A pointer, a length,
and knowledge of the type are sufficient.  Because we (ideally) want to
support our current 4 types (and hopefully add complex and maybe
more, soon), we rely on C++ template functions for the summaries we
have implemented to date.  But yes, looking at our implementation
of colmean(), for example, would be a good place to start.
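
To make the "pointer, length, and type" idea concrete, here is a minimal
sketch of a templated column summary.  It is not the package's actual
colmean() source: the name column_mean is hypothetical, NA handling is left
out, and the element types in the comments are only examples.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper, NOT bigmemory's colmean(): the mean of one column,
// given only a pointer to its first element and its length.
// T is the storage type of the column (for example int or double; the
// concrete set of supported types is an assumption here).
// Missing values are not treated specially in this sketch.
template <typename T>
double column_mean(const T* col, std::size_t n) {
    assert(col != nullptr || n == 0);
    if (n == 0) {
        // An empty column has no mean; report NaN rather than divide by zero.
        return std::numeric_limits<double>::quiet_NaN();
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<double>(col[i]);  // accumulate in double for every T
    }
    return sum / static_cast<double>(n);
}
```

The same function body serves every element type, which is the reason for
using a template: the compiler generates one version per type, and the
caller only has to dispatch on the type of the matrix.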

Keep in mind that there are differences between big.matrix objects
and R internals.  bigmemory indexes everything using longs instead of
integers (and uses numerics when passing indices between R and
C/C++).  So simply using an existing R function (or the C function under
the hood of R) would be limiting not only with respect to the various
types of big.matrix objects, but also with respect to the size.  On 64-bit
R platforms, there is no practical limit to the size of a filebacked big.matrix
(other than your disk space or filesystem limitations).  But R won't handle
vectors with more than 2^31 - 1 (about 2 billion) elements, even if you
have the RAM to support such beasts.  Operating on chunks within R is of
course another possibility.
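
The size limit and the chunking idea can be sketched as follows.  Both
function names are hypothetical and this is not code from the package; it
only illustrates why 64-bit indices matter and how a large index range can
be cut into pieces that each fit a 32-bit index.  (R numerics are doubles,
which represent integers exactly up to 2^53, so they can carry such indices
without loss.)

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical check: can n elements be addressed with a 32-bit signed index?
bool fits_int32_index(std::int64_t n) {
    return n >= 0 && n <= std::numeric_limits<std::int32_t>::max();
}

// Hypothetical helper: split [0, total) into consecutive half-open ranges
// [begin, end), each holding at most `chunk` elements.
std::vector<std::pair<std::int64_t, std::int64_t>>
chunk_ranges(std::int64_t total, std::int64_t chunk) {
    assert(total >= 0 && chunk > 0);
    std::vector<std::pair<std::int64_t, std::int64_t>> out;
    std::int64_t begin = 0;
    while (begin < total) {
        // Compare against the remaining count so begin + chunk cannot overflow.
        std::int64_t end = (total - begin < chunk) ? total : begin + chunk;
        out.emplace_back(begin, end);
        begin = end;
    }
    return out;
}
```

For example, a matrix with 50,000,000 rows and 100 columns has 5e9 elements,
which fails fits_int32_index(), while each single column of 50,000,000
elements passes and could be pulled into R one at a time.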

Further discussion of development ideas would be great, but should
probably be moved offline or over to R-devel.  As always, we appreciate
feedback, complaints, bug reports, etc.

Thanks,

Jay

On Wed, Jun 3, 2009 at 3:16 AM, utkarshsinghal
<utkarsh.singhal at global-analytics.com> wrote:
> Thanks for the really valuable inputs, developing the package and updating
> it regularly. I will be glad if I can contribute in any way.
>
> In problem three, however, I am interested in knowing a generic way to apply
> any function on columns of a big.matrix object (obviously without loading
> the data into R). Maybe the source code of the function "colmean" can help,
> if that is not too much to ask for. Or if we can develop a function similar
> to "apply" of the base R.
>
>
> Regards
> Utkarsh
>
>
>
>
> Jay Emerson wrote:
>>
>> We also have ColCountNA(), which is not currently exposed to the user
>> but will be in the next version.
>>
>> Jay
>>
>> On Tue, Jun 2, 2009 at 2:08 PM, Jay Emerson <jayemerson at gmail.com> wrote:
>>
>>>
>>> Thanks for trying this out.
>>>
>>> Problem 1.  We'll check this.  Options should certainly be available.
>>>  Thanks!
>>>
>>> Problem 2. Fascinating.  We just (yesterday) implemented a
>>> sub.big.matrix() function doing exactly
>>> this, creating something that is a big matrix but which just
>>> references a contiguous subset of the
>>> original matrix.  This will be available in an upcoming version
>>> (hopefully in the next week).  A more
>>> specialized function would create an entirely new big.matrix from a
>>> subset of a first big.matrix,
>>> making an actual copy, but this is something else altogether. You
>>> could do this entirely within R
>>> without much work, by the way, and only 2* memory overhead.
>>>
>>> Problem 3. You can count missing values using mwhich().  For other
>>> exploration (e.g. skewness)
>>> at the moment you should just extract a single column (variable)  at a
>>> time into R, study it, then get the
>>> next column, etc... .  We will not be implementing all of R's
>>> functions directly with big.matrix objects.
>>> We will be creating a new package "bigmemoryAnalytics" and would
>>> welcome contributions to the
>>> package.
>>>
>>> Feel free to email us directly with bugs, questions, etc...
>>>
>>> Cheers,
>>>
>>> Jay
>>>
>>>
>>> ----------------------------------------------------------
>>>
>>> From: utkarshsinghal <utkarsh.singhal at global-analytics.com>
>>> Date: Tue, Jun 2, 2009 at 8:25 AM
>>> Subject: [R] bigmemory - extracting submatrix from big.matrix object
>>> To: r help <r-help at r-project.org>
>>> I am using the library(bigmemory) to handle large datasets, say 1 GB,
>>> and facing following problems. Any hints from anybody can be helpful.
>>> _Problem-1:_
>>> I am using "read.big.matrix" function  to create a filebacked big
>>> matrix of my data and get the following warning:
>>>
>>>>
>>>> x = read.big.matrix("/home/utkarsh.s/data.csv", header=T, type="double",
>>>>                     shared=T, backingfile = "backup",
>>>>                     backingpath = "/home/utkarsh.s")
>>>>
>>>
>>> Warning message:
>>> In filebacked.big.matrix(nrow = numRows, ncol = numCols, type = type,  :
>>>  A descriptor file has not been specified.  A descriptor named
>>> backup.desc will be created.
>>> However there is no such argument in "read.big.matrix". Although there
>>> is an argument "descriptorfile" in the function "as.big.matrix" but if
>>> I try to use it in "read.big.matrix", I get an error showing it as
>>> unused argument (as expected).
>>> _Problem-2:_
>>> I want to get a filebacked *sub*matrix of "x", say only selected
>>> columns: x[, 1:100]. Is there any way of doing that without actually
>>> loading the data into R memory.
>>> _Problem-3:_
>>> There are functions available like:  summary, colmean, colsd, ... for
>>> standard summary statistics. But is there any way to calculate other
>>> summaries say number of missing values or skewness of each variable,
>>> without loading the whole data into R memory.
>>> Regards
>>> Utkarsh
>>>
>>> --
>>> John W. Emerson (Jay)
>>> Assistant Professor of Statistics
>>> Department of Statistics
>>> Yale University
>>> http://www.stat.yale.edu/~jay
>>>
>>>
>>
>>
>>
>>
>
>

-- 
John W. Emerson (Jay)
Assistant Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay