[Bioc-devel] any interest in a BiocMatrix core package?

Wolfgang Huber wolfgang.huber at embl.de
Fri Mar 3 09:46:11 CET 2017


Dear Aaron

Thank you. I think it's an important simplification of a potential API 
when you are saying that what you mostly need are accessors
   m[i, ] and m[, i]
with i scalar or a short contiguous range, such that the value of that 
could be a relatively small ordinary matrix. (Compared to operations 
like matrix multiplication, SVD or other decompositions.)

	Wolfgang

PS Loops per se in today's R are not as slow as some think: depending on 
the algorithm, the time "wasted" by the R interpreter on looking up 
symbols etc may (or may not) be negligible compared to the actual 
computations that are done at the C level anyway:

g = function(n) {
     s = 0
     for (i in seq_len(n))
         s = s + i
     s
}

cg = compiler::cmpfun(g)

print(system.time( g(1e6)))
    user  system elapsed
   0.161   0.000   0.161

print(system.time(cg(1e6)))
   user  system elapsed
   0.043   0.000   0.043



2.3.17 20:05, Aaron Lun scripsit:
> I'll give two examples from the scran package. In both cases, the count
> matrix is such that rows are genes and columns are cells. The first
> example involves cell cycle phase assignment (from the cyclone()
> function, FYI). Briefly, upon entry to C++, the function:
>
> 1. Loops through the cells, one at a time.
> 2. For each cell, it applies a classifier to the counts for that cell
> (i.e., a column of the count matrix). This is not a straightforward
> operation and also involves a number of random permutations.
> 3. Returns a set of scores representing the phase assignment.
>
> For a few cells, I could conceivably move the loop into R and just
> supply the column counts for each cell via .Call, which would avoid the
> need to interact with the matrix in C++. However, if I were to process
> one million cells, the slowness of R's loops would really hurt.
>
> The second example involves normalization using a pooling and
> deconvolution algorithm (from the computeSumFactors() function). Upon
> entry into C++, the function:
>
> 1. Loops through an ordered set of cells.
> 2. At each cell, the neighbouring set of 20-100 cells defines a sliding
> window. Counts for all cells in the window are summed together to create
> a pooled expression profile.
> 3. The pooled profile is used to obtain a size factor, by computing the
> median of the ratios between the pool and a pseudo-cell.
> 4. This is repeated for all cells in the set (i.e., all positions of the
> window). Each window corresponds to a pool; the function stores the
> identity of the cells in the pool and the size factor for the pool.
>
> The output is used to construct a linear system at the R level, which is
> solved to obtain cell-specific size factors. Again, the work done within
> the loop is not obviously vectorizable with standard functions.
>
> All of the cases I work with involve processing one row or column at a
> time; I generally don't do matrix operations that require random access,
> at least not at the C++ level.
>
> Another motivation for moving into C++ is the greater control over
> memory management. For a decent number of cells, this can make the
> difference between something being runnable or not.
>
> Cheers,
>
> Aaron
>
> On 02/03/17 18:09, Wolfgang Huber wrote:
>> Aaron
>>
>> Can you describe use cases, i.e. intended computations on these
>> matrices, esp. those for which C++ access is needed for?
>>
>> I'm asking b/c the goals of efficient code and abstraction from how the
>> data are stored may be conflicting - in which case critical algorithms
>> may end up circumventing a prematurely defined API.
>>
>> Wolfgang
>>
>>
>> 25.2.17 00:37, Aaron Lun scripsit:
>>> Yes, I think double-precision would be necessary for general use. Only
>>> the raw count data would be integer, and even then that's not
>>> guaranteed (e.g., if people are using kallisto or salmon for
>>> quantification).
>>>
>>>
>>> -Aaron
>>>
>>>
>>> ________________________________
>>> From: Vincent Carey <stvjc at channing.harvard.edu>
>>> Sent: Saturday, 25 February 2017 9:25 AM
>>> To: Aaron Lun
>>> Cc: Tim Triche, Jr.; bioc-devel at r-project.org
>>> Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>>>
>>> What is the data type for an expression value?  Is it assumed that
>>> double precision will be needed?
>>>
>>> On Fri, Feb 24, 2017 at 4:50 PM, Aaron Lun
>>> <alun at wehi.edu.au<mailto:alun at wehi.edu.au>> wrote:
>>> It's a good place to start, though it would be very handy to have a
>>> C(++) API that can be linked against. I'm not sure how much work that
>>> would entail but it would give downstream developers a lot more
>>> options. Sort of like how we can link to Rhtslib, which speeds up a
>>> lot of BAM file processing, instead of just relying on Rsamtools.
>>>
>>>
>>> -Aaron
>>>
>>> ________________________________
>>> From: Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com>>
>>> Sent: Saturday, 25 February 2017 8:34:58 AM
>>> To: Aaron Lun
>>> Cc: bioc-devel at r-project.org<mailto:bioc-devel at r-project.org>
>>> Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>>>
>>> yes
>>>
>>> the DelayedArray framework that handles HDF5Array, etc. seems like the
>>> right choice?
>>>
>>> --t
>>>
>>> On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun
>>> <alun at wehi.edu.au<mailto:alun at wehi.edu.au><mailto:alun at wehi.edu.au<mailto:alun at wehi.edu.au>>>
>>> wrote:
>>> Hi everyone,
>>>
>>> I just attended the Human Cell Atlas meeting in Stanford, and people
>>> were talking about gene expression matrices for >1 million cells. If
>>> we assume that we can get non-zero expression profiles for ~5000
>>> genes, we�d be talking about a 5000 x 1 million matrix for the raw
>>> count data. This would be 20-40 GB in size, which would clearly
>>> benefit from sparse (via Matrix) or disk-backed representations
>>> (bigmatrix, BufferedMatrix, rhdf5, etc.).
>>>
>>> I�m wondering whether there is any appetite amongst us for making a
>>> consistent BioC API to handle these matrices, sort of like what
>>> BiocParallel does for multicore and snow. It goes without saying that
>>> the different matrix representations should have consistent functions
>>> at the R level (rbind/cbind, etc.) but it would also be nice to have
>>> an integrated C/C++ API (accessible via LinkedTo). There�s many
>>> non-trivial things that can be done with this type of data, and it is
>>> often faster and more memory efficient to do these complex operations
>>> in compiled code.
>>>
>>> I was thinking of something that you could supply any supported matrix
>>> representation to a registered function via .Call; the C++ constructor
>>> would recognise the type of matrix during class instantiation; and
>>> operations (row/column/random read access, also possibly various ways
>>> of writing a matrix) would be overloaded and behave as required for
>>> the class. Only the implementation of the API would need to care about
>>> the nitty gritty of each representation, and we would all be free to
>>> write code that actually does the interesting analytical stuff.
>>>
>>> Anyway, just throwing some thoughts out there. Any comments appreciated.
>>>
>>> Cheers,
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



More information about the Bioc-devel mailing list