[Bioc-devel] any interest in a BiocMatrix core package?

Aaron Lun alun at wehi.edu.au
Thu Mar 2 20:05:46 CET 2017

I'll give two examples from the scran package. In both cases, the count 
matrix is such that rows are genes and columns are cells. The first 
example involves cell cycle phase assignment (from the cyclone() 
function, FYI). Briefly, upon entry to C++, the function:

1. Loops through the cells, one at a time.
2. For each cell, it applies a classifier to the counts for that cell 
(i.e., a column of the count matrix). This is not a straightforward 
operation and also involves a number of random permutations.
3. Returns a set of scores representing the phase assignment.

For a few cells, I could conceivably move the loop into R and just 
supply the column counts for each cell via .Call, which would avoid the 
need to interact with the matrix in C++. However, if I were to process 
one million cells, the slowness of R's loops would really hurt.

The second example involves normalization using a pooling and 
deconvolution algorithm (from the computeSumFactors() function). Upon 
entry into C++, the function:

1. Loops through an ordered set of cells.
2. At each cell, the neighbouring set of 20-100 cells defines a sliding 
window. Counts for all cells in the window are summed together to create 
a pooled expression profile.
3. The pooled profile is used to obtain a size factor, by computing the 
median of the ratios between the pool and a pseudo-cell.
4. This is repeated for all cells in the set (i.e., all positions of the 
window). Each window corresponds to a pool; the function stores the 
identity of the cells in the pool and the size factor for the pool.

The output is used to construct a linear system at the R level, which is 
solved to obtain cell-specific size factors. Again, the work done within 
the loop is not obviously vectorizable with standard functions.

All of the cases I work with involve processing one row or column at a 
time; I generally don't do matrix operations that require random access, 
at least not at the C++ level.

Another motivation for moving into C++ is the greater control over 
memory management. For a decent number of cells, this can make the 
difference between something being runnable or not.



On 02/03/17 18:09, Wolfgang Huber wrote:
> Aaron
> Can you describe use cases, i.e. intended computations on these
> matrices, esp. those for which C++ access is needed for?
> I'm asking b/c the goals of efficient code and abstraction from how the
> data are stored may be conflicting - in which case critical algorithms
> may end up circumventing a prematurely defined API.
> Wolfgang
> 25.2.17 00:37, Aaron Lun scripsit:
>> Yes, I think double-precision would be necessary for general use. Only
>> the raw count data would be integer, and even then that's not
>> guaranteed (e.g., if people are using kallisto or salmon for
>> quantification).
>> -Aaron
>> ________________________________
>> From: Vincent Carey <stvjc at channing.harvard.edu>
>> Sent: Saturday, 25 February 2017 9:25 AM
>> To: Aaron Lun
>> Cc: Tim Triche, Jr.; bioc-devel at r-project.org
>> Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>> What is the data type for an expression value?  Is it assumed that
>> double precision will be needed?
>> On Fri, Feb 24, 2017 at 4:50 PM, Aaron Lun
>> <alun at wehi.edu.au<mailto:alun at wehi.edu.au>> wrote:
>> It's a good place to start, though it would be very handy to have a
>> C(++) API that can be linked against. I'm not sure how much work that
>> would entail but it would give downstream developers a lot more
>> options. Sort of like how we can link to Rhtslib, which speeds up a
>> lot of BAM file processing, instead of just relying on Rsamtools.
>> -Aaron
>> ________________________________
>> From: Tim Triche, Jr. <tim.triche at gmail.com<mailto:tim.triche at gmail.com>>
>> Sent: Saturday, 25 February 2017 8:34:58 AM
>> To: Aaron Lun
>> Cc: bioc-devel at r-project.org<mailto:bioc-devel at r-project.org>
>> Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>> yes
>> the DelayedArray framework that handles HDF5Array, etc. seems like the
>> right choice?
>> --t
>> On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun
>> <alun at wehi.edu.au<mailto:alun at wehi.edu.au><mailto:alun at wehi.edu.au<mailto:alun at wehi.edu.au>>>
>> wrote:
>> Hi everyone,
>> I just attended the Human Cell Atlas meeting in Stanford, and people
>> were talking about gene expression matrices for >1 million cells. If
>> we assume that we can get non-zero expression profiles for ~5000
>> genes, we�d be talking about a 5000 x 1 million matrix for the raw
>> count data. This would be 20-40 GB in size, which would clearly
>> benefit from sparse (via Matrix) or disk-backed representations
>> (bigmatrix, BufferedMatrix, rhdf5, etc.).
>> I�m wondering whether there is any appetite amongst us for making a
>> consistent BioC API to handle these matrices, sort of like what
>> BiocParallel does for multicore and snow. It goes without saying that
>> the different matrix representations should have consistent functions
>> at the R level (rbind/cbind, etc.) but it would also be nice to have
>> an integrated C/C++ API (accessible via LinkedTo). There�s many
>> non-trivial things that can be done with this type of data, and it is
>> often faster and more memory efficient to do these complex operations
>> in compiled code.
>> I was thinking of something that you could supply any supported matrix
>> representation to a registered function via .Call; the C++ constructor
>> would recognise the type of matrix during class instantiation; and
>> operations (row/column/random read access, also possibly various ways
>> of writing a matrix) would be overloaded and behave as required for
>> the class. Only the implementation of the API would need to care about
>> the nitty gritty of each representation, and we would all be free to
>> write code that actually does the interesting analytical stuff.
>> Anyway, just throwing some thoughts out there. Any comments appreciated.
>> Cheers,
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

More information about the Bioc-devel mailing list