[Bioc-devel] any interest in a BiocMatrix core package?

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Fri Mar 3 17:27:13 CET 2017


On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:

>
>
> On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen <
> kasperdanielhansen at gmail.com> wrote:
>
>> Some comments on Aaron's stuff
>>
>> One possibility for doing things like this is if your code can be done in
>> C++ using a subset of rows or columns.  That can sometimes give the
>> necessary speed up.  What I mean is this
>>
>> Say you can safely process 1000 cells (not matrix cells, but biological
>> cells, aka columns) at a time in RAM
>>
>> iterate in R:
>>   get chunk i containing 1000 cells from the backend data storage
>>   do something on this sub matrix where everything is in a normal matrix
>> and you just use C++
>>   write results out to whatever backend you're using
>>
>> Then, with a million cells you iterate over 1000 chunks in R.  And you
>> don't need to "touch" the full dataset which can be stored on an arbitrary
>> backend.
>>
>
> you "touch" it, but you never ingest the whole thing at any time, is that
> what you mean?
>

Yes, you load the chunk into RAM and then just deal with it.

Think of fitting 10^10 linear models.  If this were 10^6 I would just use
lmFit.  But 10^10 doesn't fit into memory.  So I load 10^7 into memory, run
lmFit, store the results, and repeat.  This is bound to be much more efficient
than loading a single row into memory and calling lm 10^10 times, because
lmFit is written to fit many linear models at the same time.

I am suggesting that this is a potential general strategy.
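
A minimal C++ sketch of that load-a-chunk, process, store loop.  Everything
named here is an assumption for illustration: ToyBackend is a hypothetical
in-memory stand-in for whatever storage is really used (HDF5, matter, ...),
and a per-cell total stands in for a real kernel such as a model fit.  The
point is only that the kernel sees an ordinary dense block and never the
backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a storage backend. Genes are rows, cells are
// columns, and values are stored column-major (one cell after another).
struct ToyBackend {
    std::size_t n_genes;
    std::size_t n_cells;
    std::vector<double> values;  // length n_genes * n_cells

    // Copy cells [first, first + count) into an ordinary in-memory buffer.
    std::vector<double> read_columns(std::size_t first, std::size_t count) const {
        assert(first + count <= n_cells);
        auto begin = values.begin() + static_cast<std::ptrdiff_t>(first * n_genes);
        return std::vector<double>(begin, begin + static_cast<std::ptrdiff_t>(count * n_genes));
    }
};

// Per-chunk kernel: works on a plain dense block and knows nothing about
// where the data came from. Returns one total per cell in the chunk.
std::vector<double> column_totals(const std::vector<double>& chunk, std::size_t n_genes) {
    assert(n_genes > 0 && chunk.size() % n_genes == 0);
    std::vector<double> totals(chunk.size() / n_genes, 0.0);
    for (std::size_t j = 0; j < totals.size(); ++j)
        for (std::size_t i = 0; i < n_genes; ++i)
            totals[j] += chunk[j * n_genes + i];
    return totals;
}

// Driver: only chunk_size cells are held in memory at any one time.
std::vector<double> totals_by_chunk(const ToyBackend& backend, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<double> result;
    result.reserve(backend.n_cells);
    for (std::size_t first = 0; first < backend.n_cells; first += chunk_size) {
        const std::size_t count = std::min(chunk_size, backend.n_cells - first);
        const std::vector<double> chunk = backend.read_columns(first, count);
        const std::vector<double> totals = column_totals(chunk, backend.n_genes);
        result.insert(result.end(), totals.begin(), totals.end());
    }
    return result;
}
```

Swapping the backend only changes read_columns; the kernel and the driver
stay the same, and the result does not depend on the chunk size chosen.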


>> And this approach could be run even (potentially) with different chunks on
>> different nodes.
>>
>
> that seems to me to be an important if not essential desideratum.
>
> what then is the role of C++?  extracting a chunk?  preexisting utilities?
>

When I say C++ I just mean write an efficient implementation that works on
a chunk, like lmFit.  It is true that anything that works on a chunk will
work on a single row/column (like lmFit) but there are possibilities for
optimization when you work at the chunk level.
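
One illustration of a chunk-level optimization, as a hedged sketch rather
than a description of how lmFit is implemented: if every row in the chunk
is regressed on the same covariate x, the work that depends only on x
(centering it and computing its sum of squares) can be done once per chunk,
after which each row's slope is a single dot product.  The function name
chunk_slopes and the row-major layout are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fit y = a + b * x separately for every row of a chunk, where all rows
// share the same covariate x. The chunk is row-major with x.size() columns.
// Returns the fitted slope b for each row.
std::vector<double> chunk_slopes(const std::vector<double>& chunk,
                                 const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(n > 1 && chunk.size() % n == 0);

    // Done once per chunk: centre x and compute its sum of squares.
    double mean_x = 0.0;
    for (double v : x) mean_x += v;
    mean_x /= static_cast<double>(n);

    std::vector<double> weights(n);
    double sxx = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        weights[i] = x[i] - mean_x;
        sxx += weights[i] * weights[i];
    }
    assert(sxx > 0.0);  // x must not be constant
    for (double& w : weights) w /= sxx;

    // Done per row: slope = sum_i weights[i] * y[i]. The row mean of y
    // drops out because the centred x values sum to zero.
    const std::size_t n_rows = chunk.size() / n;
    std::vector<double> slopes(n_rows, 0.0);
    for (std::size_t r = 0; r < n_rows; ++r)
        for (std::size_t i = 0; i < n; ++i)
            slopes[r] += weights[i] * chunk[r * n + i];
    return slopes;
}
```

Calling a one-row routine once per row would redo the x-only work every
time; here it is amortized over the whole chunk.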

>> Obviously not all computations can be done chunkwise.  But for those that
>> can, this is a strategy which is independent of the data backend.
>>
>
> I wonder whether this "obviously not" needs to be rethought.  Algorithms
> that are implemented to work with data holistically may need
> to be reexpressed so that they can succeed with chunkwise access.  Is this
> a new mindset needed for holist developers, or can the
> effective data decompositions occur autonomously?
>

Well, I would say it is obvious that not all computations can be done
chunkwise.  But of course, in the limit of extremely large data, algorithms
that need to cycle over everything no longer scale.  So in that case the only
practical computations are the ones that can be done chunkwise, out of
necessity.  For single cell right now, where only millions of cells are on the
horizon, people will think that they can get "standard" holistic approaches to
work (and that is probably true).  If they had a billion cells they probably
wouldn't think that.

Kasper

>> If you need direct access to the data in the backend in C++  it will be
>> extremely backend dependent what is fast and how to do it.  That doesn't
>> mean we shouldn't do it though.
>>
>> Best,
>> Kasper
>>
>>
>>
>> On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey <stvjc at channing.harvard.edu>
>> wrote:
>>
>>> Kylie, thanks for reminding us of matter -- I saw you speak about this at
>>> the first Bioconductor Boston Meetup, but it
>>> went like lightning.   For developers contemplating an approach to
>>> representing high-volume rectangular data,
>>> where there is no dominant legacy format, it is natural to wonder whether
>>> HDF5 would be adequate, and,
>>> further, to wonder how to demonstrate that it is or is not dominated by
>>> some other approach for a given set
>>> of tasks.  Should we devise a set of bioinformatic benchmark problems to
>>> foster comparison and informed
>>> decision-making?  @becker.gabe: is ALTREP far enough along that one could
>>> contemplate benchmarking with it?
>>>
>>> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie <k.bemis at northeastern.edu>
>>> wrote:
>>>
>>> > It’s not there yet, but I plan to expose a C++ API for my disk-backed
>>> > matrix objects in the next version of my ‘matter’ package.
>>> >
>>> > It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
>>> > objects at the R level, especially if using a frontend like DelayedArray
>>> > on top of them, but it would be nice to have a common C++ API that I could
>>> > hook into as well (a la Rcpp), so new C/C++ could be re-used across
>>> > various backends more easily.
>>> >
>>> > Kylie
>>> >
>>> > ~~~
>>> > Kylie Ariel Bemis
>>> > Future Faculty Fellow
>>> > College of Computer and Information Science
>>> > Northeastern University
>>> > https://kuwisdelu.github.io
>>> >
>>> >
>>> >
>>> >
>>> > On Feb 24, 2017, at 4:50 PM, Aaron Lun <alun at wehi.edu.au> wrote:
>>> >
>>> > It's a good place to start, though it would be very handy to have a C(++)
>>> > API that can be linked against. I'm not sure how much work that would
>>> > entail but it would give downstream developers a lot more options. Sort of
>>> > like how we can link to Rhtslib, which speeds up a lot of BAM file
>>> > processing, instead of just relying on Rsamtools.
>>> >
>>> >
>>> > -Aaron
>>> >
>>> > ________________________________
>>> > From: Tim Triche, Jr. <tim.triche at gmail.com>
>>> > Sent: Saturday, 25 February 2017 8:34:58 AM
>>> > To: Aaron Lun
>>> > Cc: bioc-devel at r-project.org
>>> > Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>>> >
>>> > yes
>>> >
>>> > the DelayedArray framework that handles HDF5Array, etc. seems like the
>>> > right choice?
>>> >
>>> > --t
>>> >
>>> > On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <alun at wehi.edu.au> wrote:
>>> > Hi everyone,
>>> >
>>> > I just attended the Human Cell Atlas meeting in Stanford, and people were
>>> > talking about gene expression matrices for >1 million cells. If we assume
>>> > that we can get non-zero expression profiles for ~5000 genes, we'd be
>>> > talking about a 5000 x 1 million matrix for the raw count data. This would
>>> > be 20-40 GB in size, which would clearly benefit from sparse (via Matrix)
>>> > or disk-backed representations (bigmatrix, BufferedMatrix, rhdf5, etc.).
>>> >
>>> > I'm wondering whether there is any appetite amongst us for making a
>>> > consistent BioC API to handle these matrices, sort of like what
>>> > BiocParallel does for multicore and snow. It goes without saying that the
>>> > different matrix representations should have consistent functions at the R
>>> > level (rbind/cbind, etc.) but it would also be nice to have an integrated
>>> > C/C++ API (accessible via LinkingTo). There are many non-trivial things
>>> > that can be done with this type of data, and it is often faster and more
>>> > memory efficient to do these complex operations in compiled code.
>>> >
>>> > I was thinking of something where you could supply any supported matrix
>>> > representation to a registered function via .Call; the C++ constructor
>>> > would recognise the type of matrix during class instantiation; and
>>> > operations (row/column/random read access, also possibly various ways of
>>> > writing a matrix) would be overloaded and behave as required for the class.
>>> > Only the implementation of the API would need to care about the
>>> > nitty-gritty of each representation, and we would all be free to write code
>>> > that actually does the interesting analytical stuff.
>>> >
>>> > Anyway, just throwing some thoughts out there. Any comments appreciated.
>>> >
>>> > Cheers,
>>> >
>>> > Aaron
>>> >
>>> >        [[alternative HTML version deleted]]
>>> >
>>> >
>>> > _______________________________________________
>>> > Bioc-devel at r-project.org mailing list
>>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> >
>>> >
>>>
>>>
>>
>>
>
