[Bioc-devel] any interest in a BiocMatrix core package?

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Wed Nov 1 21:08:28 CET 2017


I think it makes sense. A lot of sense. Might be useful to involve Henrik
(matrixStats) as well.

Who are the players, apart from DelayedArray/DelayedMatrixStats and matter?
(and some very old stuff in Biobase which should really be deprecated in
favor of matrixStats).

Best,
Kasper

On Wed, Nov 1, 2017 at 3:03 PM, Bemis, Kylie <k.bemis at northeastern.edu>
wrote:

> Hi all,
>
> To continue a variant of this conversation, with the latest BioC release,
> we now have quite a few packages that are implementing various
> matrix-related S4 generic functions, many of them relying on matrixStats as
> a template.
>
> I was wondering if there is any interest or intention to create a common
> MatrixGenerics/ArrayGenerics package on which we can depend to import the
> relevant S4 generic functions. Although BiocGenerics has a few like
> ‘rowSums()’ and ‘colMeans()’, etc., there are many more that are
> implemented across ‘DelayedArray’, ‘DelayedMatrixStats’, my own package
> ‘matter’, etc., including ‘apply()’, ‘rowSds()’, ‘colVars()’, and so forth.
>
> It would be nice to have a single package with minimal additional
> dependencies (a la BiocGenerics) where we could import the various S4
> generics and avoid unwanted namespace collisions.
>
> Have there been any thoughts on this?
>
> Many thanks,
> Kylie
>
> ~~~
> Kylie Ariel Bemis
> Future Faculty Fellow
> College of Computer and Information Science
> Northeastern University
> https://kuwisdelu.github.io
>
>
>
>
> On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>
>
>
> On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
>
>
> On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
> Some comments on Aaron's stuff
>
> One possibility for doing things like this is if your code can be done in
> C++ using a subset of rows or columns.  That can sometimes give the
> necessary speed up.  What I mean is this
>
> Say you can safely process 1000 cells (not matrix cells, but biological
> cells, aka columns) at a time in RAM
>
> iterate in R:
>   get chunk i containing 1000 cells from the backend data storage
>   do something on this sub matrix where everything is in a normal matrix
> and you just use C++
>   write results out to whatever backend you're using
>
> Then, with a million cells you iterate over 1000 chunks in R.  And you
> don't need to "touch" the full dataset which can be stored on an arbitrary
> backend.
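The loop described above can be sketched as follows. This is only an illustration in plain Python rather than R/C++: the names `iter_column_chunks`, `process_in_chunks`, `read_chunk` and `write_result` are hypothetical, and an in-memory nested list stands in for whatever backend actually stores the data.

```python
from typing import Callable, Iterator, List

Block = List[List[float]]  # rows x columns, held fully in RAM


def iter_column_chunks(n_cols: int, chunk_size: int) -> Iterator[range]:
    """Yield consecutive column-index ranges of at most chunk_size columns."""
    for start in range(0, n_cols, chunk_size):
        yield range(start, min(start + chunk_size, n_cols))


def process_in_chunks(
    read_chunk: Callable[[range], Block],
    n_cols: int,
    chunk_size: int,
    kernel: Callable[[Block], List[float]],
    write_result: Callable[[range, List[float]], None],
) -> None:
    """Read one chunk, run the in-memory kernel on it, write the result out."""
    for cols in iter_column_chunks(n_cols, chunk_size):
        block = read_chunk(cols)            # only this chunk is in RAM
        write_result(cols, kernel(block))   # results go straight back out


def column_sums(block: Block) -> List[float]:
    """Stand-in for the fast compiled kernel: one sum per column."""
    return [sum(col) for col in zip(*block)]


# Toy "backend": 3 genes x 7 cells (assumption: any storage could sit here).
backend = [[r * 10 + c for c in range(7)] for r in range(3)]
results: List[float] = [0.0] * 7


def read_chunk(cols: range) -> Block:
    return [[row[c] for c in cols] for row in backend]


def write_result(cols: range, values: List[float]) -> None:
    for c, v in zip(cols, values):
        results[c] = v


process_in_chunks(read_chunk, n_cols=7, chunk_size=3,
                  kernel=column_sums, write_result=write_result)
print(results)
```

Only `read_chunk` and `write_result` know about the storage, so the kernel stays independent of the backend.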
>
> you "touch" it, but you never ingest the whole thing at any time, is that
> what you mean?
>
> Yes, you load the chunk into RAM and then just deal with it.
>
> Think of doing 10^10 linear models.  If this was 10^6 I would just use
> lmFit.  But 10^10 doesn't fit into memory.  So I load 10^7 into memory, run
> lmFit, store results, redo.  This is bound to be much more efficient than
> loading a single row into memory and doing lm 10^10 times, because lmFit is
> written to do many linear models at the same time.
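A rough illustration of why fitting a batch of models in one call can beat fitting one row at a time: when all models share one design, the work that depends only on the design is done once per batch. The sketch below is not limma's `lmFit`. It uses simple regression on a single shared covariate, and `fit_many` and `fit_in_batches` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Fit = Tuple[float, float]  # (intercept, slope)


def fit_many(x: Sequence[float],
             responses: Sequence[Sequence[float]]) -> List[Fit]:
    """Fit y = a + b*x for every response row, sharing one design x."""
    n = len(x)
    # Design-only quantities: computed once for the whole batch.
    x_mean = sum(x) / n
    xc = [xi - x_mean for xi in x]
    sxx = sum(v * v for v in xc)
    fits = []
    for y in responses:
        # Per-row work is a single dot product plus a mean.
        slope = sum(a * b for a, b in zip(xc, y)) / sxx
        intercept = sum(y) / n - slope * x_mean
        fits.append((intercept, slope))
    return fits


def fit_in_batches(x: Sequence[float],
                   responses: Sequence[Sequence[float]],
                   batch_size: int) -> List[Fit]:
    """Load batch_size rows at a time, fit them together, keep the results."""
    fits: List[Fit] = []
    for start in range(0, len(responses), batch_size):
        fits.extend(fit_many(x, responses[start:start + batch_size]))
    return fits


x = [0.0, 1.0, 2.0, 3.0]
responses = [
    [1.0 + 2.0 * xi for xi in x],   # intercept 1, slope 2
    [5.0 - 1.0 * xi for xi in x],   # intercept 5, slope -1
    [3.0 for _ in x],               # intercept 3, slope 0
]
fits = fit_in_batches(x, responses, batch_size=2)
print(fits)
```

The batch size only controls how much is held in memory at once. The fitted coefficients are the same as for a row-by-row loop.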
>
> I am suggesting that this is a potential general strategy.
>
>
> And this approach could be run even (potentially) with different chunks on
> different nodes.
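Because each chunk is processed independently, the same loop can be handed to a pool of workers. The sketch below uses threads from the Python standard library purely as a stand-in for separate nodes. `chunk_ranges` and `column_means` are hypothetical names, and the toy `backend` is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Toy backend: 4 genes x 10 cells; a real one would live on disk or remotely.
backend = [[float(r + c) for c in range(10)] for r in range(4)]


def chunk_ranges(n_cols: int, chunk_size: int) -> List[range]:
    """Split the column indices into consecutive chunks."""
    return [range(s, min(s + chunk_size, n_cols))
            for s in range(0, n_cols, chunk_size)]


def column_means(cols: range) -> List[float]:
    """Work done by one worker: it touches only its own chunk of columns."""
    n_rows = len(backend)
    return [sum(row[c] for row in backend) / n_rows for c in cols]


chunks = chunk_ranges(n_cols=10, chunk_size=4)
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in chunk order, so reassembly is a simple flatten.
    per_chunk = list(pool.map(column_means, chunks))

means = [m for chunk in per_chunk for m in chunk]
print(means)
```

No worker needs to see another worker's chunk, which is what makes the approach distributable.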
>
> that seems to me to be an important if not essential desideratum.
>
> what then is the role of C++?  extracting a chunk?  preexisting utilities?
>
> When I say C++ I just mean write an efficient implementation that works on
> a chunk, like lmFit.  It is true that anything that works on a chunk will
> work on a single row/column (like lmFit) but there are possibilities for
> optimization when you work at the chunk level.
>
> Obviously not all computations can be done chunkwise.  But for those that
> can, this is a strategy which is independent of the data backend.
>
> I wonder whether this "obviously not" needs to be rethought.  Algorithms
> that are implemented to work with data holistically may need
> to be reexpressed so that they can succeed with chunkwise access.  Is this
> a new mindset needed for holist developers, or can the
> effective data decompositions occur autonomously?
>
> Well, I would say it is obvious that not all computations can be done
> chunkwise.  But of course, in the limit of extremely large data, algorithms
> which need to cycle over everything no longer scale.  So in that case all
> practical computations can be done chunkwise, out of necessity.  For single
> cell right now where it is just millions of cells on the horizon people
> will think that they can get "standard" holistic approaches to work (and
> that is probably true).  If they had a billion cells they probably wouldn't
> think about that.
>
> Kasper
>
> If you need direct access to the data in the backend in C++  it will be
> extremely backend dependent what is fast and how to do it.  That doesn't
> mean we shouldn't do it though.
>
> Best,
> Kasper
>
>
>
> On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
> Kylie, thanks for reminding us of matter -- I saw you speak about this at
> the first Bioconductor Boston Meetup, but it
> went like lightning.   For developers contemplating an approach to
> representing high-volume rectangular data,
> where there is no dominant legacy format, it is natural to wonder whether
> HDF5 would be adequate, and,
> further, to wonder how to demonstrate that it is or is not dominated by
> some other approach for a given set
> of tasks.  Should we devise a set of bioinformatic benchmark problems to
> foster comparison and informed
> decisionmaking?  @becker.gabe: is ALTREP far enough along that one could
> contemplate benchmarking with it?
>
> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie
> <k.bemis at northeastern.edu> wrote:
>
> > It’s not there yet, but I plan to expose a C++ API for my disk-backed
> > matrix objects in the next version of my ‘matter’ package.
> >
> > It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
> > objects at the R level, especially if using a frontend like DelayedArray
> > on top of them, but it would be nice to have a common C++ API that I
> > could hook into as well (a la Rcpp), so new C/C++ could be re-used
> > across various backends more easily.
> >
> > Kylie
> >
> > ~~~
> > Kylie Ariel Bemis
> > Future Faculty Fellow
> > College of Computer and Information Science
> > Northeastern University
> > https://kuwisdelu.github.io
> >
> >
> >
> >
> > On Feb 24, 2017, at 4:50 PM, Aaron Lun <alun at wehi.edu.au> wrote:
> >
> > It's a good place to start, though it would be very handy to have a C(++)
> > API that can be linked against. I'm not sure how much work that would
> > entail but it would give downstream developers a lot more options. Sort
> > of like how we can link to Rhtslib, which speeds up a lot of BAM file
> > processing, instead of just relying on Rsamtools.
> >
> >
> > -Aaron
> >
> > ________________________________
> > From: Tim Triche, Jr. <tim.triche at gmail.com>
> > Sent: Saturday, 25 February 2017 8:34:58 AM
> > To: Aaron Lun
> > Cc: bioc-devel at r-project.org
> > Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
> >
> > yes
> >
> > the DelayedArray framework that handles HDF5Array, etc. seems like the
> > right choice?
> >
> > --t
> >
> > On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <alun at wehi.edu.au> wrote:
> > Hi everyone,
> >
> > I just attended the Human Cell Atlas meeting in Stanford, and people were
> > talking about gene expression matrices for >1 million cells. If we assume
> > that we can get non-zero expression profiles for ~5000 genes, we’d be
> > talking about a 5000 x 1 million matrix for the raw count data. This
> > would be 20-40 GB in size, which would clearly benefit from sparse (via
> > Matrix) or disk-backed representations (bigmatrix, BufferedMatrix, rhdf5,
> > etc.).
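The 20-40 GB figure follows from simple arithmetic: 5000 x 1 million entries at 4 bytes each (R integers) or 8 bytes each (R doubles). The sketch below reproduces it and adds a rough compressed-column sparse estimate. The 10% non-zero density is purely an assumption for illustration, not a number from this thread.

```python
n_genes = 5_000
n_cells = 1_000_000
n_entries = n_genes * n_cells            # 5e9 matrix entries

GB = 10**9
dense_int_bytes = n_entries * 4          # 4-byte integers
dense_double_bytes = n_entries * 8       # 8-byte doubles
print(dense_int_bytes / GB)              # 20.0
print(dense_double_bytes / GB)           # 40.0

# Compressed sparse column layout (as in Matrix's dgCMatrix): each non-zero
# stores an 8-byte value and a 4-byte row index, plus one 4-byte pointer per
# column (n_cells + 1 of them).
assumed_nonzero_fraction_percent = 10    # ASSUMPTION, for illustration only
n_nonzero = n_entries * assumed_nonzero_fraction_percent // 100
sparse_bytes = n_nonzero * (8 + 4) + 4 * (n_cells + 1)
print(round(sparse_bytes / GB, 1))
```

At 12 bytes per stored non-zero, the sparse form is smaller than the dense double matrix only while fewer than roughly two thirds of the entries are non-zero.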
> >
> > I’m wondering whether there is any appetite amongst us for making a
> > consistent BioC API to handle these matrices, sort of like what
> > BiocParallel does for multicore and snow. It goes without saying that the
> > different matrix representations should have consistent functions at the
> > R level (rbind/cbind, etc.) but it would also be nice to have an
> > integrated C/C++ API (accessible via LinkingTo). There are many
> > non-trivial things that can be done with this type of data, and it is
> > often faster and more memory efficient to do these complex operations in
> > compiled code.
> >
> > I was thinking of something where you could supply any supported matrix
> > representation to a registered function via .Call; the C++ constructor
> > would recognise the type of matrix during class instantiation; and
> > operations (row/column/random read access, also possibly various ways of
> > writing a matrix) would be overloaded and behave as required for the
> > class. Only the implementation of the API would need to care about the
> > nitty gritty of each representation, and we would all be free to write
> > code that actually does the interesting analytical stuff.
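The shape of the proposal above can be sketched in a few lines. The proposal itself is for a C++ API. This is a Python analogue, and every name in it (`MatrixView`, `DenseRows`, `SparseColumns`, `wrap`, `column_totals`) is hypothetical, as are the two toy representations.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class MatrixView(ABC):
    """Common read interface; analysis code depends only on this."""

    @property
    @abstractmethod
    def shape(self) -> Tuple[int, int]: ...

    @abstractmethod
    def get_col(self, j: int) -> List[float]: ...

    def get_row(self, i: int) -> List[float]:
        # Generic fallback; a backend may override it with something faster.
        return [self.get_col(j)[i] for j in range(self.shape[1])]


class DenseRows(MatrixView):
    """Backend 1: ordinary list of rows."""

    def __init__(self, rows: List[List[float]]):
        self._rows = rows

    @property
    def shape(self) -> Tuple[int, int]:
        return (len(self._rows), len(self._rows[0]) if self._rows else 0)

    def get_col(self, j: int) -> List[float]:
        return [row[j] for row in self._rows]

    def get_row(self, i: int) -> List[float]:
        return list(self._rows[i])


class SparseColumns(MatrixView):
    """Backend 2: one {row_index: value} dict per column."""

    def __init__(self, n_rows: int, columns: List[Dict[int, float]]):
        self._n_rows, self._columns = n_rows, columns

    @property
    def shape(self) -> Tuple[int, int]:
        return (self._n_rows, len(self._columns))

    def get_col(self, j: int) -> List[float]:
        col = self._columns[j]
        return [col.get(i, 0.0) for i in range(self._n_rows)]


def wrap(obj) -> MatrixView:
    """Recognise the representation at construction time."""
    if isinstance(obj, MatrixView):
        return obj
    if isinstance(obj, dict):
        return SparseColumns(obj["n_rows"], obj["columns"])
    if isinstance(obj, list):
        return DenseRows(obj)
    raise TypeError(f"unsupported matrix representation: {type(obj).__name__}")


def column_totals(obj) -> List[float]:
    """'Interesting analytical stuff' written once against the interface."""
    m = wrap(obj)
    return [sum(m.get_col(j)) for j in range(m.shape[1])]


dense = [[1.0, 0.0, 2.0], [0.0, 0.0, 3.0]]
sparse = {"n_rows": 2, "columns": [{0: 1.0}, {}, {0: 2.0, 1: 3.0}]}
print(column_totals(dense), column_totals(sparse))
```

Supporting a new representation then means adding one subclass and one branch in `wrap`, with `column_totals` left unchanged.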
> >
> > Anyway, just throwing some thoughts out there. Any comments appreciated.
> >
> > Cheers,
> >
> > Aaron
> >
> >        [[alternative HTML version deleted]]
> >
> >
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >

	[[alternative HTML version deleted]]


