[Bioc-devel] any interest in a BiocMatrix core package?
Peter Hickey
peter.hickey at gmail.com
Wed Nov 1 21:44:11 CET 2017
I think that's a good idea, Kylie.
Pete (DelayedMatrixStats developer)
On Thu., 2 Nov. 2017, 6:09 am Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
> I think it makes sense. A lot of sense. Might be useful to involve Henrik
> (matrixStats) as well.
>
> Who are the players, apart from DelayedArray/DelayedMatrixStats and matter?
> (and some very old stuff in Biobase which should really be deprecated in
> favor of matrixStats).
>
> Best,
> Kasper
>
> On Wed, Nov 1, 2017 at 3:03 PM, Bemis, Kylie <k.bemis at northeastern.edu>
> wrote:
>
> > Hi all,
> >
> > To continue a variant of this conversation: with the latest BioC release,
> > we now have quite a few packages that are implementing various
> > matrix-related S4 generic functions, many of them relying on matrixStats
> > as a template.
> >
> > I was wondering if there is any interest or intention to create a common
> > MatrixGenerics/ArrayGenerics package on which we can depend to import the
> > relevant S4 generic functions. Although BiocGenerics has a few, like
> > ‘rowSums()’ and ‘colMeans()’, there are many more that are
> > implemented across ‘DelayedArray’, ‘DelayedMatrixStats’, my own package
> > ‘matter’, etc., including ‘apply()’, ‘rowSds()’, ‘colVars()’, and so
> > forth.
> >
> > It would be nice to have a single package with minimal additional
> > dependencies (a la BiocGenerics) where we could import the various S4
> > generics and avoid unwanted namespace collisions.
> >
> > Have there been any thoughts on this?
> >
> > Many thanks,
> > Kylie
> >
> > ~~~
> > Kylie Ariel Bemis
> > Future Faculty Fellow
> > College of Computer and Information Science
> > Northeastern University
> > https://kuwisdelu.github.io
> >
> >
> >
> >
> > On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
> >
> > On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
> >
> > On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
> > Some comments on Aaron's stuff.
> >
> > One possibility for doing things like this is if your code can be done in
> > C++ using a subset of rows or columns. That can sometimes give the
> > necessary speed up. What I mean is this
> >
> > Say you can safely process 1000 cells (not matrix cells, but biological
> > cells, aka columns) at a time in RAM.
> >
> > iterate in R:
> >   get chunk i containing 1000 cells from the backend data storage
> >   do something on this submatrix, where everything is in a normal matrix
> >   and you just use C++
> >   write results out to whatever backend you're using
> >
> > Then, with a million cells you iterate over 1000 chunks in R. And you
> > don't need to "touch" the full dataset, which can be stored on an
> > arbitrary backend.
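
The chunked loop described above can be sketched in C++ roughly as follows. This is an illustration under assumptions, not code from the thread: `Chunk`, `ChunkLoader`, `ChunkKernel`, `make_memory_loader`, `process_in_chunks` and `column_sums` are hypothetical names, and the "backend" is just an in-memory vector standing in for HDF5, matter or any other store. The point is only that the driver sees one dense chunk at a time and the kernel never needs to know where the data came from.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// A dense, column-major block: `nrow` genes by `ncol` cells.
struct Chunk {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<double> values;  // length nrow * ncol
};

// Hypothetical backend hook: return columns [first, first + count) as a dense chunk.
using ChunkLoader = std::function<Chunk(std::size_t first, std::size_t count)>;
// Hypothetical kernel: takes one dense chunk, returns one value per column.
using ChunkKernel = std::function<std::vector<double>(const Chunk&)>;

// Stand-in backend for illustration: the whole matrix lives in a vector.
// A real backend would read from disk here instead.
ChunkLoader make_memory_loader(std::vector<double> data, std::size_t nrow) {
    return [data = std::move(data), nrow](std::size_t first, std::size_t count) {
        const auto begin = data.begin() + static_cast<std::ptrdiff_t>(first * nrow);
        const auto end = begin + static_cast<std::ptrdiff_t>(count * nrow);
        return Chunk{nrow, count, std::vector<double>(begin, end)};
    };
}

// Walk over the columns in blocks of `chunk_size`, hand each block to `kernel`,
// and append its per-column results. Only one chunk is held at a time.
std::vector<double> process_in_chunks(std::size_t total_cols, std::size_t chunk_size,
                                      const ChunkLoader& load, const ChunkKernel& kernel) {
    assert(chunk_size > 0);
    std::vector<double> results;
    results.reserve(total_cols);
    for (std::size_t first = 0; first < total_cols; first += chunk_size) {
        const std::size_t count = std::min(chunk_size, total_cols - first);
        const Chunk chunk = load(first, count);
        const std::vector<double> out = kernel(chunk);
        results.insert(results.end(), out.begin(), out.end());
    }
    return results;
}

// Example kernel: total count per cell (column) within the chunk.
std::vector<double> column_sums(const Chunk& chunk) {
    std::vector<double> sums(chunk.ncol, 0.0);
    for (std::size_t j = 0; j < chunk.ncol; ++j)
        for (std::size_t i = 0; i < chunk.nrow; ++i)
            sums[j] += chunk.values[j * chunk.nrow + i];
    return sums;
}
```

With a million cells and a chunk size of 1000, `process_in_chunks` would call the loader 1000 times; swapping the backend means swapping only the loader.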
> >
> > you "touch" it, but you never ingest the whole thing at any time, is that
> > what you mean?
> >
> > Yes, you load the chunk into RAM and then just deal with it.
> >
> > Think of doing 10^10 linear models. If this was 10^6 I would just use
> > lmFit. But 10^10 doesn't fit into memory. So I load 10^7 into memory,
> > run lmFit, store results, redo. This is bound to be much more efficient than
> > loading a single row into memory and doing lm 10^10 times, because lmFit
> > is written to do many linear models at the same time.
> >
> > I am suggesting that this is a potential general strategy.
> >
> >
> > And this approach could be run even (potentially) with different chunks
> > on different nodes.
> >
> > that seems to me to be an important if not essential desideratum.
> >
> > what then is the role of C++? extracting a chunk? preexisting
> > utilities?
> >
> > When I say C++ I just mean write an efficient implementation that works
> > on a chunk, like lmFit. It is true that anything that works on a chunk will
> > work on a single row/column (like lmFit) but there are possibilities for
> > optimization when you work at the chunk level.
> >
> > Obviously not all computations can be done chunkwise. But for those that
> > can, this is a strategy which is independent of the data backend.
> >
> > I wonder whether this "obviously not" needs to be rethought. Algorithms
> > that are implemented to work with data holistically may need
> > to be reexpressed so that they can succeed with chunkwise access. Is
> > this a new mindset needed for holist developers, or can the
> > effective data decompositions occur autonomously?
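
One small, well-known instance of re-expressing a whole-data computation for chunkwise access is the variance. The sketch below is not from the thread: the names `Summary`, `summarise`, `merge` and `sample_variance` are hypothetical, and the technique is the standard pairwise update of count, mean and sum of squared deviations (Welford within a chunk, the Chan-Golub-LeVeque formula to combine chunks). Each chunk is summarised on its own, and summaries are merged afterwards, so chunks could in principle be handled on different nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Summary of a set of values: count, mean, and the sum of squared
// deviations from the mean (often called M2).
struct Summary {
    std::size_t n = 0;
    double mean = 0.0;
    double m2 = 0.0;
};

// Summarise a single chunk in one pass (Welford's update).
Summary summarise(const std::vector<double>& chunk) {
    Summary s;
    for (double v : chunk) {
        ++s.n;
        const double delta = v - s.mean;
        s.mean += delta / static_cast<double>(s.n);
        s.m2 += delta * (v - s.mean);
    }
    return s;
}

// Combine the summaries of two disjoint chunks without revisiting the data.
Summary merge(const Summary& a, const Summary& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    const double na = static_cast<double>(a.n);
    const double nb = static_cast<double>(b.n);
    const double n = na + nb;
    const double delta = b.mean - a.mean;
    Summary out;
    out.n = a.n + b.n;
    out.mean = a.mean + delta * nb / n;
    out.m2 = a.m2 + b.m2 + delta * delta * na * nb / n;
    return out;
}

// Unbiased sample variance from a summary; needs at least two values.
double sample_variance(const Summary& s) {
    assert(s.n > 1);
    return s.m2 / static_cast<double>(s.n - 1);
}
```

Because `merge` only needs the three numbers per chunk, the order in which chunks arrive does not matter for the final result (up to floating-point rounding).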
> >
> > Well, I would say it is obvious that not all computations can be done
> > chunkwise. But of course, in the limit of extremely large data,
> > algorithms which need to cycle over everything no longer scale. So in that
> > case all practical computations can be done chunkwise, out of necessity. For
> > single cell right now, where it is just millions of cells on the horizon,
> > people will think that they can get "standard" holistic approaches to work
> > (and that is probably true). If they had a billion cells they probably
> > wouldn't think about that.
> >
> > Kasper
> >
> > If you need direct access to the data in the backend in C++ it will be
> > extremely backend dependent what is fast and how to do it. That doesn't
> > mean we shouldn't do it though.
> >
> > Best,
> > Kasper
> >
> >
> >
> > On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
> > Kylie, thanks for reminding us of matter -- I saw you speak about this at
> > the first Bioconductor Boston Meetup, but it
> > went like lightning. For developers contemplating an approach to
> > representing high-volume rectangular data,
> > where there is no dominant legacy format, it is natural to wonder whether
> > HDF5 would be adequate, and,
> > further, to wonder how to demonstrate that it is or is not dominated by
> > some other approach for a given set
> > of tasks. Should we devise a set of bioinformatic benchmark problems to
> > foster comparison and informed
> > decisionmaking? @becker.gabe: is ALTREP far enough along that one could
> > contemplate benchmarking with it?
> >
> > On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie <k.bemis at northeastern.edu> wrote:
> >
> > > It’s not there yet, but I plan to expose a C++ API for my disk-backed
> > > matrix objects in the next version of my ‘matter’ package.
> > >
> > > It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
> > > objects at the R level, especially if using a frontend like
> > > DelayedArray on top of them, but it would be nice to have a common C++ API
> > > that I could hook into as well (a la Rcpp), so new C/C++ could be re-used
> > > across various backends more easily.
> > >
> > > Kylie
> > >
> > > ~~~
> > > Kylie Ariel Bemis
> > > Future Faculty Fellow
> > > College of Computer and Information Science
> > > Northeastern University
> > > https://kuwisdelu.github.io
> > >
> > >
> > >
> > >
> > > On Feb 24, 2017, at 4:50 PM, Aaron Lun <alun at wehi.edu.au> wrote:
> > >
> > > It's a good place to start, though it would be very handy to have a
> > > C(++) API that can be linked against. I'm not sure how much work that
> > > would entail but it would give downstream developers a lot more options.
> > > Sort of like how we can link to Rhtslib, which speeds up a lot of BAM file
> > > processing, instead of just relying on Rsamtools.
> > >
> > >
> > > -Aaron
> > >
> > > ________________________________
> > > From: Tim Triche, Jr. <tim.triche at gmail.com>
> > > Sent: Saturday, 25 February 2017 8:34:58 AM
> > > To: Aaron Lun
> > > Cc: bioc-devel at r-project.org
> > > Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
> > >
> > > yes
> > >
> > > the DelayedArray framework that handles HDF5Array, etc. seems like the
> > > right choice?
> > >
> > > --t
> > >
> > > On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <alun at wehi.edu.au> wrote:
> > > Hi everyone,
> > >
> > > I just attended the Human Cell Atlas meeting in Stanford, and people
> > > were talking about gene expression matrices for >1 million cells. If we
> > > assume that we can get non-zero expression profiles for ~5000 genes, we'd
> > > be talking about a 5000 x 1 million matrix for the raw count data. This
> > > would be 20-40 GB in size, which would clearly benefit from sparse (via
> > > Matrix) or disk-backed representations (bigmatrix, BufferedMatrix, rhdf5,
> > > etc.).
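
The 20-40 GB range can be checked with a few lines of arithmetic. The sketch below assumes (this is an inference, not stated in the message) that the lower figure corresponds to 4-byte values, as in R's integer type, and the upper to 8-byte values, as in R's double, with sizes in decimal gigabytes. The function `dense_bytes` and the constants are hypothetical names.

```cpp
#include <cstdint>

// Bytes needed for a dense genes x cells matrix at a given element width.
constexpr std::uint64_t dense_bytes(std::uint64_t genes, std::uint64_t cells,
                                    std::uint64_t bytes_per_value) {
    return genes * cells * bytes_per_value;
}

constexpr std::uint64_t kGenes = 5000;     // ~5000 expressed genes
constexpr std::uint64_t kCells = 1000000;  // 1 million cells

// 5e9 values at 4 bytes each: 2e10 bytes, i.e. 20 GB.
static_assert(dense_bytes(kGenes, kCells, 4) == 20000000000ULL,
              "integer storage should come to 20 GB");
// 5e9 values at 8 bytes each: 4e10 bytes, i.e. 40 GB.
static_assert(dense_bytes(kGenes, kCells, 8) == 40000000000ULL,
              "double storage should come to 40 GB");
```

Either way the dense matrix exceeds the RAM of a typical workstation, which is the motivation for the sparse and disk-backed options listed above.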
> > >
> > > I'm wondering whether there is any appetite amongst us for making a
> > > consistent BioC API to handle these matrices, sort of like what
> > > BiocParallel does for multicore and snow. It goes without saying that
> > > the different matrix representations should have consistent functions at
> > > the R level (rbind/cbind, etc.) but it would also be nice to have an
> > > integrated C/C++ API (accessible via LinkingTo). There are many non-trivial
> > > things that can be done with this type of data, and it is often faster and
> > > more memory efficient to do these complex operations in compiled code.
> > >
> > > I was thinking of something where you could supply any supported matrix
> > > representation to a registered function via .Call; the C++ constructor
> > > would recognise the type of matrix during class instantiation; and
> > > operations (row/column/random read access, also possibly various ways
> > > of writing a matrix) would be overloaded and behave as required for the
> > > class. Only the implementation of the API would need to care about the
> > > nitty gritty of each representation, and we would all be free to write
> > > code that actually does the interesting analytical stuff.
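
A rough sketch of the kind of interface proposed above might look like the following. Everything here is hypothetical and for illustration only: `MatrixView`, `DenseView`, `SparseView` and `column_total` are invented names, the R side (receiving an object via `.Call` and picking the right class from its type) is left out so the example stays self-contained, and the sparse layout simply follows the usual compressed sparse column convention of row indices, column pointers and values.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical common read interface that every representation implements.
class MatrixView {
public:
    virtual ~MatrixView() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Return column j as a dense vector of length nrow().
    virtual std::vector<double> column(std::size_t j) const = 0;
};

// Ordinary column-major dense storage.
class DenseView : public MatrixView {
public:
    DenseView(std::size_t nrow, std::size_t ncol, std::vector<double> values)
        : nrow_(nrow), ncol_(ncol), values_(std::move(values)) {
        assert(values_.size() == nrow_ * ncol_);
    }
    std::size_t nrow() const override { return nrow_; }
    std::size_t ncol() const override { return ncol_; }
    std::vector<double> column(std::size_t j) const override {
        assert(j < ncol_);
        const auto first = values_.begin() + static_cast<std::ptrdiff_t>(j * nrow_);
        return std::vector<double>(first, first + static_cast<std::ptrdiff_t>(nrow_));
    }
private:
    std::size_t nrow_, ncol_;
    std::vector<double> values_;
};

// Compressed sparse column storage: only non-zero entries are kept.
class SparseView : public MatrixView {
public:
    SparseView(std::size_t nrow, std::size_t ncol, std::vector<std::size_t> row_index,
               std::vector<std::size_t> col_ptr, std::vector<double> values)
        : nrow_(nrow), ncol_(ncol), row_index_(std::move(row_index)),
          col_ptr_(std::move(col_ptr)), values_(std::move(values)) {
        assert(col_ptr_.size() == ncol_ + 1);
        assert(row_index_.size() == values_.size());
    }
    std::size_t nrow() const override { return nrow_; }
    std::size_t ncol() const override { return ncol_; }
    std::vector<double> column(std::size_t j) const override {
        assert(j < ncol_);
        std::vector<double> out(nrow_, 0.0);  // zeros are implicit in the storage
        for (std::size_t k = col_ptr_[j]; k < col_ptr_[j + 1]; ++k)
            out[row_index_[k]] = values_[k];
        return out;
    }
private:
    std::size_t nrow_, ncol_;
    std::vector<std::size_t> row_index_, col_ptr_;
    std::vector<double> values_;
};

// Analysis code written once against the interface, unaware of the storage.
double column_total(const MatrixView& m, std::size_t j) {
    double total = 0.0;
    for (double v : m.column(j)) total += v;
    return total;
}
```

A disk-backed class would implement the same three methods by reading from its file, and `column_total` would not change.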
> > >
> > > Anyway, just throwing some thoughts out there. Any comments
> > > appreciated.
> > >
> > > Cheers,
> > >
> > > Aaron
> > >
> > > [[alternative HTML version deleted]]
> > >
> > >
> > > _______________________________________________
> > > Bioc-devel at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/bioc-devel