[Bioc-devel] any interest in a BiocMatrix core package?

Wed Nov 1 22:13:54 CET 2017

Probably way easier to add the generics to the Matrix package and everyone
just depends on that.

On Wed, Nov 1, 2017 at 1:59 PM, Hervé Pagès <hpages at fredhutch.org> wrote:

> That's probably a good idea but a clean solution would need to
> involve all players, including the Matrix package. Right now there
> are conflicts for some S4 generics defined in Matrix and in
> BiocGenerics (e.g. rowSums). I'm not sure that moving rowSums from
> BiocGenerics to a new MatrixGenerics package would address this.
> Unless MatrixGenerics is on CRAN and Matrix depends on it ;-)
>
> How likely is this to happen?
>
> H.
>
>
> On 11/01/2017 01:44 PM, Peter Hickey wrote:
>
>> I think that's a good idea, Kylie.
>> Pete (DelayedMatrixStats developer)
>>
>> On Thu., 2 Nov. 2017, 6:09 am Kasper Daniel Hansen
>> <kasperdanielhansen at gmail.com> wrote:
>>
>>> I think it makes sense. A lot of sense. Might be useful to involve Henrik
>>> (matrixStats) as well.
>>>
>>> Who are the players, apart from DelayedArray/DelayedMatrixStats and
>>> matter?
>>> (and some very old stuff in Biobase which should really be deprecated in
>>> favor of matrixStats).
>>>
>>> Best,
>>> Kasper
>>>
>>> On Wed, Nov 1, 2017 at 3:03 PM, Bemis, Kylie <k.bemis at northeastern.edu>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> To continue a variant of this conversation, with the latest BioC
>>>> release, we now have quite a few packages that are implementing various
>>>> matrix-related S4 generic functions, many of them relying on matrixStats
>>>> as a template.
>>>>
>>>> I was wondering if there is any interest or intention to create a common
>>>> MatrixGenerics/ArrayGenerics package on which we can depend to import
>>>> the relevant S4 generic functions. Although BiocGenerics has a few like
>>>> ‘rowSums()’ and ‘colMeans()’, etc., there are many more that are
>>>> implemented across ‘DelayedArray’, ‘DelayedMatrixStats’, my own package
>>>> ‘matter’, etc., including ‘apply()’, ‘rowSds()’, ‘colVars()’, and so
>>>> forth.
>>>>
>>>> It would be nice to have a single package with minimal additional
>>>> dependencies (a la BiocGenerics) where we could import the various S4
>>>> generics and avoid unwanted namespace collisions.
>>>>
>>>> Have there been any thoughts on this?
>>>>
>>>> Many thanks,
>>>> Kylie
>>>>
>>>> ~~~
>>>> Kylie Ariel Bemis
>>>> Future Faculty Fellow
>>>> College of Computer and Information Science
>>>> Northeastern University
>>>> kuwisdelu.github.io
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen
>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>
>>>> On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey
>>>> <stvjc at channing.harvard.edu> wrote:
>>>>
>>>> On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen
>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>
>>>> Some comment on Aaron's stuff
>>>>
>>>> One possibility for doing things like this is if your code can be done
>>>> in C++ using a subset of rows or columns.  That can sometimes give the
>>>> necessary speed up.  What I mean is this
>>>>
>>>> Say you can safely process 1000 cells (not matrix cells, but biological
>>>> cells, aka columns) at a time in RAM
>>>>
>>>> iterate in R:
>>>>    get chunk i containing 1000 cells from the backend data storage
>>>>    do something on this sub matrix where everything is in a normal
>>>>    matrix and you just use C++
>>>>    write results out to whatever backend you're using
>>>>
>>>> Then, with a million cells you iterate over 1000 chunks in R.  And you
>>>> don't need to "touch" the full dataset which can be stored on an
>>>> arbitrary backend.
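
A minimal C++ sketch of the chunked loop described above, under stated
assumptions: `ChunkReader` and `chunked_col_sums` are hypothetical names, the
reader stands in for whatever backend supplies the columns, and a per-column
sum stands in for the real per-chunk computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical backend reader: returns columns [first, last) of the full
// matrix as one dense column-major block with `nrow` values per column.
using ChunkReader =
    std::function<std::vector<double>(std::size_t first, std::size_t last)>;

// Per-column sums computed one chunk at a time, so at most `chunk_cols`
// columns are held in memory at once.
std::vector<double> chunked_col_sums(const ChunkReader& read_chunk,
                                     std::size_t nrow, std::size_t ncol,
                                     std::size_t chunk_cols) {
    assert(chunk_cols > 0);
    std::vector<double> sums;
    sums.reserve(ncol);
    for (std::size_t first = 0; first < ncol; first += chunk_cols) {
        const std::size_t last = std::min(first + chunk_cols, ncol);
        // "get chunk i from the backend data storage"
        const std::vector<double> block = read_chunk(first, last);
        assert(block.size() == nrow * (last - first));
        // "do something on this sub matrix" (here: a plain column sum)
        for (std::size_t j = 0; j < last - first; ++j) {
            double s = 0.0;
            for (std::size_t i = 0; i < nrow; ++i) s += block[j * nrow + i];
            sums.push_back(s);  // stands in for "write results out"
        }
    }
    return sums;
}
```

With one million columns and chunk_cols = 1000 the outer loop runs 1000
times, and only one nrow x 1000 block is in memory at any point.
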
>>>>
>>>> you "touch" it, but you never ingest the whole thing at any time, is
>>>> that what you mean?
>>>>
>>>> Yes, you load the chunk into RAM and then just deal with it.
>>>>
>>>> Think of doing 10^10 linear models.  If this was 10^6 I would just use
>>>> lmFit.  But 10^10 doesn't fit into memory.  So I load 10^7 into memory,
>>>> run lmFit, store results, redo.  This is bound to be much more efficient
>>>> than loading a single row into memory and doing lm 10^10 times, because
>>>> lmFit is written to do many linear models at the same time.
>>>>
>>>> I am suggesting that this is a potential general strategy.
>>>>
>>>>
>>>> And this approach could be run even (potentially) with different chunks
>>>> on different nodes.
>>>>
>>>> that seems to me to be an important if not essential desideratum.
>>>>
>>>> what then is the role of C++?  extracting a chunk?  preexisting
>>>> utilities?
>>>>
>>>> When I say C++ I just mean write an efficient implementation that works
>>>> on a chunk, like lmFit.  It is true that anything that works on a chunk
>>>> will work on a single row/column (like lmFit) but there are possibilities
>>>> for optimization when you work at the chunk level.
>>>>
>>>> Obviously not all computations can be done chunkwise.  But for those
>>>> that can, this is a strategy which is independent of the data backend.
>>>>
>>>> I wonder whether this "obviously not" needs to be rethought.  Algorithms
>>>> that are implemented to work with data holistically may need
>>>> to be reexpressed so that they can succeed with chunkwise access.  Is
>>>> this a new mindset needed for holist developers, or can the
>>>> effective data decompositions occur autonomously?
>>>>
>>>> Well, I would say it is obvious that not all computations can be done
>>>> chunkwise.  But of course, in the limit of extremely large data,
>>>> algorithms which need to cycle over everything no longer scale.  So in
>>>> that case all practical computations can be done chunkwise, out of
>>>> necessity.  For single cell right now, where it is just millions of cells
>>>> on the horizon, people will think that they can get "standard" holistic
>>>> approaches to work (and that is probably true).  If they had a billion
>>>> cells they probably wouldn't think about that.
>>>>
>>>> Kasper
>>>>
>>>> If you need direct access to the data in the backend in C++  it will be
>>>> extremely backend dependent what is fast and how to do it.  That doesn't
>>>> mean we shouldn't do it though.
>>>>
>>>> Best,
>>>> Kasper
>>>>
>>>>
>>>>
>>>> On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey
>>>> <stvjc at channing.harvard.edu> wrote:
>>>>
>>>> Kylie, thanks for reminding us of matter -- I saw you speak about this
>>>> at the first Bioconductor Boston Meetup, but it went like lightning.
>>>> For developers contemplating an approach to representing high-volume
>>>> rectangular data, where there is no dominant legacy format, it is
>>>> natural to wonder whether HDF5 would be adequate, and, further, to
>>>> wonder how to demonstrate that it is or is not dominated by some other
>>>> approach for a given set of tasks.  Should we devise a set of
>>>> bioinformatic benchmark problems to foster comparison and informed
>>>> decision-making?  @becker.gabe: is ALTREP far enough along that one
>>>> could contemplate benchmarking with it?
>>>>
>>>> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie
>>>> <k.bemis at northeastern.edu> wrote:
>>>>
>>>>> It’s not there yet, but I plan to expose a C++ API for my disk-backed
>>>>> matrix objects in the next version of my ‘matter’ package.
>>>>>
>>>>> It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
>>>>> objects at the R level, especially if using a frontend like
>>>>> DelayedArray on top of them, but it would be nice to have a common C++
>>>>> API that I could hook into as well (a la Rcpp), so new C/C++ could be
>>>>> re-used across various backends more easily.
>>>>>
>>>>> Kylie
>>>>>
>>>>> ~~~
>>>>> Kylie Ariel Bemis
>>>>> Future Faculty Fellow
>>>>> College of Computer and Information Science
>>>>> Northeastern University
>>>>> kuwisdelu.github.io
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Feb 24, 2017, at 4:50 PM, Aaron Lun <alun at wehi.edu.au> wrote:
>>>>>
>>>>> It's a good place to start, though it would be very handy to have a
>>>>> C(++) API that can be linked against. I'm not sure how much work that
>>>>> would entail but it would give downstream developers a lot more options.
>>>>> Sort of like how we can link to Rhtslib, which speeds up a lot of BAM
>>>>> file processing, instead of just relying on Rsamtools.
>>>>>
>>>>>
>>>>> -Aaron
>>>>>
>>>>> ________________________________
>>>>> From: Tim Triche, Jr. <tim.triche at gmail.com>
>>>>> Sent: Saturday, 25 February 2017 8:34:58 AM
>>>>> To: Aaron Lun
>>>>> Cc: bioc-devel at r-project.org
>>>>> Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>>>>>
>>>>> yes
>>>>>
>>>>> the DelayedArray framework that handles HDF5Array, etc. seems like the
>>>>> right choice?
>>>>>
>>>>> --t
>>>>>
>>>>> On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <alun at wehi.edu.au> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I just attended the Human Cell Atlas meeting in Stanford, and people
>>>>> were talking about gene expression matrices for >1 million cells. If we
>>>>> assume that we can get non-zero expression profiles for ~5000 genes,
>>>>> we'd be talking about a 5000 x 1 million matrix for the raw count data.
>>>>> This would be 20-40 GB in size, which would clearly benefit from sparse
>>>>> (via Matrix) or disk-backed representations (bigmatrix, BufferedMatrix,
>>>>> rhdf5, etc.).
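
The 20-40 GB range is consistent with a dense 5000 x 1 million matrix stored
at 4 or 8 bytes per entry (the sizes of R's integer and double types); that
reading is an assumption, not something stated in the message. A small sketch
of the arithmetic, with `dense_matrix_gb` as a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>

// Size in gigabytes (taking 1 GB as 10^9 bytes) of a dense nrow x ncol
// matrix whose entries each take `bytes_per_value` bytes. Dimensions are
// doubles so that products like 5000 * 1e6 cannot overflow an integer type.
double dense_matrix_gb(double nrow, double ncol, std::size_t bytes_per_value) {
    assert(nrow >= 0 && ncol >= 0);
    return nrow * ncol * static_cast<double>(bytes_per_value) / 1e9;
}
```

dense_matrix_gb(5000, 1e6, 4) gives 20 and dense_matrix_gb(5000, 1e6, 8)
gives 40; a sparse or disk-backed representation avoids holding all 5 x 10^9
entries in RAM.
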
>>>>
>>>>> I'm wondering whether there is any appetite amongst us for making a
>>>>> consistent BioC API to handle these matrices, sort of like what
>>>>> BiocParallel does for multicore and snow. It goes without saying that
>>>>> the different matrix representations should have consistent functions
>>>>> at the R level (rbind/cbind, etc.) but it would also be nice to have an
>>>>> integrated C/C++ API (accessible via LinkingTo). There are many
>>>>> non-trivial things that can be done with this type of data, and it is
>>>>> often faster and more memory efficient to do these complex operations
>>>>> in compiled code.
>>>>>
>>>>> I was thinking of something where you could supply any supported matrix
>>>>> representation to a registered function via .Call; the C++ constructor
>>>>> would recognise the type of matrix during class instantiation; and
>>>>> operations (row/column/random read access, also possibly various ways
>>>>> of writing a matrix) would be overloaded and behave as required for the
>>>>> class. Only the implementation of the API would need to care about the
>>>>> nitty gritty of each representation, and we would all be free to write
>>>>> code that actually does the interesting analytical stuff.
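
One way the proposal above could look in plain C++, as a rough sketch only:
`AnyMatrix`, `DenseMatrix`, `SparseMatrix` and `total` are hypothetical names,
not an existing API, and the R side (inspecting the object passed through
.Call to pick the right class) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical common interface: analysis code sees only this class.
class AnyMatrix {
public:
    virtual ~AnyMatrix() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    virtual std::vector<double> get_col(std::size_t j) const = 0;
};

// Dense column-major storage.
class DenseMatrix : public AnyMatrix {
public:
    DenseMatrix(std::size_t nrow, std::size_t ncol, std::vector<double> values)
        : nrow_(nrow), ncol_(ncol), values_(std::move(values)) {
        assert(values_.size() == nrow_ * ncol_);
    }
    std::size_t nrow() const override { return nrow_; }
    std::size_t ncol() const override { return ncol_; }
    std::vector<double> get_col(std::size_t j) const override {
        assert(j < ncol_);
        const auto first =
            values_.begin() + static_cast<std::ptrdiff_t>(j * nrow_);
        return std::vector<double>(
            first, first + static_cast<std::ptrdiff_t>(nrow_));
    }
private:
    std::size_t nrow_, ncol_;
    std::vector<double> values_;
};

// Compressed sparse column storage (the layout used by Matrix's dgCMatrix).
class SparseMatrix : public AnyMatrix {
public:
    SparseMatrix(std::size_t nrow, std::vector<std::size_t> col_ptr,
                 std::vector<std::size_t> row_idx, std::vector<double> values)
        : nrow_(nrow), col_ptr_(std::move(col_ptr)),
          row_idx_(std::move(row_idx)), values_(std::move(values)) {
        assert(!col_ptr_.empty() && row_idx_.size() == values_.size());
    }
    std::size_t nrow() const override { return nrow_; }
    std::size_t ncol() const override { return col_ptr_.size() - 1; }
    std::vector<double> get_col(std::size_t j) const override {
        assert(j < ncol());
        std::vector<double> out(nrow_, 0.0);  // zeros are implicit in storage
        for (std::size_t k = col_ptr_[j]; k < col_ptr_[j + 1]; ++k)
            out[row_idx_[k]] = values_[k];
        return out;
    }
private:
    std::size_t nrow_;
    std::vector<std::size_t> col_ptr_, row_idx_;
    std::vector<double> values_;
};

// Analysis code written once against the interface, unaware of the backend.
double total(const AnyMatrix& m) {
    double s = 0.0;
    for (std::size_t j = 0; j < m.ncol(); ++j)
        for (double v : m.get_col(j)) s += v;
    return s;
}
```

The same `total()` works unchanged on either class; a disk-backed backend
would be a third subclass whose `get_col()` reads from file.
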
>>>>>
>>>>> Anyway, just throwing some thoughts out there. Any comments
>>>>> appreciated.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Aaron
>>>>>
>>>>>         [[alternative HTML version deleted]]
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>

	[[alternative HTML version deleted]]