[Bioc-devel] any interest in a BiocMatrix core package?
Martin Morgan
martin.morgan at roswellpark.org
Wed Nov 1 22:26:23 CET 2017
On 11/01/2017 05:15 PM, Bemis, Kylie wrote:
> Yes, the ideal solution seems rather unlikely, but I feel like there must be a solution better than the current situation.
>
> I’d like to implement some more of the functionality from matrixStats for ‘matter’ matrices, but importing DelayedArray and DelayedMatrixStats solely for the generic functions seems like a bit much. Is that the best thing to do though?
It would be very helpful to have this on CRAN, and for matrixStats (and
Matrix) to play along.
Martin
>
> Any suggestions?
>
> -Kylie
>
>> On Nov 1, 2017, at 4:59 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>
>> That's probably a good idea but a clean solution would need to
>> involve all players, including the Matrix package. Right now there
>> are conflicts for some S4 generics defined in Matrix and in
>> BiocGenerics (e.g. rowSums). I'm not sure that moving rowSums from
>> BiocGenerics to a new MatrixGenerics package would address this.
>> Unless MatrixGenerics is on CRAN and Matrix depends on it ;-)
>>
>> How likely is this to happen?
>>
>> H.
>>
>> On 11/01/2017 01:44 PM, Peter Hickey wrote:
>>> I think that's a good idea, Kylie.
>>> Pete (DelayedMatrixStats developer)
>>>
>>> On Thu., 2 Nov. 2017, 6:09 am Kasper Daniel Hansen, <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>>> I think it makes sense. A lot of sense. Might be useful to involve Henrik
>>>> (matrixStats) as well.
>>>>
>>>> Who are the players, apart from DelayedArray/DelayedMatrixStats and matter?
>>>> (and some very old stuff in Biobase which should really be deprecated in
>>>> favor of matrixStats).
>>>>
>>>> Best,
>>>> Kasper
>>>>
>>>> On Wed, Nov 1, 2017 at 3:03 PM, Bemis, Kylie <k.bemis at northeastern.edu>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> To continue a variant of this conversation, with the latest BioC release,
>>>>> we now have quite a few packages that are implementing various
>>>>> matrix-related S4 generic functions, many of them relying on matrixStats
>>>>> as a template.
>>>>>
>>>>> I was wondering if there is any interest or intention to create a common
>>>>> MatrixGenerics/ArrayGenerics package on which we can depend to import the
>>>>> relevant S4 generic functions. Although BiocGenerics has a few like
>>>>> ‘rowSums()’ and ‘colMeans()’, there are many more that are
>>>>> implemented across ‘DelayedArray’, ‘DelayedMatrixStats’, my own package
>>>>> ‘matter’, etc., including ‘apply()’, ‘rowSds()’, ‘colVars()’, and so
>>>>> forth.
>>>>>
>>>>> It would be nice to have a single package with minimal additional
>>>>> dependencies (a la BiocGenerics) where we could import the various S4
>>>>> generics and avoid unwanted namespace collisions.
>>>>>
>>>>> Have there been any thoughts on this?
>>>>>
>>>>> Many thanks,
>>>>> Kylie
>>>>>
>>>>> ~~~
>>>>> Kylie Ariel Bemis
>>>>> Future Faculty Fellow
>>>>> College of Computer and Information Science
>>>>> Northeastern University
>>>>> kuwisdelu.github.io <https://kuwisdelu.github.io>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>>>>>
>>>>> On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>>>>>
>>>>> On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>>>>> Some comments on Aaron's stuff:
>>>>>
>>>>> One possibility for doing things like this is if your code can be done in
>>>>> C++ using a subset of rows or columns. That can sometimes give the
>>>>> necessary speed-up. What I mean is this:
>>>>>
>>>>> Say you can safely process 1000 cells (not matrix cells, but biological
>>>>> cells, aka columns) at a time in RAM.
>>>>>
>>>>> iterate in R:
>>>>>   1. get chunk i containing 1000 cells from the backend data storage
>>>>>   2. do something on this submatrix, where everything is in a normal
>>>>>      matrix and you just use C++
>>>>>   3. write results out to whatever backend you're using
>>>>>
>>>>> Then, with a million cells you iterate over 1000 chunks in R. And you
>>>>> don't need to "touch" the full dataset, which can be stored on an
>>>>> arbitrary backend.
>>>>>
>>>>> you "touch" it, but you never ingest the whole thing at any time, is that
>>>>> what you mean?
>>>>>
>>>>> Yes, you load the chunk into RAM and then just deal with it.
>>>>>
>>>>> Think of doing 10^10 linear models. If this was 10^6 I would just use
>>>>> lmFit. But 10^10 doesn't fit into memory. So I load 10^7 into memory,
>>>>> run lmFit, store results, redo. This is bound to be much more efficient
>>>>> than loading a single row into memory and doing lm 10^10 times, because
>>>>> lmFit is written to do many linear models at the same time.
>>>>>
>>>>> I am suggesting that this is a potential general strategy.
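A minimal C++ sketch of the chunkwise strategy described above: load a block of columns, process it as an ordinary in-memory matrix, store the result, repeat. The names `Chunk`, `ChunkReader`, `chunked_col_sums` and `toy_reader` are hypothetical and not part of any package mentioned in this thread; column sums stand in for a real per-chunk routine such as lmFit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A chunk is a dense, column-major block of nrow x ncol values held in RAM.
struct Chunk {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<double> values;  // column-major, size nrow * ncol
};

// Hypothetical backend hook: returns columns [first, first + count).
// A real backend (HDF5, matter, ...) would read from disk here.
using ChunkReader = std::function<Chunk(std::size_t first, std::size_t count)>;

// Per-column sums, holding at most chunk_cols columns in memory at once.
std::vector<double> chunked_col_sums(const ChunkReader& read_chunk,
                                     std::size_t total_cols,
                                     std::size_t chunk_cols) {
    assert(chunk_cols > 0);
    std::vector<double> sums;
    sums.reserve(total_cols);
    for (std::size_t first = 0; first < total_cols; first += chunk_cols) {
        const std::size_t count = std::min(chunk_cols, total_cols - first);
        const Chunk chunk = read_chunk(first, count);    // load one chunk
        for (std::size_t j = 0; j < chunk.ncol; ++j) {   // process it in RAM
            double s = 0.0;
            for (std::size_t i = 0; i < chunk.nrow; ++i) {
                s += chunk.values[j * chunk.nrow + i];
            }
            sums.push_back(s);                           // store the result
        }
    }
    return sums;
}

// Toy in-memory "backend" with 3 rows: entry (i, j) equals i + 10 * j.
Chunk toy_reader(std::size_t first, std::size_t count) {
    const std::size_t nrow = 3;
    Chunk c{nrow, count, std::vector<double>(nrow * count)};
    for (std::size_t j = 0; j < count; ++j) {
        for (std::size_t i = 0; i < nrow; ++i) {
            c.values[j * nrow + i] = static_cast<double>(i + 10 * (first + j));
        }
    }
    return c;
}
```

Calling `chunked_col_sums(toy_reader, 5, 2)` walks five columns in chunks of at most two. Because the only backend-specific piece is the reader, the same loop could in principle be handed different column ranges on different nodes.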
>>>>>
>>>>>
>>>>> And this approach could be run even (potentially) with different chunks
>>>>> on different nodes.
>>>>>
>>>>> that seems to me to be an important if not essential desideratum.
>>>>>
>>>>> what then is the role of C++? extracting a chunk? preexisting
>>>>> utilities?
>>>>>
>>>>> When I say C++ I just mean write an efficient implementation that works
>>>>> on a chunk, like lmFit. It is true that anything that works on a chunk
>>>>> will work on a single row/column (like lmFit) but there are possibilities
>>>>> for optimization when you work at the chunk level.
>>>>>
>>>>> Obviously not all computations can be done chunkwise. But for those that
>>>>> can, this is a strategy which is independent of the data backend.
>>>>>
>>>>> I wonder whether this "obviously not" needs to be rethought. Algorithms
>>>>> that are implemented to work with data holistically may need
>>>>> to be reexpressed so that they can succeed with chunkwise access. Is
>>>>> this a new mindset needed for holist developers, or can the
>>>>> effective data decompositions occur autonomously?
>>>>>
>>>>> Well, I would say it is obvious that not all computations can be done
>>>>> chunkwise. But of course, in the limit of extremely large data,
>>>>> algorithms that need to cycle over everything no longer scale. So in that
>>>>> case all practical computations can be done chunkwise, out of necessity.
>>>>> For single cell right now, where it is just millions of cells on the
>>>>> horizon, people will think that they can get "standard" holistic
>>>>> approaches to work (and that is probably true). If they had a billion
>>>>> cells they probably wouldn't think about that.
>>>>>
>>>>> Kasper
>>>>>
>>>>> If you need direct access to the data in the backend in C++ it will be
>>>>> extremely backend dependent what is fast and how to do it. That doesn't
>>>>> mean we shouldn't do it though.
>>>>>
>>>>> Best,
>>>>> Kasper
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>>>>> Kylie, thanks for reminding us of matter -- I saw you speak about this at
>>>>> the first Bioconductor Boston Meetup, but it
>>>>> went like lightning. For developers contemplating an approach to
>>>>> representing high-volume rectangular data,
>>>>> where there is no dominant legacy format, it is natural to wonder whether
>>>>> HDF5 would be adequate, and,
>>>>> further, to wonder how to demonstrate that it is or is not dominated by
>>>>> some other approach for a given set
>>>>> of tasks. Should we devise a set of bioinformatic benchmark problems to
>>>>> foster comparison and informed
>>>>> decisionmaking? @becker.gabe: is ALTREP far enough along that one could
>>>>> contemplate benchmarking with it?
>>>>>
>>>>> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie <k.bemis at northeastern.edu> wrote:
>>>>>
>>>>>> It’s not there yet, but I plan to expose a C++ API for my disk-backed
>>>>>> matrix objects in the next version of my ‘matter’ package.
>>>>>>
>>>>>> It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
>>>>>> objects at the R level, especially if using a frontend like
>>>>>> DelayedArray on top of them, but it would be nice to have a common C++
>>>>>> API that I could hook into as well (a la Rcpp), so new C/C++ could be
>>>>>> re-used across various backends more easily.
>>>>>>
>>>>>> Kylie
>>>>>>
>>>>>> ~~~
>>>>>> Kylie Ariel Bemis
>>>>>> Future Faculty Fellow
>>>>>> College of Computer and Information Science
>>>>>> Northeastern University
>>>>>> kuwisdelu.github.io <https://kuwisdelu.github.io>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Feb 24, 2017, at 4:50 PM, Aaron Lun <alun at wehi.edu.au> wrote:
>>>>>>
>>>>>> It's a good place to start, though it would be very handy to have a
>>>>>> C(++) API that can be linked against. I'm not sure how much work that
>>>>>> would entail but it would give downstream developers a lot more options.
>>>>>> Sort of like how we can link to Rhtslib, which speeds up a lot of BAM
>>>>>> file processing, instead of just relying on Rsamtools.
>>>>>>
>>>>>>
>>>>>> -Aaron
>>>>>>
>>>>>> ________________________________
>>>>>> From: Tim Triche, Jr. <tim.triche at gmail.com>
>>>>>> Sent: Saturday, 25 February 2017 8:34:58 AM
>>>>>> To: Aaron Lun
>>>>>> Cc: bioc-devel at r-project.org
>>>>>> Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>>>>>>
>>>>>> yes
>>>>>>
>>>>>> the DelayedArray framework that handles HDF5Array, etc. seems like the
>>>>>> right choice?
>>>>>>
>>>>>> --t
>>>>>>
>>>>>> On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <alun at wehi.edu.au> wrote:
>>>>>> Hi everyone,
>>>>>>
>>>>>> I just attended the Human Cell Atlas meeting in Stanford, and people
>>>>>> were talking about gene expression matrices for >1 million cells. If we
>>>>>> assume that we can get non-zero expression profiles for ~5000 genes, we'd
>>>>>> be talking about a 5000 x 1 million matrix for the raw count data. This
>>>>>> would be 20-40 GB in size, which would clearly benefit from sparse (via
>>>>>> Matrix) or disk-backed representations (bigmemory, BufferedMatrix, rhdf5,
>>>>>> etc.).
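The 20-40 GB figure above can be checked with a short calculation. This is a sketch under one assumption that the message does not state: that the lower bound corresponds to 4-byte integers (R's integer type) and the upper bound to 8-byte doubles (R's numeric type), with a dense layout and decimal gigabytes. The helper names `dense_bytes` and `to_gb` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a dense nrow x ncol matrix with a fixed element size.
std::uint64_t dense_bytes(std::uint64_t nrow, std::uint64_t ncol,
                          std::uint64_t bytes_per_value) {
    assert(bytes_per_value > 0);
    return nrow * ncol * bytes_per_value;
}

// Convert a byte count to decimal gigabytes (1 GB = 10^9 bytes).
double to_gb(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1e9;
}
```

With 5000 genes and 1,000,000 cells there are 5 x 10^9 entries, so `to_gb(dense_bytes(5000, 1000000, 4))` gives 20 and `to_gb(dense_bytes(5000, 1000000, 8))` gives 40. A sparse representation would need far less if most counts are zero.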
>>>>>>
>>>>>> I'm wondering whether there is any appetite amongst us for making a
>>>>>> consistent BioC API to handle these matrices, sort of like what
>>>>>> BiocParallel does for multicore and snow. It goes without saying that
>>>>>> the different matrix representations should have consistent functions at
>>>>>> the R level (rbind/cbind, etc.) but it would also be nice to have an
>>>>>> integrated C/C++ API (accessible via LinkingTo). There are many
>>>>>> non-trivial things that can be done with this type of data, and it is
>>>>>> often faster and more memory efficient to do these complex operations in
>>>>>> compiled code.
>>>>>>
>>>>>> I was thinking of something where you could supply any supported matrix
>>>>>> representation to a registered function via .Call; the C++ constructor
>>>>>> would recognise the type of matrix during class instantiation; and
>>>>>> operations (row/column/random read access, also possibly various ways
>>>>>> of writing a matrix) would be overloaded and behave as required for the
>>>>>> class. Only the implementation of the API would need to care about the
>>>>>> nitty-gritty of each representation, and we would all be free to write
>>>>>> code that actually does the interesting analytical stuff.
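One way the proposal above could look in C++ is an abstract interface with one subclass per representation, so analysis code is written once against the interface. This is only an illustrative sketch: `AnyMatrix`, `DenseMatrix`, `SparseMatrix` and `col_sums` are hypothetical names, not an existing Bioconductor API, and the step where a constructor inspects the R object passed through .Call to pick the backend is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical common interface: analysis code only ever sees this class.
class AnyMatrix {
public:
    virtual ~AnyMatrix() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Copy column j into a dense vector of length nrow().
    virtual std::vector<double> get_col(std::size_t j) const = 0;
};

// Backend 1: ordinary dense column-major storage.
class DenseMatrix : public AnyMatrix {
public:
    DenseMatrix(std::size_t nr, std::size_t nc, std::vector<double> x)
        : nr_(nr), nc_(nc), x_(std::move(x)) { assert(x_.size() == nr_ * nc_); }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    std::vector<double> get_col(std::size_t j) const override {
        return std::vector<double>(x_.begin() + j * nr_,
                                   x_.begin() + (j + 1) * nr_);
    }
private:
    std::size_t nr_, nc_;
    std::vector<double> x_;
};

// Backend 2: compressed sparse column storage (column pointers p, row
// indices i, non-zero values x).
class SparseMatrix : public AnyMatrix {
public:
    SparseMatrix(std::size_t nr, std::vector<std::size_t> p,
                 std::vector<std::size_t> i, std::vector<double> x)
        : nr_(nr), p_(std::move(p)), i_(std::move(i)), x_(std::move(x)) {
        assert(!p_.empty() && p_.back() == x_.size() && i_.size() == x_.size());
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return p_.size() - 1; }
    std::vector<double> get_col(std::size_t j) const override {
        std::vector<double> out(nr_, 0.0);  // zeros wherever nothing is stored
        for (std::size_t k = p_[j]; k < p_[j + 1]; ++k) out[i_[k]] = x_[k];
        return out;
    }
private:
    std::size_t nr_;
    std::vector<std::size_t> p_, i_;
    std::vector<double> x_;
};

// The analysis code, written once, with no knowledge of the backend.
std::vector<double> col_sums(const AnyMatrix& m) {
    std::vector<double> sums(m.ncol(), 0.0);
    for (std::size_t j = 0; j < m.ncol(); ++j) {
        for (double v : m.get_col(j)) sums[j] += v;
    }
    return sums;
}
```

`col_sums` accepts either backend unchanged; a disk-backed class would only need to implement the same three methods. Virtual dispatch per column is a design choice made here for brevity, and a real API might prefer block access to reduce call overhead.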
>>>>>>
>>>>>> Anyway, just throwing some thoughts out there. Any comments
>>>>>> appreciated.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Aaron
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319