[Bioc-devel] transitioning scater/scran to SingleCellExperiment

Mon Aug 7 19:41:41 CEST 2017

> Interestingly the design decisions converged very well with scanpy 
> <https://github.com/theislab/scanpy#readme>’s AnnData 
> <https://www.pydoc.io/pypi/scanpy-0.2.3/autoapi/data_structs/ann_data/index.html#data_structs.ann_data.AnnData> 
> class that I helped Alex design. Scanpy makes heavy use of HDF5 
> serialization. I think we should quickly converge on a serialization 
> format (keys and so on) so that |AnnData| and |SingleCellExperiment| can 
> have interoperability via HDF5!

Yes, this would be quite interesting. As I mentioned, scater has some 
support for HDF5 serialization, so that's one place to start.

> The only point of criticism is that you, while staying specific to 
> single cell data, named the dimensions “rows” and “columns” instead of 
> e.g. “samples” and “variables”. Alex and I came to the conclusion that 
> |ExpressionSet|’s way of returning a named vector for |dims| is a good 
> idea, and having the dimensions named for their roles reduces confusion.

I guess this would be a question for the SummarizedExperiment 
developers, though personally, I never liked ExpressionSet's inclination 
to slap names on everything.

> 1. destiny accepts either an expression matrix or a distance matrix
>    (both with optional metadata).
> 
> Currently the signature is this:
> 
> DiffusionMap(data = ExpressionSet | data.frame | matrix | Matrix,
>              distance = NULL | "euclidean" | "cosine" | "rankcor")
> DiffusionMap(data = NULL | data.frame,  # Metadata
>              distance = matrix | dist | symmetricMatrix)
> 
> The idea is that both when providing expressions and when providing a 
> distance matrix, you should be able to provide metadata. I’m not super 
> happy with my approach, since the current methods of providing metadata 
> differ.
> 
> However, |ExpressionSet| and |SingleCellExperiment| are both specific 
> for expression data. I think neither can hold |dist| objects as data.
> 
> Is it valid and a good idea to store neither counts nor exprs, but e.g. 
> |SingleCellExperiment(assays = list(dists = some_mat))|? It wouldn’t be 
> sliced properly, for example, and it being symmetric would mean that 
> column and row metadata is the same…

It probably wouldn't be a good idea to store distances as expression 
matrices. However, if there is a need for it, we can add a new slot for 
distance matrices. I think SC3 has a similar requirement, so perhaps 
this would be more generally useful than I first thought. You can post 
an issue on the github repository to remind Davide or me to do it.
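
To make the slicing concern concrete, here is a minimal base-R sketch. It uses toy data and made-up names (`expr`, `d`, `keep`) and involves no SingleCellExperiment code. Subsetting by column only, which is roughly what an assay sees when cells are subsetted, leaves a non-square matrix, whereas a distance matrix needs the same subset applied to both dimensions.

```r
# Toy expression matrix: 3 genes (rows) x 4 cells (columns).
expr <- matrix(c(1, 0, 2,
                 0, 3, 1,
                 4, 1, 0,
                 2, 2, 2),
               nrow = 3,
               dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Cell-by-cell Euclidean distances; dist() works on rows, hence t().
d <- as.matrix(dist(t(expr)))

keep <- c("cell1", "cell3")

# Column-only subsetting (what an assay would see): no longer square.
by_column <- d[, keep]

# A distance matrix needs the same subset on both dimensions.
by_both <- d[keep, keep]

dim(by_column)  # 4 rows, 2 columns
dim(by_both)    # 2 rows, 2 columns
```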

> Is it a good idea to require assays to have certain names (e.g. 
> “exprs” or “dists” here)?

I have thought about putting in a set of recommended assay names, along 
with various methods for them:

- counts: counts, duh
- norm_counts: "normalized" values on the same scale as the counts
- log_counts: log-normalized counts (plus pseudo-count).
- cpm, tpm, fpkm: what it says

The idea is to encourage developers to store assay entries that will 
have a reasonably consistent interpretation across packages. For this 
reason, I'm not putting in "exprs", which could mean anything really.
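
As a rough illustration of how the recommended names might relate to each other, here is a base-R sketch on a toy matrix. The library-size-based size factors, the pseudo-count of 1 and the log base are assumptions made for the example, not decisions that have been made for the class, and a plain named list stands in for the assays of an SCE object.

```r
# Toy count matrix: 3 genes x 2 cells.
counts <- matrix(c(10, 0, 30,
                    5, 5, 10),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("cellA", "cellB")))

lib_size <- colSums(counts)                 # cellA = 40, cellB = 20

# Assumption: library-size factors, centred so their mean is 1.
size_factors <- lib_size / mean(lib_size)

norm_counts <- sweep(counts, 2, size_factors, "/")
log_counts  <- log2(norm_counts + 1)        # pseudo-count of 1 is an assumption
cpm         <- sweep(counts, 2, lib_size / 1e6, "/")

# A plain named list standing in for the assays of an SCE object.
assay_list <- list(counts = counts, norm_counts = norm_counts,
                   log_counts = log_counts, cpm = cpm)
names(assay_list)
```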

> 2. The |reducedDim| methods would be able to store and retrieve
>    diffusion components in a |SingleCellExperiment|, while destiny’s
>    |dataset| method stores the original data used to create a
>    |DiffusionMap|.
> 
> What do you think is the best approach? Just conversions between the two 
> classes? Or also deprecate |DiffusionMap| objects and create a 
> |diffusion_map| function that returns a |SingleCellExperiment| object 
> with the reduced dimensions and all the necessary metadata for further 
> methods like e.g. DPT?
> 
> I think for the latter, |SingleCellExperiment| isn’t quite cool enough 
> yet :P. I’d like to have the full ergonomics of |DiffusionMap|:
> 
>   * A |names| method (returning gene and per-cell-metadata names)
>   * Gene/per-cell-metadata access by |$| and |[[|.
>   * A |fortify| method that makes everything available in ggplot2. (E.g.
>     |ggplot(dm, aes(DC1, DC2, colour = Condition))| works!)
> 
> I can do without the remaining methods (or provide them in destiny), as 
> they are neither general purpose enough for |SingleCellExperiment| 
> nor really necessary, e.g. I can add an alias |plot(a_dm_object)| → 
> |plot_dm(a_sce_object)|.

Not everything needs to be a SCE object. In fact, I would argue that it 
doesn't really make sense for the DiffusionMap() function to return a 
SingleCellExperiment object, as this would seem to conceptually limit 
the DiffusionMap() function to single-cell data. (By comparison, it does 
make sense to accept a SCE class - amongst others - as input, given that 
destiny is often used for this type of data.)

From a user perspective, if the DiffusionMap() function vomits out a 
lot of metadata fields, that might not be desirable if only the final 
diffusion coordinates are of interest. In such cases, I would find it 
easier to just extract the coordinates and store them manually with 
reducedDim<-. Whether this is done from a DiffusionMap or 
SingleCellExperiment output makes little difference to me.

Finally, I'm not sure what advantages those ergonomics provide. Indeed, 
if every package defines its own plot() S4 method for 
SingleCellExperiment, they will clobber each other in the dispatch 
table, so the behaviour you get will depend on package loading order. 
If you have destiny-specific data and methods, best to 
keep them separate rather than stuffing them into the SCE object.
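
The clobbering can be mimicked in a single R session. This is only a sketch: `ToySCE` and `describe` are hypothetical stand-ins for SingleCellExperiment and plot(), and the two setMethod() calls play the role of two packages being loaded one after the other.

```r
library(methods)

# Toy stand-in for a shared class such as SingleCellExperiment.
setClass("ToySCE", representation(n_cells = "numeric"))

# Hypothetical generic that two packages both want to implement.
setGeneric("describe", function(object) standardGeneric("describe"))

# "Package A" registers its method first...
setMethod("describe", "ToySCE", function(object) "package A's version")

# ...then "package B" registers one for the same generic and signature.
setMethod("describe", "ToySCE", function(object) "package B's version")

result <- describe(new("ToySCE", n_cells = 10))
result  # whichever definition was registered last
```

Swapping the order of the two setMethod() calls changes what `describe()` returns, which is the kind of surprise a user would get from loading packages in a different order.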

Our vision for the SCE class is to coordinate inputs into many packages 
across a long, long workflow. A little detour into destiny's classes for 
a small portion of the workflow doesn't pose much trouble, as long as 
any relevant statistics can be extracted and stored in the SCE object 
when it moves to the next stage of the workflow.

-Aaron

> ------------------------------------------------------------------------
> *From: *"Aaron Lun" <alun at wehi.edu.au>
> *To: *"bioc-devel" <bioc-devel at r-project.org>
> *Sent: *Monday, 31 July 2017 10:38:03
> *Subject: *Re: [Bioc-devel] transitioning scater/scran to 
> SingleCellExperiment
> 
> Dear developers,
> 
> Both scater and scran will be migrating to the SingleCellExperiment
> class (https://bioconductor.org/packages/SingleCellExperiment) in the
> next BioC release. This is based on a SummarizedExperiment and provides
> a more modern user interface, as well as supporting different matrix
> representations (e.g., dgCMatrix, HDF5Matrix).
> 
> We note that there are a number of Bioconductor packages that depend
> on/import/suggest scater or scran, which we have listed below:
> 
> scDD
> scone
> SIMLR
> splatter
> Glimma
> SC3
> phenopath
> switchde
> 
> To the maintainers of these packages, we advise switching from SCESet to
> SingleCellExperiment as soon as possible; the former will be deprecated
> in the next release cycle. There are several things to note here:
> 
> - The SCESet previously contained a number of slots relating to
> distances and clustering results. These are no longer present in the
> SingleCellExperiment, in line with the minimalist design philosophy of
> that package. If these are necessary, we suggest extending the
> SingleCellExperiment class in your own packages(*).
> 
> - For packages that depend directly on methods in scater or scran, a
> number of methods have been removed. This aims to simplify the analysis
> workflow and code maintenance by reducing redundancy. Please ensure that
> your package does not need those missing methods by CHECKing it against
> the experimental versions(**) of these two packages:
> 
> https://github.com/LTLA/scran
> https://github.com/davismcc/scater/tree/future
> 
> If there are any issues with the switch, please let us know and we will
> do our best to figure out the most appropriate fix.
> 
> Regards,
> 
> Aaron, Davis and Davide
> 
> (*): If there is popular demand for some slots, we may consider
> including them in the base SingleCellExperiment object.
> 
> (**): These versions are highly experimental and fluid, and results are
> likely to be unstable over the coming month. Nonetheless, if something
> is breaking, it is best that we know sooner rather than later. Or in
> other words, don't start complaining when it's close to release time.