[Bioc-devel] assay dimnames in SingleCellExperiment / SummarizedExperiment

Kevin RUE kevinrue67 at gmail.com
Sat Sep 16 12:49:57 CEST 2017


Hi Aaron,

Yes - sorry, I meant the names of dimnames. Dimnames are indeed checked,
but my code was meant to demonstrate that names of dimnames aren't.
Obviously, it's not the end of the world, but just something I noticed
while I was investigating the glitch.

My second point is less about calling dim() or dimnames() than about the
side effects of names(dimnames(x)) being non-NULL, as in the case of
`reshape2::melt`.
I think it'd be one less worry for downstream methods to 'know' the
column names of a melted assay(x, 1) in advance, instead of getting
"Var1, Var2, value" if names(dimnames) is NULL and "something else" if it
is not.
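
To make the side effect concrete, here is a minimal sketch on a toy matrix
rather than a real assay. The matrix and the "Feature"/"Cell" labels are
illustrative only, and it assumes reshape2 is installed:

```r
library(reshape2)

# Toy 2 x 2 matrix standing in for assay(x, 1); row and column names are set,
# but names(dimnames(m)) is NULL.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("g1", "g2"), c("c1", "c2")))
unnamed_cols <- colnames(melt(m))
print(unnamed_cols)  # "Var1" "Var2" "value"

# Same data, now with names on the dimnames (labels chosen for illustration).
names(dimnames(m)) <- c("Feature", "Cell")
named_cols <- colnames(melt(m))
print(named_cols)    # "Feature" "Cell" "value"
```

Any downstream code that refers to the melted columns by name has to handle
both cases unless the names are standardised upstream.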

Beyond aesthetics, it's really just semantics, but I do think small stuff
like that, if handled at a higher class level, can encourage downstream
developers to work off a more consistent mental and computational model (my
take from Michael Lawrence's BOF at Bioc2017). In other words, it costs
little to implement once in the parent class, and it avoids if-else
statements in each child class.

It could be something as simple as:

   - c("Feature", "Sample") at the `SummarizedExperiment` level
   - overridden by c("Feature", "Cell") in `SingleCellExperiment`
   - overridden by developer's choice in other dependent packages.
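
As a rough sketch of that idea in plain R, with a function default standing
in for the parent-class default. Note that `label_dims` is a hypothetical
helper for illustration, not an existing function in either package:

```r
# Hypothetical helper: put a fixed set of names on the dimnames of a matrix.
# The default mimics a SummarizedExperiment-level choice; callers (or child
# classes) can override it.
label_dims <- function(x, labels = c("Feature", "Sample")) {
  # Assumes x already has dimnames; the NULL case is left out of this sketch.
  stopifnot(!is.null(dimnames(x)), length(labels) == length(dim(x)))
  names(dimnames(x)) <- labels
  x
}

m <- matrix(1:4, nrow = 2,
            dimnames = list(c("g1", "g2"), c("c1", "c2")))

default_names <- names(dimnames(label_dims(m)))
print(default_names)  # "Feature" "Sample"

# Override, as SingleCellExperiment might.
cell_names <- names(dimnames(label_dims(m, c("Feature", "Cell"))))
print(cell_names)     # "Feature" "Cell"
```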


All the best,
Kevin


On Sat, Sep 16, 2017 at 6:43 AM, Aaron Lun <alun at wehi.edu.au> wrote:

> I'll leave the first point to the SummarizedExperiment maintainers, though
> I note that your code seems to be about the names of the dimnames rather
> than the dimnames themselves. (I'm under the impression that consistency in
> the actual dimnames is enforced somehow by the SE constructor.)
>
>
> As for the second point; I suppose we *could* set the second name for the
> dimnames as "Cells" in SingleCellExperiment, though the choice for the
> first name is more ambiguous. This request has come up before, and I've
> never been entirely convinced by its necessity. It seems mostly aesthetic
> to me, and honestly, if a user doesn't already know that rows are genes and
> columns are cells, I can't see them flailing away at the keyboard until
> they call dim() to tell them what the dimensions correspond to.
>
>
> But I guess other people like aesthetics, so if you want, you can put in a
> PR to override dim() and dimnames() for SingleCellExperiment to put some
> names on the returned vectors or lists. If I had to choose, I would go with
> "Features" and "Cells" for the rows and columns, respectively. (We already
> use a RSE so we're already implicitly assuming genomic features.)
>
>
> -Aaron
> ------------------------------
> *From:* Kevin RUE <kevinrue67 at gmail.com>
> *Sent:* Thursday, 14 September 2017 10:57:39 PM
> *To:* bioc-devel
> *Cc:* davis at ebi.ac.uk; risso.davide at gmail.com; Aaron Lun; Maintainer
> *Subject:* assay dimnames in SingleCellExperiment / SummarizedExperiment
>
> Dear all,
>
> I cc-ed the individual package maintainers on this email to 'notify' them
> of this thread directly and get their respective opinions, but I thought
> the common use of SummarizedExperiment was worth involving the community
> as well.
>
> Background: I was updating one of my workflows from SCESet to the
> SingleCellExperiment class recently introduced on the development branch.
>
> 1)
> One thing leading to another, I ended up noticing that there is no
> validity check on dimnames of the various assays in SummarizedExperiment.
> In other words, the different assays can have different `dimnames` (or some
> assays can have NULL dimnames). Using the example code from
> SummarizedExperiment:
>
> library(SummarizedExperiment)
>
> nrows <- 200; ncols <- 6
> counts3 <- counts2 <- counts <-
>   matrix(runif(nrows * ncols, 1, 1e4), nrows)
>
> rnames <- paste0("F_", sprintf("%03.f", seq_len(nrows)))
> cnames <- LETTERS[1:6]
>
> dimnames(counts) <- list(rnames, cnames)
> dimnames(counts2) <- list(Tags = rnames, Samples = cnames)
> dimnames(counts3) <- list(Features = rnames, Cells = cnames)
>
> colData <- DataFrame(row.names=cnames)
>
> rse <- SummarizedExperiment(assays=SimpleList(c1=counts, c2=counts2,
> c3=counts3), colData=colData)
>
> assayNames(rse)
> names(dimnames(assay(rse, "c1"))) # NULL
> names(dimnames(assay(rse, "c2"))) # [1] "Tags"    "Samples"
> names(dimnames(assay(rse, "c3"))) # [1] "Features" "Cells"
>
> Although not critical, it'd probably be best practice to have a validity
> check on identical dimnames across all assays, so that one does not have
> to worry later about `melt` calls returning different column names
> depending on whether each assay has proper dimnames or not.
>
>
> 2)
> The initial glitch that prompted this email related to the
> `reshape2::melt` method that extracts dimnames, if available, in the
> `scater::plotHighestExprs` function. Anyway, Davis has already prepared a
> fix to deal with the scenario whereby the assay does have dimnames (e.g.
> counts in the edgeR::DGEList class that I generally use to import counts).
> Somehow that wasn't an issue with the SCESet that I was using previously
> (probably a side-effect of ExpressionSet).
>
> The point is, the glitch prompted me to think whether a potential
> standardisation of names(dimnames) could be beneficial, perhaps more
> specifically in the new `SingleCellExperiment` class (as
> SummarizedExperiment has a much more general purpose). Considering the
> fairly specific purpose of the former, I was wondering whether it would be
> worth:
>
>    - enforcing names(dimnames(x)) to "Features" and "Cells", (bearing in
>    mind that features could still be genes, transcripts, ...)
>    - or maybe dropping dimnames altogether, storing them only once
>    elsewhere (although a slot for that seems overkill)
>
> There may be other possibilities that I haven't thought of yet, but I
> thought I'd get the ball rolling.
> Having well-defined dimnames sounds like good practice, with the added
> benefit of generating aesthetically pleasing column names in melted data
> frames as a by-product.
> However, I can't tell whether the handling of dimnames is something that
> needs to be handled by individual downstream package developers, or
> whether standards should be set in parent classes.
>
>
> Thanks for your time!
>
> Best,
> Kevin
>
