[Bioc-devel] assay dimnames in SingleCellExperiment / SummarizedExperiment

Thu Sep 14 14:57:39 CEST 2017

Dear all,

I have cc-ed the individual package maintainers on this email to notify
them of this thread directly and to get their respective opinions, but I
thought the common use of SummarizedExperiment was worth involving the
community as well.

Background: I was updating one of my workflows from SCESet to the
SingleCellExperiment class recently introduced on the development branch.

1)
One thing leading to another, I ended up noticing that there is no validity
check on dimnames of the various assays in SummarizedExperiment. In other
words, the different assays can have different `dimnames` (or some assays
can have NULL dimnames). Using the example code from SummarizedExperiment:

library(SummarizedExperiment)

nrows <- 200; ncols <- 6
counts3 <- counts2 <- counts <-
  matrix(runif(nrows * ncols, 1, 1e4), nrows)

rnames <- paste0("F_", sprintf("%03.f", seq_len(nrows)))
cnames <- LETTERS[1:6]

dimnames(counts) <- list(rnames, cnames)
dimnames(counts2) <- list(Tags = rnames, Samples = cnames)
dimnames(counts3) <- list(Features = rnames, Cells = cnames)

colData <- DataFrame(row.names=cnames)

rse <- SummarizedExperiment(
  assays = SimpleList(c1 = counts, c2 = counts2, c3 = counts3),
  colData = colData)

assayNames(rse)
names(dimnames(assay(rse, "c1"))) # NULL
names(dimnames(assay(rse, "c2"))) # [1] "Tags"    "Samples"
names(dimnames(assay(rse, "c3"))) # [1] "Features" "Cells"

Although not critical, it would probably be best practice to have a
validity check for identical dimnames across all assays, so that one does
not have to worry later about `melt` calls returning different column names
depending on whether each assay has proper dimnames or not.
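
As a rough illustration of such a check, here is a minimal sketch in base R.
It works on a plain list of matrices rather than on a SummarizedExperiment,
and `sameDimnames` is a hypothetical helper name, not an existing function in
any package:

```r
# Hypothetical helper: TRUE when every matrix in the list has the same
# dimnames as the first one, including names(dimnames).
# identical() also reports a mismatch when one side is NULL.
sameDimnames <- function(assayList) {
  if (length(assayList) == 0L) return(TRUE)
  ref <- dimnames(assayList[[1]])
  all(vapply(assayList,
             function(a) identical(dimnames(a), ref),
             logical(1)))
}

m1 <- matrix(1:6, nrow = 2,
             dimnames = list(c("F_1", "F_2"), c("A", "B", "C")))
m2 <- m1
names(dimnames(m2)) <- c("Features", "Cells")

sameDimnames(list(c1 = m1, c2 = m1))  # TRUE
sameDimnames(list(c1 = m1, c2 = m2))  # FALSE: names(dimnames) differ
```

A check along these lines could presumably be called from a class validity
method, but where exactly it would belong is part of the question raised here.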

2)
The initial glitch that prompted this email related to the `reshape2::melt`
method that extracts dimnames, if available, in the
`scater::plotHighestExprs` function. Anyway, Davis has already prepared a
fix to deal with the scenario whereby the assay does have dimnames (e.g.
counts in the edgeR::DGEList class that I generally use to import counts).
Somehow that wasn't an issue with the SCESet that I was using previously
(probably a side-effect of ExpressionSet).
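
To make the behaviour concrete, here is a small sketch (assuming the reshape2
package is installed) of how `melt` names its output columns depending on
names(dimnames). The toy matrix and the variable names are only for
illustration:

```r
library(reshape2)  # assumed to be installed

m <- matrix(1:4, nrow = 2,
            dimnames = list(c("F_1", "F_2"), c("A", "B")))

# Without names(dimnames), melt falls back to generic column names.
plainCols <- colnames(melt(m))

# With names(dimnames), those names become the column names.
names(dimnames(m)) <- c("Tags", "Samples")
namedCols <- colnames(melt(m))

plainCols  # "Var1" "Var2" "value"
namedCols  # "Tags" "Samples" "value"
```

Any downstream code that refers to the melted columns by a fixed name would
therefore see different inputs depending on how the assay was built.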

The point is, the glitch prompted me to wonder whether standardising
names(dimnames) could be beneficial, perhaps more
specifically in the new `SingleCellExperiment` class (as
SummarizedExperiment has a much more general purpose). Considering the
fairly specific purpose of the former, I was wondering whether it would be
worth:

   - enforcing names(dimnames(x)) to be "Features" and "Cells" (bearing in
   mind that features could still be genes, transcripts, ...)
   - or maybe dropping dimnames altogether, storing them only once
   elsewhere (although a slot for that seems overkill)
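
For the first option, a minimal sketch of what the enforcement could look like
in base R, applied to a plain list of matrices. `setDimLabels` is a
hypothetical helper, and the "Features"/"Cells" labels are simply the ones
suggested above:

```r
# Hypothetical helper: label the two dimensions of a matrix while leaving
# the row and column names themselves untouched.
setDimLabels <- function(m, labels = c("Features", "Cells")) {
  dn <- dimnames(m)
  if (is.null(dn)) return(m)  # no dimnames at all: nothing to label
  names(dn) <- labels
  dimnames(m) <- dn
  m
}

assayList <- list(
  c1 = matrix(1:4, nrow = 2,
              dimnames = list(c("F_1", "F_2"), c("A", "B"))),
  c2 = matrix(5:8, nrow = 2,
              dimnames = list(Tags = c("F_1", "F_2"), Samples = c("A", "B")))
)

assayList <- lapply(assayList, setDimLabels)
lapply(assayList, function(m) names(dimnames(m)))
```

After this, both matrices carry the same names(dimnames), regardless of how
they were labelled (or not) on the way in.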

There may be other possibilities that I haven't thought of yet, but I
thought I'd get the ball rolling.
Having well-defined dimnames sounds like good practice, with the added
benefit of generating aesthetically pleasing column names in melted data
frames as a by-product.
However, I can't tell whether the handling of dimnames is something that
needs to be handled by individual downstream package developers, or whether
standards should be set in parent classes.

Thanks for your time!

Best,
Kevin
