[Bioc-devel] rownames in SummarizedExperiments

Mon Apr 7 03:22:56 CEST 2014

On 04/06/2014 04:21 PM, Michael Lawrence wrote:
>
>
>
> On Sun, Apr 6, 2014 at 2:48 PM, Simon Anders <anders at embl.de> wrote:
>
>     Hi Michael
>
>     On 06/04/14 23:32, Michael Lawrence wrote:
>      > On an arbitrary vector, the names do not need to be unique, but they DO
>      > need to be unique on a DataFrame (according to the data.frame
>      > conventions). Conditioning on whether there are duplicate names would be
>      > too complicated, so it is left to the user to declare whether the names
>      > are expected on the result. Since in general the vector names are not
>      > valid rownames, the default is FALSE. I guess if we really wanted to be
>      > consistent with R, we would mangle the names to make them unique, but
>      > that check is expensive.
>
>     Thanks for the response, but I'm not sure I understand it. I thought
>     "use.names=TRUE" instructs "mcols" to use the rownames of the
>     SummarizedExperiment object as rownames for the returned DataFrame. Now,
>     as the rownames of the SummarizedExperiment have to be unique anyway (at
>     least, I suppose they have to -- they are names, too, after all, and not
>     just an arbitrary vector), how can it happen that duplicate names might
>     appear?
>
>
> I don't think the SE rownames are constrained to be unique. I haven't tested it,

Empirically, the row names can be duplicated, but the column names cannot.

The lack of constraint on row names is enabled by the rowData GenomicRanges, 
while the constraint on column names is introduced by the (rownames of the) 
colData DataFrame. So the lack of symmetry in the class leads to lack of 
symmetry for dimnames. The use of GenomicRanges for rows has been the subject of 
previous discussion.
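
The same asymmetry can be reproduced in base R, without any Bioconductor classes. The following is only a sketch: the matrix `m`, the gene names, and the `yellowness` values are made up for illustration, and it shows the matrix versus data.frame convention rather than the actual SummarizedExperiment validity code.

```r
# Base R only; 'm' and the gene names are made up for illustration.
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("gene_A", "gene_A", "gene_B"), c("s1", "s2")))

# A matrix accepts duplicated row names without complaint.
has_dups <- anyDuplicated(rownames(m)) > 0

# A data.frame refuses the same names as row.names; capture the error
# message instead of stopping.
res <- tryCatch(
    data.frame(yellowness = c(0.1, 0.2, 0.3), row.names = rownames(m)),
    error = function(e) conditionMessage(e)
)
```

Here `has_dups` is TRUE, and `res` holds an error message rather than a data.frame, because the data.frame constructor rejects duplicated row names.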

It wouldn't be inconceivable to impose constraints on duplicate row names in 
SummarizedExperiment and set use.names=TRUE by default, or to redefine mcols(se) 
to use use.names=!any(duplicated(rownames(se))). There would be performance 
consequences (how much?) and an mcols inconsistency. I think this is part of the 
same discussion as

   https://stat.ethz.ch/pipermail/bioc-devel/2014-March/005409.html

which I have not yet followed through on.
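
To make the second option concrete, here is a minimal base R sketch of "attach names only when they are unique". The helper `with_names_if_unique` is hypothetical and not part of any package; a plain data.frame stands in for the DataFrame returned by mcols, and a character vector stands in for rownames(se).

```r
# Hypothetical helper, not part of any package: attach 'nms' as row names
# only when they are present, complete and unique; otherwise leave the
# default integer row names in place.
with_names_if_unique <- function(df, nms) {
    ok <- !is.null(nms) && !anyNA(nms) && anyDuplicated(nms) == 0L
    if (ok) rownames(df) <- nms
    df
}

meta <- data.frame(yellowness = c(0.1, 0.2, 0.3))
uniq <- with_names_if_unique(meta, c("gene_A", "gene_B", "gene_C"))
dups <- with_names_if_unique(meta, c("gene_A", "gene_A", "gene_B"))

by_name   <- uniq["gene_B", "yellowness"]  # lookup by name works
dup_names <- rownames(dups)                # default row names are kept
```

The uniqueness check runs on every call, which is where the performance cost would come from. The result also carries names or not depending on the data, which is one way to read the inconsistency mentioned above.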

Syntax-wise, there is also

   mcols(se)[rownames(se) == "gene_D", "yellowness"]

This is more efficient (and more error-prone) than either use.names or Michael's 
suggestion.
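
One plausible reading of "more error-prone" can be sketched in base R. The stand-ins are assumptions: `meta` plays the role of mcols(se) and `ids` the role of rownames(se), and the NA entry is included only to show how a logical index behaves when a name is missing.

```r
# Base R stand-ins: 'meta' plays the role of mcols(se), 'ids' of rownames(se).
meta <- data.frame(yellowness = c(0.1, 0.2, 0.3, 0.4))
ids  <- c("gene_A", "gene_D", "gene_D", NA)

# Duplicated names: the comparison selects every match, not just one.
# The NA name gives an NA index, which adds an NA element to the result.
all_hits <- meta[ids == "gene_D", "yellowness"]

# which() drops the NA and keeps only the true matches.
true_hits <- meta[which(ids == "gene_D"), "yellowness"]

# A name that matches nothing gives a zero-length vector, not an error.
no_hits <- meta[which(ids == "gene_X"), "yellowness"]
```

None of these cases raises an error, so a caller expecting exactly one value has to check the length of the result explicitly.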

Martin

> but I don't see the assertion in the code. This is because an SE is modeled as a
> matrix, which does not have the same constraint as a data.frame.
>
>     The use case: I have a SummarizedExperiment object with gene IDs in the
>     rownames. Let's say I want to get the value in the meta-data column
>     "yellowness" for "gene_D".
>
>     With an ExpressionSet, I could write:
>         fData(es)["gene_D","yellowness"]
>
>     With SummarizedExperiment, it has to be:
>         mcols(se,use.names=TRUE)["gene_D","yellowness"]
>
>     Of course, it's no big deal, but I find it quite clumsy, and I wonder
>     why it has to be this way.
>
>
> Well, there's this syntax:
> mcols(se["gene_D",])$yellowness
>
>
>        Simon
>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793