[Bioc-devel] Subsetting eSet-like objects with duplicated indices

Wed Feb 12 15:49:12 CET 2014

On 02/11/2014 05:03 PM, Benilton Carvalho wrote:
> Hi,
>
> I'm trying to understand why FeatureSet objects behave slightly differently
> from eSet objects.

There's a combination of things going on, some of which are unfortunate / 
unintended.

The basic problem is that, with regard to row names, subsetting a matrix with 
duplicate indexes behaves differently from subsetting a data.frame: the matrix 
keeps the duplicated row name, whereas the data.frame invents a unique one

     > matrix(0, 2, 2, dimnames=list(1:2, 3:4))[c(1,1),]
       3 4
     1 0 0
     1 0 0
     > data.frame(x=1:2, y=3:4)[c(1, 1),]
         x y
     1   1 3
     1.1 1 3

The creation of artificial row names is particularly bad when the row name 
identifier has an integer component, like an Ensembl gene id, because then the 
row name appears somehow legitimate but really isn't.
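
To make the point concrete, here is a small base R sketch. The gene identifiers and values are made up purely for illustration; only the row-name behavior matters.

```r
## Illustrative data: two rows keyed by Ensembl-style gene ids (made-up values)
df <- data.frame(x = 1:2, y = 3:4,
                 row.names = c("ENSG00000139618", "ENSG00000157764"))

## Selecting the first row twice makes data.frame invent a new row name
dup <- df[c(1, 1), ]
rownames(dup)
## [1] "ENSG00000139618"   "ENSG00000139618.1"

## The invented name reads like a versioned identifier, but it was never
## present in the original data
rownames(dup) %in% rownames(df)
## [1]  TRUE FALSE
```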

What happens when subsetting an ExpressionSet? Some of each, unfortunately:

     m = matrix(0, 2, 2, dimnames=list(1:2, 3:4))
     e = ExpressionSet(m)[c(1, 1),]
     rownames(fData(e))    ## featureNames(featureData(e))
     ## [1] "1"   "1.1"
     rownames(exprs(e))    ## featureNames(assayData(e))
     ## [1] "1" "1"

and, perhaps more unfortunately, the validity of the object returned by 
subsetting is not checked:

     validObject(e)
     ## Error in validObject(e) :
     ##   invalid class "ExpressionSet" object: featureNames differ
     ##   between assayData and featureData

NChannelSet seems to behave better: it checks for the conflicting labels and 
fails.

Because the row identifiers would need to be munged, and munged identifiers are 
bad, it seems like the NChannelSet failure is the desired behavior. The behavior 
of ExpressionSet needs to be cleaned up. It seems like the identifiers could be 
managed separately from the row names, and the validity of returned objects 
checked. The latter is likely to break code that currently works, because an 
early paradigm was to update an object incrementally.
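
As a rough illustration of what "identifiers managed separately from the row names, with a check on the result" could look like, here is a base R sketch. The helper subsetFeatures() and the 'id' column are hypothetical names invented for this example; nothing like this exists in Biobase, and it is not a proposal for the actual fix.

```r
## Hypothetical helper (not part of Biobase): subset an assay matrix and its
## feature annotation together, keeping identifiers in a column instead of
## relying on data.frame row names, then check that the two still agree.
subsetFeatures <- function(assay, features, i) {
    assay <- assay[i, , drop = FALSE]
    features <- features[i, , drop = FALSE]
    rownames(features) <- NULL    # discard munged row names such as "1.1"
    ## 'id' is an assumed column holding the real feature identifiers
    stopifnot(identical(rownames(assay), features$id))
    list(assay = assay, features = features)
}

m <- matrix(0, 2, 2, dimnames = list(c("a", "b"), c("s1", "s2")))
fd <- data.frame(id = rownames(m), symbol = c("A1", "B1"),
                 stringsAsFactors = FALSE)

res <- subsetFeatures(m, fd, c(1, 1))
rownames(res$assay)
## [1] "a" "a"
res$features$id
## [1] "a" "a"
```

The identifiers stay duplicated but unmodified in both components, and the consistency check runs on every subset rather than only when validObject() happens to be called.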

An alternative is to 'start again' using the much better designed IRanges 
infrastructure, along the lines of

     .ExpressionExperiment <- setClass("ExpressionExperiment",
         representation(exptData="List",
                        rowData="DataFrame",
                        colData="DataFrame",
                        assays="SimpleList"))

Simon Anders will recognize this design from an earlier suggestion of his.

Martin

>
> Here's the one example I'm trying to work out:
>
> if (!require(pd.hugene.1.0.st.v1)){
>    library(BiocInstaller)
>    biocLite('pd.hugene.1.0.st.v1')
> }
> library(oligoData)
> data(affyGeneFS)
> affyGeneFS
> data(sample.ExpressionSet)
> sample.ExpressionSet
>
> ## subset ExpressionSet
> ## everything ok
> sample.ExpressionSet[c(1, 1),]
>
> ## subset FeatureSet
> ## error: featureNames differ between assayData and featureData
> affyGeneFS[c(1, 1),]
>
> But FeatureSets are derived from NChannelSet objects... so:
>
> example('NChannelSet-class')
> obj
> obj[c(1, 2),] ## OK
> obj[c(1, 1),] ## not OK
>
> I was wondering why/if this is intended (i.e., it works on "single channel"
> eSets, but fails on NChannelSets)?
>
> Thank you so much for any insight,
>
> benilton

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793