[Bioc-devel] Subsetting eSet-like objects with duplicated indices
Martin Morgan
mtmorgan at fhcrc.org
Wed Feb 12 15:49:12 CET 2014
On 02/11/2014 05:03 PM, Benilton Carvalho wrote:
> Hi,
>
> I'm trying to understand why FeatureSet objects behave slightly different
> than eSet objects.
There's a combination of things going on, some of which are unfortunate /
unintended.
The basic problem is that, with regard to row names, subsetting a matrix with
duplicate indexes behaves differently from subsetting a data.frame:
  > matrix(0, 2, 2, dimnames=list(1:2, 3:4))[c(1,1),]
    3 4
  1 0 0
  1 0 0
  > data.frame(x=1:2, y=3:4)[c(1, 1),]
      x y
  1   1 3
  1.1 1 3
The creation of artificial row names is particularly bad when the row name
identifier has an integer component, like an Ensembl gene id, because then the
row name appears somehow legitimate but really isn't.
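To make that concrete, here is a minimal base R sketch. The two row names are only illustrative identifiers in the Ensembl gene id format, and the data.frame is a stand-in for feature data, not anything taken from Biobase. Subsetting a data.frame with a duplicated index makes the row names unique by appending a numeric suffix, so the munged name ends up looking like a versioned identifier.

```r
## Stand-in feature annotation; the row names are illustrative ids only
fd <- data.frame(symbol = c("BRCA2", "TP53"),
                 row.names = c("ENSG00000139618", "ENSG00000141510"))

## Select the first row twice
dup <- fd[c(1, 1), , drop = FALSE]

## The second copy gets an artificial ".1" suffix
rownames(dup)
## [1] "ENSG00000139618"   "ENSG00000139618.1"
```

Nothing in the result distinguishes the artificial "ENSG00000139618.1" from an identifier that carried a ".1" suffix to begin with.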
What happens when subsetting an ExpressionSet? Some of each, unfortunately:
  library(Biobase)
  m = matrix(0, 2, 2, dimnames=list(1:2, 3:4))
  e = ExpressionSet(m)[c(1, 1),]
  rownames(fData(e))   ## featureNames(featureData(e))
  ## [1] "1"   "1.1"
  rownames(exprs(e))   ## featureNames(assayData(e))
  ## [1] "1" "1"
and, perhaps more unfortunately, the validity of the object returned by
subsetting is not checked:
  validObject(e)
  ## Error in validObject(e) :
  ##   invalid class "ExpressionSet" object: featureNames differ
  ##   between assayData and featureData
NChannelSet seems to behave better: it detects the conflicting labels and
fails.
Because the row identifiers need to be munged, and munged identifiers are bad,
it seems like the NChannelSet failure is the desired behavior. The behavior of
ExpressionSet needs to be cleaned up. It seems like the identifiers could be
managed separately from the row names, and the validity of returned objects
checked. The latter is likely to break code that currently works, because an
early paradigm was to update an object incrementally.
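One way to picture "identifiers managed separately from the row names" is the base R sketch below. It is only an illustration under stated assumptions: subsetFeatures() is a hypothetical helper, not a Biobase function, the identifiers live in an ordinary column called id, and only integer or logical indices are handled.

```r
## Hypothetical helper (not part of Biobase): subset an assay matrix and
## its feature annotation together. Identifiers are carried in the 'id'
## column, so the data.frame row names never need to be made unique.
## Assumes 'i' is an integer or logical index.
subsetFeatures <- function(assay, features, i) {
    stopifnot(identical(rownames(assay), features$id))
    assay <- assay[i, , drop = FALSE]
    features <- features[i, , drop = FALSE]
    rownames(features) <- NULL    # drop the munged "1", "1.1" row names
    ## check the returned pieces before handing them back
    stopifnot(identical(rownames(assay), features$id))
    list(assay = assay, features = features)
}

m <- matrix(0, 2, 2, dimnames = list(c("ENSG1", "ENSG2"), c("s1", "s2")))
f <- data.frame(id = rownames(m), symbol = c("A", "B"),
                stringsAsFactors = FALSE)

res <- subsetFeatures(m, f, c(1, 1))
res$features$id       # duplicated identifiers kept as they are
rownames(res$assay)   # and they still agree with the assay row names
```

Because the identifier column is allowed to contain duplicates, the two pieces stay consistent after subsetting with c(1, 1), and the final stopifnot() plays the role of the validity check on the returned object.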
An alternative is to 'start again' using the much more well-designed IRanges
infrastructure, along the lines of
  .ExpressionExperiment <- setClass("ExpressionExperiment",
      representation(exptData="List",
                     rowData="DataFrame",
                     colData="DataFrame",
                     assays="SimpleList"))
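As a rough illustration of how such a class could check its own consistency, the sketch below uses only the methods package. It is not the design above: MiniExperiment is a hypothetical name, and plain "data.frame" and "list" slots stand in for the IRanges DataFrame and SimpleList classes. The validity function is likewise an assumption about what one would want to check.

```r
library(methods)

## Hypothetical stand-in class; base types replace the IRanges classes
.MiniExperiment <- setClass("MiniExperiment",
    representation(rowData = "data.frame",
                   colData = "data.frame",
                   assays = "list"),
    validity = function(object) {
        ## every assay must have one row per feature, one column per sample
        ok <- vapply(object@assays, function(a)
            nrow(a) == nrow(object@rowData) &&
                ncol(a) == nrow(object@colData),
            logical(1))
        if (all(ok)) TRUE
        else "assay dimensions differ from rowData / colData"
    })

m <- matrix(0, 2, 2)
rd <- data.frame(symbol = c("A", "B"))
cd <- data.frame(group = c("x", "y"))

## Consistent pieces: construction succeeds
good <- .MiniExperiment(rowData = rd, colData = cd, assays = list(exprs = m))

## Row data subset with a duplicated index but the assay left alone:
## the validity check runs at construction and reports the mismatch
bad <- tryCatch(
    .MiniExperiment(rowData = rd[c(1, 1, 2), , drop = FALSE],
                    colData = cd, assays = list(exprs = m)),
    error = function(e) conditionMessage(e))
```

Here bad holds the error message rather than an object, which is the kind of early failure that subsetting an ExpressionSet currently lacks.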
Simon Anders will recognize this design from an earlier suggestion of his.
Martin
>
> Here's the one example I'm trying to work out:
>
> if (!require(pd.hugene.1.0.st.v1)){
> library(BiocInstaller)
> biocLite('pd.hugene.1.0.st.v1')
> }
> library(oligoData)
> data(affyGeneFS)
> affyGeneFS
> data(sample.ExpressionSet)
> sample.ExpressionSet
>
> ## subset ExpressionSet
> ## everything ok
> sample.ExpressionSet[c(1, 1),]
>
> ## subset FeatureSet
> ## error: featureNames differ between assayData and featureData
> affyGeneFS[c(1, 1),]
>
> But FeatureSets are derived from NChannelSet objects... so:
>
> example('NChannelSet-class')
> obj
> obj[c(1, 2),] ## OK
> obj[c(1, 1),] ## not OK
>
> I was wondering why/if this is intended (i.e., it works on "single channel"
> eSets, but fails on NChannelSets)?
>
> Thank you so much for any insight,
>
> benilton
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793