[Bioc-devel] changed 'split' behavior
Martin Morgan
mtmorgan at fhcrc.org
Wed Feb 27 02:13:28 CET 2013
Bioc-develers,
Wanted to mention a change to R svn revision 62015 arising from
https://stat.ethz.ch/pipermail/r-devel/2013-January/065700.html
when the first argument, x, is a class. Previously, the behaviour was like
ind <- split(seq_along(f), f)
lapply(ind, function(i) x[i])
but the updated behaviour is like
ind <- split(seq_along(x), f)
lapply(ind, function(i) x[i])
with 'x' replacing 'f' in seq_along. This is more consistent with the
documentation and other parts of the default implementation.
We've seen two types of problem. The first is as in genomeIntervals after
running example(interval_union) where we have an object 'i' that represents 7
intervals
> i
Object of class Genome_intervals_stranded
7 base intervals and 0 inter-base intervals(*):
i.gene.1 chr01 + [1, 2]
i.gene.2 chr01 + (2, 5)
i.gene.3 chr02 - [11, 12)
i.gene.4 chr02 - [8, 9)
i.gene.5 chr02 - [5, 10]
i.gene.6 chr02 + [4, 12]
i.gene.7 chr03 + [2, 6)
annotation:
seq_name inter_base strand
1 chr01 FALSE +
2 chr01 FALSE +
3 chr02 FALSE -
4 chr02 FALSE -
5 chr02 FALSE -
6 chr02 FALSE +
7 chr03 FALSE +
and one wishes to split by strand.
> split(i, strand(i))
Error in x at .Data[i, , drop = FALSE] : subscript out of bounds
This fails because seq_along uses length(i) != length(strand(i)) (length(i)
reports 14, because i has a 7 x 2 matrix in its .Data slot)
A solution is to define a length method at an appropriate place in the class
hierarchy. Such a method would seem to make 'sense' with the vector-like
behaviour of [ sub-setting on i.
setMethod(length, "Intervals_virtual", function(x) nrow(x))
> split(i, strand(i))
$`+`
Object of class Genome_intervals_stranded
4 base intervals and 0 inter-base intervals(*):
i.gene.1 chr01 + [1, 2]
i.gene.2 chr01 + (2, 5)
i.gene.6 chr02 + [4, 12]
i.gene.7 chr03 + [2, 6)
annotation:
seq_name inter_base strand
1 chr01 FALSE +
2 chr01 FALSE +
6 chr02 FALSE +
7 chr03 FALSE +
$`-`
Object of class Genome_intervals_stranded
3 base intervals and 0 inter-base intervals(*):
i.gene.3 chr02 - [11, 12)
i.gene.4 chr02 - [8, 9)
i.gene.5 chr02 - [5, 10]
annotation:
seq_name inter_base strand
3 chr02 FALSE -
4 chr02 FALSE -
5 chr02 FALSE -
In a second example, fastseg::fastseg can take an ExpressionSet
> fdata = AnnotatedDataFrame(data.frame(f=c("M", "M", "F", "F")))
> e = ExpressionSet(matrix(1:12, 4), featureData=fdata)
and internally does something like
> split(e, fData(e)$f)
... list of ExpressionSets, but not split by fData(e)$f
Warning message:
In split.default(e, fData(e)$f) :
data length is not a multiple of split variable
Here ExpressionSet is really more matrix-like than vector-like. length(e)
reports 1 (maybe it should report nrow * ncol, like matrix), and it doesn't make
sense to write a length method on ExpressionSet that returns the number of rows,
or a split,ExpressionSet-method. Instead the most parsimonious solution seems to
be to replace a call to split with a one-liner that recapitulates the original
behaviour
lapply(split(seq_len(nrow(x)), f=f), function(ind) x[ind, , drop = FALSE])
(split has a 'drop' argument that one might wish to pay attention to).
Perhaps there are similar uses of split in your own code?
Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list