[Bioc-devel] changed 'split' behavior

Wed Feb 27 02:13:28 CET 2013

Bioc-develers,

Wanted to mention a change to R svn revision 62015 arising from

     https://stat.ethz.ch/pipermail/r-devel/2013-January/065700.html

when the first argument, x, is a class. Previously, the behaviour was like

     ind <- split(seq_along(f), f)
     lapply(ind, function(i) x[i])

but the updated behaviour is like

     ind <- split(seq_along(x), f)
     lapply(ind, function(i) x[i])

with 'x' replacing 'f' in seq_along. This is more consistent with the 
documentation and other parts of the default implementation.

We've seen two types of problem. The first is as in genomeIntervals after 
running example(interval_union) where we have an object 'i' that represents 7 
intervals

 > i
Object of class Genome_intervals_stranded
7 base intervals and 0 inter-base intervals(*):
i.gene.1 chr01 + [1, 2]
i.gene.2 chr01 + (2, 5)
i.gene.3 chr02 - [11, 12)
i.gene.4 chr02 - [8, 9)
i.gene.5 chr02 - [5, 10]
i.gene.6 chr02 + [4, 12]
i.gene.7 chr03 + [2, 6)

annotation:
   seq_name inter_base strand
1    chr01      FALSE      +
2    chr01      FALSE      +
3    chr02      FALSE      -
4    chr02      FALSE      -
5    chr02      FALSE      -
6    chr02      FALSE      +
7    chr03      FALSE      +

and one wishes to split by strand.

 > split(i, strand(i))
Error in x at .Data[i, , drop = FALSE] : subscript out of bounds

This fails because seq_along uses length(i) != length(strand(i)) (length(i) 
reports 14, because i has a 7 x 2 matrix in its .Data slot)

A solution is to define a length method at an appropriate place in the class 
hierarchy. Such a method would seem to make 'sense' with the vector-like 
behaviour of [ sub-setting on i.

   setMethod(length, "Intervals_virtual", function(x) nrow(x))

 > split(i, strand(i))
$`+`
Object of class Genome_intervals_stranded
4 base intervals and 0 inter-base intervals(*):
i.gene.1 chr01 + [1, 2]
i.gene.2 chr01 + (2, 5)
i.gene.6 chr02 + [4, 12]
i.gene.7 chr03 + [2, 6)

annotation:
   seq_name inter_base strand
1    chr01      FALSE      +
2    chr01      FALSE      +
6    chr02      FALSE      +
7    chr03      FALSE      +

$`-`
Object of class Genome_intervals_stranded
3 base intervals and 0 inter-base intervals(*):
i.gene.3 chr02 - [11, 12)
i.gene.4 chr02 - [8, 9)
i.gene.5 chr02 - [5, 10]

annotation:
   seq_name inter_base strand
3    chr02      FALSE      -
4    chr02      FALSE      -
5    chr02      FALSE      -

In a second example, fastseg::fastseg can take an ExpressionSet

 > fdata = AnnotatedDataFrame(data.frame(f=c("M", "M", "F", "F")))
 > e = ExpressionSet(matrix(1:12, 4), featureData=fdata)

and internally does something like

 > split(e, fData(e)$f)
... list of ExpressionSets, but not split by fData(e)$f
Warning message:
In split.default(e, fData(e)$f) :
   data length is not a multiple of split variable

Here ExpressionSet is really more matrix-like than vector-like. length(e) 
reports 1 (maybe it should report nrow * ncol, like matrix), and it doesn't make 
sense to write a length method on ExpressionSet that returns the number of rows, 
or a split,ExpressionSet-method. Instead the most parsimonious solution seems to 
be to replace a call to split with a one-liner that recapitulates the original 
behaviour

lapply(split(seq_len(nrow(x)), f=f), function(ind) x[ind, , drop = FALSE])

(split has a 'drop' argument that one might wish to pay attention to).

Perhaps there are similar uses of split in your own code?

Martin

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793