[Bioc-devel] rbind for ExpressionSet objects?

Wed May 7 07:15:38 CEST 2008

On Tue, 6 May 2008, Laurent Gautier wrote:

> 2008/5/6 Gordon K Smyth <smyth at wehi.edu.au>:
>> Thanks to both Martins for their replies.
>>
>>  I hadn't realised before that combine() is actually a merge-like function,
>> although admittedly more careful reading of the help page would have warned
>> me.  The name did confuse me: combine() is unlike c() in the base package
>> but instead very similar to merge().
>>
>>  I really did want genuine rbind() and cbind() functions.  I now see that
>> combine() does more than I want, and the possibility of unwanted effects
>> gives me less trust in it for my work.
>>
>> There is some difference in philosophy here I think.  I think of microarray
>> data objects as analogous to matrices, whereas combine() is viewing them as
>> analogous to data.frames.  It makes sense to "merge" data.frames, but not
>> matrices, because row and column names might not be unique.  I am quite
>> happy to entertain microarray objects with repeated row or column names.
>> Even if I wasn't, I would find it hard to ensure that sample names are
>> unique across different experimental runs, especially considering that the
>> names may be set by data files and software which are not under my control.
>
> There are always several ways to skin a cat, but the data structure
> proposed for microarray
> data is becoming rather handy and saves one the trouble of reinventing the wheel
> (and I can tell you that I am of the picky kind).
> It can probably do a lot of what you need, and take care of the
> bookkeeping for you.
> For example, the slot featureData can accommodate repeated names in
> one of its columns
> if you have any need for that.
> About not having unique sample names, I can tell you that you *do*
> implicitly have them:
> the position of each column in a matrix is a way to identify your
> data. Making whatever
> you have unique is only a matter of using a sequence of integers for example.
>
> Hoping this helps,
> L.

Dear Laurent,

You've interpreted my post to say almost the opposite of what I intended,
no doubt my fault for making such an obscure comment late at night.

I can only agree with you that data objects can be useful, that 
featureData columns can contain anything, and that matrices have column 
numbers, and be sobered by the fact that you believe these things to be 
new to me.

If you play with merge() or combine(), you'll find that column number has 
no significance in these functions, and instead column names take 
precedence in determining sample identity.  This can have spectacular 
consequences.  This is not to say that they are bad functions, not at all, 
just that they make a rather strong set of assumptions, which are different
from those made by cbind() and rbind().  For work done at the FHCRC, the
combine() assumptions are useful and productive (eg, row names might be 
Affy probe IDs and col names might be patient IDs, both of which should be 
unambiguous through an entire study), whereas they're not quite so useful 
for the type of data I see most often.  In making this observation, I am 
not backing away from the whole concept of a data class, just the use of 
combine() over rbind() or cbind().  Hope this is a little clearer.

Best wishes
Gordon
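
The contrast drawn above between position-based binding and name-based
merging can be sketched in base R alone. This is only an illustration under
stated assumptions: the objects run1 and run2 and the names g1, g2, s1, s2 are
made up for the example, and merge() on data.frames stands in for name-keyed
behaviour in general. It does not attempt to reproduce what combine() does on
ExpressionSet objects.

```r
# Hypothetical data: two experimental runs whose sample names coincide,
# e.g. because the scanner software always labels columns s1, s2.
run1 <- matrix(1:4, nrow = 2, dimnames = list(c("g1", "g2"), c("s1", "s2")))
run2 <- matrix(5:8, nrow = 2, dimnames = list(c("g1", "g2"), c("s1", "s2")))

# cbind() is positional: all four columns are kept, repeated names and all.
both <- cbind(run1, run2)
colnames(both)   # "s1" "s2" "s1" "s2"

# merge() keys on names. By default it joins on every column name the two
# data.frames share (here gene, s1 and s2), so the expression values
# themselves become part of the key and no row matches.
d1 <- data.frame(gene = rownames(run1), run1)
d2 <- data.frame(gene = rownames(run2), run2)
merged <- merge(d1, d2)
nrow(merged)     # 0

# Restricting the key to the gene column keeps the rows, but the repeated
# sample names are renamed with the default suffixes .x and .y.
by_gene <- merge(d1, d2, by = "gene")
names(by_gene)   # "gene" "s1.x" "s2.x" "s1.y" "s2.y"
```

With unique sample names the two approaches would agree much more closely;
the difference only becomes visible when names repeat across runs, which is
the situation described in the thread.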