[R] Class that wraps Data Frame

Martin Morgan mtmorgan at fhcrc.org
Fri Aug 31 18:33:48 CEST 2012


I guess there are two issues with data.frame. It comes with more than 
you probably want to support (e.g., both list-like and matrix-like 
subsetting with [, and the user expecting to be able to independently 
modify any column). And it comes with less than you'd like (e.g., no 
support for a 'column' of S4 objects). By making a class that contains 
('is a') data.frame, you commit to both limitations.
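
A small sketch of the first limitation, using S3 inheritance for brevity 
(an S4 class that contains data.frame raises the same concern). The 
class name 'xyzFrame' and the columns are made up for illustration:

```r
# Hypothetical S3 class 'xyzFrame' that "is a" data.frame and is meant
# to always carry columns x (integer), y (character), z (numeric).
xyz <- data.frame(x = 1:3, y = c("a", "b", "c"), z = c(0.1, 0.2, 0.3),
                  stringsAsFactors = FALSE)
class(xyz) <- c("xyzFrame", "data.frame")

# The inherited data.frame methods know nothing about that contract.
xyz$x <- NULL              # drops a 'required' column without complaint
xyz$z <- c("p", "q", "r")  # changes a column's type without complaint

# The object still claims to be an 'xyzFrame'.
still_claims_class <- inherits(xyz, "xyzFrame")
has_x <- "x" %in% names(xyz)
z_type <- class(xyz$z)
```

Every inherited method would have to be overridden to close such gaps.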

You're probably using data.frame as a way to implement some basic 
restrictions -- equal-length columns, for instance. But there are 
additional restrictions, too: columns x, y, z must be present and of 
type integer, character, numeric respectively. For this scenario one is 
better off implementing an S4 class (whose typed slots provide the 
required columns and their type checking), a validity method (for 
enforcing the equal-length constraint), accessors, and sub-setting 
following the semantics that you'd like to support, e.g., just along 
the length of the required slots.
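
A minimal sketch of that design, assuming only the 'methods' package. 
The class name 'XYZ' and the accessor 'xcol' are invented here for 
illustration; accessors for y and z would follow the same pattern:

```r
library(methods)

# Typed slots give the required 'columns' and their type checking.
setClass("XYZ",
    representation(x = "integer", y = "character", z = "numeric"),
    validity = function(object) {
        # Enforce the equal-length constraint across the slots.
        n <- c(length(object@x), length(object@y), length(object@z))
        if (length(unique(n)) != 1L)
            "slots 'x', 'y' and 'z' must have equal length"
        else TRUE
    })

# Constructor; new() runs the validity method when slots are supplied.
XYZ <- function(x = integer(), y = character(), z = numeric())
    new("XYZ", x = x, y = y, z = z)

# Getter and setter for x (hypothetical name 'xcol').
setGeneric("xcol", function(object) standardGeneric("xcol"))
setMethod("xcol", "XYZ", function(object) object@x)
setGeneric("xcol<-", function(object, value) standardGeneric("xcol<-"))
setReplaceMethod("xcol", "XYZ", function(object, value) {
    object@x <- value
    validObject(object)   # re-check the length constraint
    object
})

setMethod("length", "XYZ", function(x) length(x@x))

# Sub-setting is one-dimensional, along the length of the slots.
setMethod("[", "XYZ", function(x, i, j, ..., drop = TRUE)
    XYZ(x@x[i], x@y[i], x@z[i]))

obj <- XYZ(1:3, c("a", "b", "c"), c(0.5, 1.5, 2.5))
sub <- obj[2:3]
```

With this, XYZ(1:3, "a", 1) fails the validity check, and supplying a 
non-integer x fails the slot type check.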

The richest place for this in Bioconductor is the IRanges package, 
though it can be a bit daunting from an architecture point of view. A 
couple of things to point to. One is the DataFrame class, which is like 
a data.frame but supports a broader set of column types (in particular 
S4 objects) and allows 'metadata' (itself a DataFrame, so recursive) on 
each column. It is relevant if it is important to maintain S4 classes 
in a data.frame-like structure.

Another is the IRanges class, which in some ways fits your overall use 
case. It is basically a rectangular data structure, but with required 
'columns' (the start and width of the range) and then arbitrary columns 
the user can add. It's implemented with slots for start and width, plus 
a slot that 'has a' DataFrame of 'metadata columns' (the actual 
implementation is more complicated than this). There are start and 
width accessors. Sub-setting is always list-like (single-dimensional, 
along the ranges). Users wanting to access one of 'their' columns use $ 
or extract the metadata columns (via mcols()) as a DataFrame and then 
work on that. Maybe it's worth pointing out that the basic definitions 
are column-oriented: an IRanges instance contains start and width 
vectors; there is no 'IRange' class.
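
The shape of that design can be sketched in base R. This is not the 
IRanges implementation: the class 'Spans', the slot 'extra' (a plain 
data.frame standing in for the DataFrame of metadata columns) and the 
accessors 'spanStart' and 'spanWidth' are all made-up names:

```r
library(methods)

# Required 'columns' are slots; user columns live in a 'has a' slot.
setClass("Spans",
    representation(start = "integer", width = "integer",
                   extra = "data.frame"),
    validity = function(object) {
        n <- length(object@start)
        if (length(object@width) != n)
            "'start' and 'width' must have equal length"
        else if (nrow(object@extra) != n)
            "'extra' must have one row per range"
        else TRUE
    })

# Constructor: named arguments in '...' become user columns.
Spans <- function(start, width, ...) {
    extra <- data.frame(..., stringsAsFactors = FALSE)
    if (ncol(extra) == 0L)   # no user columns: keep one row per range
        extra <- data.frame(row.names = seq_along(start))
    new("Spans", start = as.integer(start), width = as.integer(width),
        extra = extra)
}

# Accessors for the required columns (plain functions for brevity).
spanStart <- function(x) x@start
spanWidth <- function(x) x@width

setMethod("length", "Spans", function(x) length(x@start))

# List-like sub-setting: always along the ranges, never by column.
setMethod("[", "Spans", function(x, i, j, ..., drop = TRUE)
    new("Spans", start = x@start[i], width = x@width[i],
        extra = x@extra[i, , drop = FALSE]))

# '$' reaches only the user's columns.
setMethod("$", "Spans", function(x, name) x@extra[[name]])

sp <- Spans(start = c(1, 10, 20), width = c(5, 5, 3),
            score = c(0.1, 0.9, 0.5))
sub <- sp[c(1, 3)]
```

Here sub keeps the first and third range together with their 'score' 
values, and sub$score returns the user column for those two ranges.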

The GRanges class (in the GenomicRanges package) 'has an' IRanges, but 
adds additional required slots ('seqnames' to reference the names of the 
chromosome sequences to which the ranges refer, 'strand' to indicate the 
strand to which the range belongs, etc.). So the pattern here avoids the 
'is a' relationship that simple class extension would imply.
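
A sketch of the 'has a' pattern, again with made-up names rather than 
the GenomicRanges implementation ('Spans', 'GenomeSpans' and 
'spanStart' are hypothetical, and seqnames and strand are simplified 
to character vectors):

```r
library(methods)

# A minimal ranges class to embed.
setClass("Spans", representation(start = "integer", width = "integer"))

# Composition: the ranges are a slot, not a parent class.
setClass("GenomeSpans",
    representation(ranges = "Spans", seqnames = "character",
                   strand = "character"),
    validity = function(object) {
        n <- length(object@ranges@start)
        if (length(object@seqnames) != n || length(object@strand) != n)
            "'seqnames' and 'strand' need one element per range"
        else TRUE
    })

# The outer class delegates to the object it contains.
setGeneric("spanStart", function(x) standardGeneric("spanStart"))
setMethod("spanStart", "Spans", function(x) x@start)
setMethod("spanStart", "GenomeSpans", function(x) spanStart(x@ranges))

gs <- new("GenomeSpans",
    ranges = new("Spans", start = c(1L, 10L), width = c(5L, 5L)),
    seqnames = c("chr1", "chr2"), strand = c("+", "-"))

is_a_spans <- is(gs, "Spans")   # no 'is a' relationship
starts <- spanStart(gs)         # start values via delegation
```

Because GenomeSpans does not extend Spans, it exposes only the methods 
that are written for it explicitly.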

The IRanges package is at

   http://bioconductor.org/packages/devel/bioc/html/IRanges.html

I've described the 'devel' version of Bioconductor

   http://bioconductor.org/developers/useDevel/

Martin


On 08/31/2012 08:39 AM, Bert Gunter wrote:
> To add to what David said ...
>
> Of course, there are already S3 "getters" and "setters" methods for data
> frames ("[.data.frame" and "[<-.data.frame" )*. These could clearly be
> extended -- i.e. the data.frame class could be extended and appropriate S3
> methods written. Whether you use S3 or S4 depends on the degree of control,
> type checking, reuse etc. you want/need. David's suggestion to look at
> Bioconductor is a good one.
>
> Cheers,
> Bert
> *If you are unfamiliar with the S3 extract methods, consult the R Language
> Definition Manual.
>
> On Fri, Aug 31, 2012 at 8:14 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>
>>
>> On Aug 31, 2012, at 5:57 AM, Ramiro Barrantes wrote:
>>
>>> Hello,
>>>
>>> I have again a "good practices"/programming theory question
>>> regarding data.frames.
>>>
>>> One of the fundamental objects that I use is the data frame with a
>>> particular set of columns that I would fill or get information
>>> from, and an entire system would revolve around getting information
>>> from or putting information to such data.frame.
>>>
>>> On a different OOP programming language I would be tempted to
>>> create a class that would "wrap-around" that data.frame and create
>>> "getters" and "setters" methods that would return whatever
>>> information I need. I started doing that using S4.
>>>
>>> Does anyone have examples of packages that use that approach or any
>>> suggestions?  It just seems to me that a class/object would be a
>>> better idea because it would create a single, hopefully well
>>> validated way to access information and edit the fundamental
>>> data.frame object, which would be helpful if there are several
>>> programmers on the team and/or if some of the data.frame
>>> manipulations are not straightforward and are best left
>>> encapsulated in a method of a class, and then have people use that
>>> method.  I would just like to know if there are reasons not to do
>>> it that way and if there are any examples of packages that use that
>>> approach and that I can learn from.
>>
>> You could argue that the entire BioConductor project represents such an
>> effort. It makes extensive use of S4 methods. I'm not a user so cannot
>> readily point to examples of S4 functions that have set. and get. methods
>> for particular sorts of dataframes, but I suspect you can pose the same
>> question on the BioC mailing list and get a more informed answer.
>>
>> --
>> David Winsemius, MD
>> Alameda, CA, USA
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



