[Rd] proposal for "strict" versions of subsetting operators

Tony Plate tplate at attglobal.net
Tue Oct 28 05:49:44 MET 2003


I'd like to propose adding "strict" versions of the subsetting operators "[", "[[", and "$" to the R language.  These strict versions would be intended for use in programming rather than in interactive use.  They do not perform any form of partial string matching or opportunistic simplification such as dimension dropping by default.  They allow more precise specification of which dimensions to drop, which can be very useful when working with higher-dimensional arrays.  The syntax for them can be created by prefixing the existing subsetting operators with some single character such as "$", e.g.: x$[i, j], x$[[i]], and x$$component.  Such syntax would be an addition to the existing syntax and would not cause any existing code to behave differently.

I've created a set of patches that implement and test this syntax addition for R-1.8.0.  They can be downloaded from http://pws.prserv.net/tap/software/strict-subset-R.1.8.0.zip. These patches implement these strict subsetting operators for vectors, lists and arrays, with the following semantics:

(1) The drop argument can be logical or numeric, and the default is F.  If a drop is requested, it is non-optional (it is an error to request dropping a dimension that is not droppable).  Numeric values for drop, or logical vector values with length greater than one, can be used to specify exactly which dimensions to drop.  This is very useful when working with arrays with 3 dimensions or more.
(2) No partial string matching is performed (currently in R, partial string matching is used only for list components and for data frame row indexing).
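
As a rough illustration of rule (1), the selective, non-optional drop can be approximated in standard R by a small helper applied to the result of a drop=FALSE subset.  The name strict_drop and its argument handling are hypothetical and are not taken from the patches; this sketch accepts only numeric dimension indices or a logical vector with one entry per dimension.  The last lines show the kind of partial matching that rule (2) excludes, as "$" behaves in current R releases.

```r
# Hypothetical helper (not from the patches): drop exactly the requested
# dimensions of an array, and fail if any of them has extent other than 1.
strict_drop <- function(x, drop) {
  d <- dim(x)
  dn <- dimnames(x)
  if (is.logical(drop)) {
    stopifnot(length(drop) == length(d))
    drop <- which(drop)
  }
  stopifnot(all(drop %in% seq_along(d)))
  if (any(d[drop] != 1L)) {
    stop("cannot drop a dimension whose extent is not 1")
  }
  keep <- setdiff(seq_along(d), drop)
  if (length(keep) == 0L) {
    return(as.vector(x))  # every dimension dropped: return a plain vector
  }
  array(x, dim = d[keep], dimnames = if (!is.null(dn)) dn[keep])
}

x <- array(1:24, dim = c(2, 3, 4))
s <- x[1, , , drop = FALSE]                  # dim is 1 3 4
m <- strict_drop(s, 1)                       # dim is 3 4
m2 <- strict_drop(s, c(TRUE, FALSE, FALSE))  # same request, logical form
# Dimension 2 has extent 3, so asking to drop it is an error.
err <- tryCatch(strict_drop(s, 2), error = function(e) conditionMessage(e))

lst <- list(component = 1)
partial <- lst$comp                    # "$" partially matches "component"
exact <- lst[["comp", exact = TRUE]]   # NULL, since no name matches exactly
```

The helper fails loudly instead of silently keeping a dimension, which is the point of making a requested drop non-optional.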

I've not supplied any documentation in this patch, but if there is any interest in incorporating something like this into the R language, I'd be happy to supply some documentation.  I've also not carefully debugged these patches -- they were intended to test the concept, and they seem to work.  Again, if there is any real interest, I can look over them more carefully and test them more comprehensively.

What follows in this email is some discussion of, and rationale for, this addition.

The drop=T default for subsetting tends to catch both new and experienced users.  It generally gives users what they want (at least with matrices).  However, it surprises new users when they don't get what they want, and it fools experienced users because they forget or are too lazy to specify non-default behavior (i.e., drop=F) and they fail to detect this omission because their test cases don't create a situation in which the need for non-default behavior is apparent.

Having a different set of strict subsetting operators might relieve some of the long-standing tension between having subsetting operators support robust programming and making subsetting operators convenient for interactive and common uses (by performing opportunistic simplification, as they currently do).  The following two excerpts demonstrate this tension.

The R FAQ says:
>7.7 Why do my matrices lose dimensions?
>... After much discussion this has been determined to be a feature. ...
>... The drop = FALSE option should be used defensively when programming. ...

Bill Venables (William.Venables at cmis.CSIRO.AU) wrote in an email on an R mailing list on Mon, 15 Nov 1999:

>Re: [R] dimnames
>
>[material explaining the drop=T default for subsetting deleted]
>
>I can't help feeling that this convention, helpful as it might be
>in interactive work, is a bit of a mistake. It second-guesses
>what you mean rather than what you actually program. In that
>respect it is for me an uncomfortable reminder of the sort of
>thing that goes on in much MicroSoft software...

There are at least three ways in which having a set of strict subsetting operators could make R programs more robust and simpler:

(1) Users can make their own programs more robust by consistently using the strict subsetting operators (except when they really want opportunistic simplification).
(2) If a package writer adopts use of the strict subsetting operators, users can have a higher level of confidence that the functions in the package will correctly handle all data presented to them.
(3) Users can more easily work with higher-dimensional arrays.  Currently it requires somewhat cumbersome expressions or custom functions to, for example, extract an m x k matrix from an n x m x k array (because if one relies on the default in an expression like x[i, , ], a vector will result if m or k happens to be 1, and if one uses x[i, , ,drop=F], then one must explicitly manipulate the dim attribute to get a matrix).
Here is a brief transcript showing the more easily predicted behavior of the "$[" operator on a 3-d array from which a matrix "slice" is being extracted.  In one case (x1) the slice has both dimensions with length > 1, in the other case (x2), the slice has one dimension of length 1.

> x1 <- array(1:24, dim=c(2,3,4), dimnames=list(LETTERS[25:26],letters[1:3],LETTERS[1:4]))
> x2 <- x1$[,1,]
> dim(x1)
[1] 2 3 4
> dim(x2)
[1] 2 1 4
> dim(x1[1,,])
[1] 3 4
> dim(x1$[1,,])
[1] 1 3 4
> dim(x1$[1,,,drop=1])
[1] 3 4
> dim(x2[1,,])
NULL
> dim(x2$[1,,])
[1] 1 1 4
> dim(x2$[1,,,drop=1])
[1] 1 4
>
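
For comparison, here is a sketch of the kind of custom function needed to get the same matrix slices with only the standard operators.  The helper name slice_first is hypothetical; it subsets with drop=FALSE and then adjusts the dim and dimnames attributes by hand.

```r
x1 <- array(1:24, dim = c(2, 3, 4),
            dimnames = list(LETTERS[25:26], letters[1:3], LETTERS[1:4]))
x2 <- x1[, 1, , drop = FALSE]            # dim is 2 1 4

# Hypothetical helper: take slice i along the first dimension of a 3-d
# array and always return a matrix, whatever the other extents are.
slice_first <- function(x, i) {
  stopifnot(length(dim(x)) == 3L, length(i) == 1L)
  s <- x[i, , , drop = FALSE]
  dn <- dimnames(s)
  dim(s) <- dim(s)[-1]                   # assigning dim discards dimnames
  if (!is.null(dn)) dimnames(s) <- dn[-1]
  s
}

d1 <- dim(slice_first(x1, 1))            # 3 4
d2 <- dim(slice_first(x2, 1))            # 1 4, whereas x2[1, , ] has no dim
```

This gives the same dimensions as the drop=1 lines in the transcript, at the cost of a helper that has to be written separately for each slicing pattern.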

The ability to simply and robustly manipulate 3-d (and higher dimensional) arrays is probably going to become more important, as cheap computers now have the storage capacity to store in memory large amounts of data that are most naturally represented as higher dimensional arrays.  Quite a lot of financial data is like this, and I would guess that some biological data is also like this.

It would also be relatively easy to provide some quality-control tools that helped users consistently use more robust subsetting patterns (e.g., by prohibiting use of standard subsetting operations in programs, or warning when a standard subsetting operation was used without supplying the drop argument).
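
A minimal sketch of such a check, assuming it is enough to walk the parsed code and report every "[" call that has two or more index arguments and no drop argument.  The function name find_undropped_subsets is hypothetical, and a real tool would need to decide how to report locations and how to treat single-index calls.

```r
# Hypothetical checker: return the deparsed text of every "[" call with at
# least two index arguments that does not supply 'drop'.
find_undropped_subsets <- function(code) {
  hits <- character(0)
  walk <- function(e) {
    if (is.call(e)) {
      is_bracket <- identical(e[[1]], as.name("["))
      # length(e) >= 4 means: "[", the object, and at least two more arguments
      if (is_bracket && length(e) >= 4L && !("drop" %in% names(e))) {
        hits <<- c(hits, paste(deparse(e), collapse = " "))
      }
      for (arg in as.list(e)) {
        # empty index positions, as in x[i, ], appear as missing arguments
        if (!missing(arg)) walk(arg)
      }
    }
    invisible(NULL)
  }
  for (e in as.list(parse(text = code))) walk(e)
  hits
}

code <- "f <- function(x, i) {
  a <- x[i, ]
  b <- x[i, , drop = FALSE]
  a[1]
}"
hits <- find_undropped_subsets(code)
```

Here only the first subset in f is reported: the second supplies drop, and the third has a single index.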

Finally, it's worth noting that it would be possible to provide this functionality in other ways, such as with new functions for subsetting, and/or with new arguments and new behaviors for the existing subsetting operators.  However, code written using subsetting functions and/or arguments specifying behavior lacks the conciseness and visual scanability of code written using subsetting operators.  Admittedly, an operator like "$[" does not have quite the same elegance and simplicity as "[", but it is still quick to type and concise on the screen.

-- Tony Plate

tplate at acm.org


