[Bioc-devel] Syntactically correct names in DataFrames

Hahne, Florian florian.hahne at novartis.com
Fri Jun 29 10:13:01 CEST 2012

That makes sense. But since data.frames do support syntactically invalid
names through the check.names argument, I would think that DataFrame
should do that, too.
One of the use cases where this issue pops up for me is when I use a
GRanges object as a container for numeric sample data, where each
elementMetadata column holds values for a single sample. I would like to
retain the sample name information in the column names, but this breaks
for names such as "1DaySample1".
I must admit that I never liked the default behavior of data.frames. It
certainly makes sense in the modeling realm and for various historic
reasons, but nowadays data.frames have become such an integral part of
the R language that opting out of syntactically valid names via
check.names is not such an exotic thing to do anymore.
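
For illustration, a minimal base R sketch of the sample-name case (the
column names and values are made up; only data.frame and its check.names
argument are used, nothing from IRanges):

```r
# Default behaviour: check.names = TRUE passes the names through make.names().
df_checked <- data.frame("1DaySample1" = 1:3, Control = 4:6)
names(df_checked)   # "X1DaySample1" "Control"

# Opting out keeps the sample name exactly as given.
df_raw <- data.frame("1DaySample1" = 1:3, Control = 4:6, check.names = FALSE)
names(df_raw)       # "1DaySample1" "Control"

# A non-syntactic column is still reachable with [[ ]] or backticks.
df_raw[["1DaySample1"]]
df_raw$`1DaySample1`
```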

From:  Michael Lawrence <lawrence.michael at gene.com>
Date:  Thursday, June 28, 2012 3:58 PM
To:  NIBR <florian.hahne at novartis.com>
Cc:  "bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject:  Re: [Bioc-devel] Syntactically correct names in DataFrames

Hi Florian, 

A guiding principle in the design of DataFrame was consistency with
data.frame, so that is why we check for syntactic validity of the column
names.  The underlying reasons for this are probably historic and related
to the rough equivalence between lists and environments.

As for the error you encountered below, that seems to be fixed in devel.


On Thu, Jun 28, 2012 at 6:40 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:

Hi all,
I have been playing around with the DataFrame class a bit and realized
that it always enforces syntactically correct column names. Since it is a
generalization of the basic R data.frames I am not quite sure why that has
to be the case.

Assuming I start with a regular data.frame with non-standard names:

> foo <- data.frame("1a"=1:3, b=4:6, check.names=FALSE)
> foo
  1a b
1  1 4
2  2 5
3  3 6

Coercing this into a DataFrame forces a name change:
> DataFrame(foo)
DataFrame with 3 rows and 2 columns
        X1a         b
  <integer> <integer>
1         1         4
2         2         5
3         3         6

> as(foo, "DataFrame")
DataFrame with 3 rows and 2 columns
        X1a         b
  <integer> <integer>
1         1         4
2         2         5
3         3         6

My first intuition was to try this:
> DataFrame(foo, check.names=FALSE)
DataFrame with 3 rows and 3 columns
Error in matrix(unlist(lapply(object, function(x) paste("<", class(x),  :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In if (check.names) vnames <- make.names(vnames, unique = TRUE) :
  the condition has length > 1 and only the first element will be used

Now apparently there are multiple things going on here. First of all,
check.names is recycled by the DataFrame constructor because it thinks
that it is just another variable to add to the DataFrame later. The
initializer method however seems to recognize it for the coercion into a
data.frame, but it complains because its length is >1. Also the show
method is broken because things don't really match anymore. The DataTable
show method in IRanges seems to be the culprit here.
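
The mechanism suspected above can be sketched with a toy constructor. This
is only an illustration under the assumption that check.names ends up in
the dots; toy_frame is a hypothetical helper, not the IRanges code:

```r
# Toy constructor, NOT the IRanges implementation: it has no check.names
# formal, so a check.names argument is captured by `...` like any column.
toy_frame <- function(...) {
  cols <- list(...)
  n <- max(lengths(cols))
  # Recycle every argument to the common length, as a column would be.
  lapply(cols, rep_len, length.out = n)
}

res <- toy_frame(a = 1:3, check.names = FALSE)
names(res)        # "a" "check.names"
res$check.names   # FALSE FALSE FALSE
```

A logical vector of length 3 handed on to if (check.names) is the kind of
input that produces the "condition has length > 1" complaint; since R 4.2.0
this is an error rather than a warning.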

My simple question here is: why are syntactic names enforced at all? And
if that is a feature, couldn't there be a way to turn this off?

A very simple fix would be this:
Index: DataFrame-class.R
--- DataFrame-class.R   (revision 67116)
+++ DataFrame-class.R   (working copy)
@@ -183,7 +183,7 @@
     varlist <- unlist(varlist, recursive = FALSE, use.names = FALSE)
     nms <- unlist(varnames[ncols > 0L])
     if (check.names)
-      nms <- make.names(nms, unique = TRUE)
+      nms <- make.unique(nms)
     names(varlist) <- nms
   } else names(varlist) <- character(0)

Of course I didn't check all of the downstream effects, but I don't really
see why anything should rely on syntactically correct names. In case there
is, the erratic check.names behavior certainly needs some fixing, after
all it could just be a normal column name in the DataFrame.
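
For reference, a small base R comparison of the two functions swapped in
the patch (the example names are made up): make.names() rewrites names to
be syntactically valid and, with unique = TRUE, also de-duplicates them,
while make.unique() only de-duplicates.

```r
nms <- c("1a", "b", "b")

# Forces syntactic validity and uniqueness.
valid_nms <- make.names(nms, unique = TRUE)   # "X1a" "b" "b.1"

# Only resolves duplicates; "1a" is left untouched.
unique_nms <- make.unique(nms)                # "1a" "b" "b.1"
```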

