[Bioc-devel] Syntactically correct names in DataFrames

Thu Jun 28 15:40:32 CEST 2012

Hi all,
I have been playing around with the DataFrame class a bit and realized
that it always enforces syntactically correct column names. Since DataFrame
is a generalization of base R's data.frame, I am not quite sure why that has
to be the case.

Assuming I start with a regular data.frame with non-standard names:

> foo <- data.frame("1a"=1:3, b=4:6, check.names=FALSE)
> foo
  1a b
1  1 4
2  2 5
3  3 6


Coercing this into a DataFrame forces a name change:
> DataFrame(foo)
DataFrame with 3 rows and 2 columns
        X1a         b
  <integer> <integer>
1         1         4
2         2         5
3         3         6


> as(foo, "DataFrame")
DataFrame with 3 rows and 2 columns
        X1a         b
  <integer> <integer>
1         1         4
2         2         5
3         3         6


My first intuition was to try this:
> DataFrame(foo, check.names=FALSE)
DataFrame with 3 rows and 3 columns
Error in matrix(unlist(lapply(object, function(x) paste("<", class(x),  :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In if (check.names) vnames <- make.names(vnames, unique = TRUE) :
  the condition has length > 1 and only the first element will be used

Now apparently there are multiple things going on here. First of all,
check.names is recycled by the DataFrame constructor because it thinks
that it is just another variable to add to the DataFrame later. The
initializer method however seems to recognize it for the coercion into a
data.frame, but it complains because its length is >1. Also the show
method is broken because things don't really match anymore. The DataTable
show method in IRanges seems to be the culprit here.
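
The recycling part can be illustrated without IRanges. This is only a
minimal base R sketch, not the actual DataFrame code: toy_frame is a
hypothetical function in which check.names is not a formal argument, so it
lands in ... and is treated like any other column. Whether the real
constructor handles it exactly this way is an assumption.

```r
# Hypothetical stand-in for a constructor without a check.names formal.
toy_frame <- function(..., row.names = NULL) {
  args <- list(...)
  # Every named argument in ... is treated as a column and recycled
  # to the length of the longest one.
  n <- max(lengths(args))
  lapply(args, rep_len, length.out = n)
}

cols <- toy_frame(a = 1:3, b = 4:6, check.names = FALSE)
print(names(cols))        # check.names shows up as a third "column"
print(cols$check.names)   # and has been recycled to length 3
```

A length-3 logical like this is what later reaches the if (check.names)
test and triggers the "condition has length > 1" warning.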

My simple question here is: why are syntactic names enforced at all? And
if that is a feature, couldn't there be a way to turn this off?

A very simple fix would be this:
Index: DataFrame-class.R
===================================================================

--- DataFrame-class.R	(revision 67116)
+++ DataFrame-class.R	(working copy)
@@ -183,7 +183,7 @@
     varlist <- unlist(varlist, recursive = FALSE, use.names = FALSE)
     nms <- unlist(varnames[ncols > 0L])
     if (check.names)
-      nms <- make.names(nms, unique = TRUE)
+      nms <- make.unique(nms)
     names(varlist) <- nms
   } else names(varlist) <- character(0)
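
For reference, the difference between the two base R functions swapped in
the patch can be seen in a small sketch. The input vector is made up for
illustration and includes a duplicate to show that both calls still
deduplicate:

```r
nms <- c("1a", "b", "b")

# make.names() repairs syntax and, with unique = TRUE, also deduplicates.
syntactic <- make.names(nms, unique = TRUE)
print(syntactic)    # "X1a" "b" "b.1"

# make.unique() only deduplicates and leaves "1a" untouched.
unique_only <- make.unique(nms)
print(unique_only)  # "1a" "b" "b.1"
```

So with the patch applied, names would still be made unique but would no
longer be rewritten into syntactically valid ones.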



Of course I didn't check all of the downstream effects, but I don't really
see why anything should rely on syntactically correct names. In case
something does, the erratic check.names behavior certainly needs some
fixing; after all, check.names could just be a normal column name in the
DataFrame.

Thanks,
Florian