[Bioc-devel] Syntactically correct names in DataFrames

Fri Jun 29 10:30:52 CEST 2012

Oh, and of course thanks for pointing out the fix in devel. Works as
expected now.
Florian

From:  Michael Lawrence <lawrence.michael at gene.com>
Date:  Thursday, June 28, 2012 3:58 PM
To:  NIBR <florian.hahne at novartis.com>
Cc:  "bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject:  Re: [Bioc-devel] Syntactically correct names in DataFrames


Hi Florian, 

A guiding principle in the design of DataFrame was consistency with
data.frame, so that is why we check for syntactic validity of the column
names.  The underlying reasons for this are probably historic and related
to the rough equivalence between lists and environments.
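
To illustrate the data.frame behavior being mirrored here, a minimal base R sketch (the column names are only examples, and no Bioconductor package is involved): by default data.frame() passes column names through make.names(), while check.names=FALSE keeps them as given, at the cost of needing backticks or [[ ]] to access them.

```r
# Default: names are made syntactically valid via make.names()
default_df <- data.frame("1a" = 1:3, b = 4:6)

# check.names = FALSE keeps the names exactly as supplied
raw_df <- data.frame("1a" = 1:3, b = 4:6, check.names = FALSE)

names(default_df)  # "X1a" "b"
names(raw_df)      # "1a"  "b"

# Non-syntactic names need backticks with `$`, or a string with `[[`
raw_df$`1a`
raw_df[["1a"]]
```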

As for the error you encountered below, that seems to be fixed in devel.

Michael

On Thu, Jun 28, 2012 at 6:40 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:

Hi all,
I have been playing around with the DataFrame class a bit and realized
that it always enforces syntactically correct column names. Since it is a
generalization of the basic R data.frames I am not quite sure why that has
to be the case.

Assuming I start with a regular data.frame with non-standard names:

> foo <- data.frame("1a"=1:3, b=4:6, check.names=FALSE)
> foo
  1a b
1  1 4
2  2 5
3  3 6


Coercing this into a DataFrame forces a name change:
> DataFrame(foo)
DataFrame with 3 rows and 2 columns
        X1a         b
  <integer> <integer>
1         1         4
2         2         5
3         3         6


> as(foo, "DataFrame")
DataFrame with 3 rows and 2 columns
        X1a         b
  <integer> <integer>
1         1         4
2         2         5
3         3         6


My first intuition was to try this:
> DataFrame(foo, check.names=FALSE)
DataFrame with 3 rows and 3 columns
Error in matrix(unlist(lapply(object, function(x) paste("<", class(x),  :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In if (check.names) vnames <- make.names(vnames, unique = TRUE) :
  the condition has length > 1 and only the first element will be used

Now apparently there are multiple things going on here. First of all,
check.names is recycled by the DataFrame constructor because it thinks
that it is just another variable to add to the DataFrame later. The
initializer method, however, seems to recognize it for the coercion into a
data.frame, but it complains because its length is >1. Also, the show
method is broken because things don't really match anymore. The DataTable
show method in IRanges seems to be the culprit here.
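
The recycling described above can be reproduced in base R with a toy constructor. This is only a sketch under stated assumptions: toy_frame() is a hypothetical stand-in, not the actual DataFrame code. It shows how a named argument that is not a formal parameter lands in `...` and gets treated as one more column.

```r
# Hypothetical constructor with no formal 'check.names' argument:
# everything passed by name ends up in '...' as a column.
toy_frame <- function(...) {
  cols <- list(...)
  n <- max(lengths(cols))
  # Recycle every column to the common length, as a constructor might
  lapply(cols, rep_len, length.out = n)
}

res <- toy_frame(x = 1:3, check.names = FALSE)
names(res)            # "x" "check.names"
res[["check.names"]]  # FALSE FALSE FALSE
```

Feeding such a length-3 logical into `if (check.names)` is what triggers the warning shown in the transcript; from R 4.2.0 onward a condition of length > 1 in `if` is an error instead of a warning.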

My simple question here is: why are syntactic names enforced at all? And
if that is a feature, couldn't there be a way to turn this off?

A very simple fix would be this:
Index: DataFrame-class.R
===================================================================

--- DataFrame-class.R   (revision 67116)
+++ DataFrame-class.R   (working copy)
@@ -183,7 +183,7 @@
     varlist <- unlist(varlist, recursive = FALSE, use.names = FALSE)
     nms <- unlist(varnames[ncols > 0L])
     if (check.names)
-      nms <- make.names(nms, unique = TRUE)
+      nms <- make.unique(nms)
     names(varlist) <- nms
   } else names(varlist) <- character(0)
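
For reference, the difference between the two functions swapped in the patch can be seen in base R alone. This is a small sketch with made-up names: make.names(unique = TRUE) both sanitizes and de-duplicates, whereas make.unique() only de-duplicates.

```r
nms <- c("1a", "b", "b")

# Sanitizes ("1a" becomes "X1a") and de-duplicates
syntactic <- make.names(nms, unique = TRUE)

# Only de-duplicates; non-syntactic names are left untouched
unique_only <- make.unique(nms)

syntactic    # "X1a" "b" "b.1"
unique_only  # "1a"  "b" "b.1"
```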



Of course I didn't check all of the downstream effects, but I don't really
see why anything should rely on syntactically correct names. If anything
does, the erratic check.names behavior certainly needs some fixing; after
all, it could just be a normal column name in the DataFrame.

Thanks,
Florian

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel