[Bioc-devel] Syntactically correct names in DataFrames

Fri Jun 29 19:28:08 CEST 2012

Hi Michael,

Here is a somewhat related issue with duplicated colnames (using
the latest IRanges devel):

   > data.frame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
     aa aa
   1  2  B
   2  3  C
   3  4  D

OK.

   > DataFrame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
   DataFrame with 3 rows and 2 columns
            aa          aa
     <integer> <character>
   1         2           B
   2         3           C
   3         4           D

OK.

But then:

   > DF <- DataFrame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
   > validObject(DF)
   Error in validObject(DF) :
     invalid class “DataFrame” object: duplicate column names
   > DF[ , 2:1]
   Error in validObject(.Object) :
     invalid class “DataFrame” object: duplicate column names

Why?
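
One way an object can come out of a constructor and still fail validObject() is if its slots were set without the validity method being run. The following is only a toy sketch in base R with the methods package. The class ToyFrame and its slots cols and nms are made up for illustration, and I am not claiming this is what the DataFrame constructor actually does:

```r
library(methods)

# Toy class, NOT the IRanges DataFrame: a list of columns plus their names.
setClass("ToyFrame",
    representation(cols = "list", nms = "character"),
    validity = function(object) {
        if (anyDuplicated(object@nms))
            return("duplicate column names")
        TRUE
    }
)

# new() with slots supplied runs the validity method, so start valid.
tf <- new("ToyFrame", cols = list(2:4, LETTERS[2:4]), nms = c("aa", "bb"))

# Direct slot assignment only checks the class of the new value,
# not the validity method, so the duplicate slips through.
tf@nms <- c("aa", "aa")

# An explicit validObject() call is what finally catches it.
res <- tryCatch(validObject(tf), error = function(e) conditionMessage(e))
print(res)
```

If something similar happens inside the constructor, it would explain why the object prints fine but any operation that revalidates it (like subsetting) fails.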

   > data.frame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)
     aa aa
   1  2  B
   2  3  C
   3  4  D

OK.

   > DataFrame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)
   DataFrame with 3 rows and 2 columns
            aa        aa.1
     <integer> <character>
   1         2           B
   2         3           C
   3         4           D

Not OK.

I also tend to think that automatic name mangling generally causes more
problems than it solves (if it solves any problem at all). The same goes
for automatic coercion from character to factor (which I'm glad
DataFrame() is not trying to mimic).
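
To make the mangling concrete, here is a minimal base-R sketch (no IRanges needed) of what make.names() and make.unique() do to a vector of names. The vector nms is just an illustration:

```r
nms <- c("1a", "b", "b")

# make.names() rewrites anything that is not a syntactic name and,
# with unique = TRUE, also de-duplicates.
mangled <- make.names(nms, unique = TRUE)   # "X1a" "b" "b.1"

# make.unique() only de-duplicates; non-syntactic names are left alone.
deduped <- make.unique(nms)                 # "1a" "b" "b.1"

print(mangled)
print(deduped)
```

Note that both functions still rename the second "b", so neither of them preserves duplicated names the way check.names=FALSE does in data.frame().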

Cheers,
H.

On 06/28/2012 06:58 AM, Michael Lawrence wrote:
> Hi Florian,
>
> A guiding principle in the design of DataFrame was consistency with
> data.frame, so that is why we check for syntactic validity of the column
> names.  The underlying reasons for this are probably historic and related
> to the rough equivalence between lists and environments.
>
> As for the error you encountered below, that seems to be fixed in devel.
>
> Michael
>
> On Thu, Jun 28, 2012 at 6:40 AM, Hahne, Florian
> <florian.hahne at novartis.com> wrote:
>
>> Hi all,
>> I have been playing around with the DataFrame class a bit and realized
>> that it always enforces syntactically correct column names. Since it is a
>> generalization of the basic R data.frames I am not quite sure why that has
>> to be the case.
>>
>> Assuming I start with a regular data.frame with non-standard names:
>>
>>> foo <- data.frame("1a"=1:3, b=4:6, check.names=FALSE)
>>> foo
>>   1a b
>> 1  1 4
>> 2  2 5
>> 3  3 6
>>
>>
>> Coercing this into a DataFrame forces a name change:
>>> DataFrame(foo)
>> DataFrame with 3 rows and 2 columns
>>         X1a         b
>>   <integer> <integer>
>> 1         1         4
>> 2         2         5
>> 3         3         6
>>
>>
>>> as(foo, "DataFrame")
>> DataFrame with 3 rows and 2 columns
>>         X1a         b
>>   <integer> <integer>
>> 1         1         4
>> 2         2         5
>> 3         3         6
>>
>>
>> My first intuition was to try this:
>>> DataFrame(foo, check.names=FALSE)
>> DataFrame with 3 rows and 3 columns
>> Error in matrix(unlist(lapply(object, function(x) paste("<", class(x),  :
>>   length of 'dimnames' [2] not equal to array extent
>> In addition: Warning message:
>> In if (check.names) vnames <- make.names(vnames, unique = TRUE) :
>>   the condition has length > 1 and only the first element will be used
>>
>> Now apparently there are multiple things going on here. First of all,
>> check.names is recycled by the DataFrame constructor because it thinks
>> that it is just another variable to add to the DataFrame later. The
>> initializer method however seems to recognize it for the coercion into a
>> data.frame, but it complains because its length is >1. Also, the show
>> method is broken because things don't really match anymore. The DataTable
>> show method in IRanges seems to be the culprit here.
>>
>> My simple question here is: why are syntactic names enforced at all? And
>> if that is a feature, couldn't there be a way to turn this off?
>>
>> A very simple fix would be this:
>> Index: DataFrame-class.R
>> ===================================================================
>> --- DataFrame-class.R   (revision 67116)
>> +++ DataFrame-class.R   (working copy)
>> @@ -183,7 +183,7 @@
>>      varlist <- unlist(varlist, recursive = FALSE, use.names = FALSE)
>>      nms <- unlist(varnames[ncols > 0L])
>>      if (check.names)
>> -      nms <- make.names(nms, unique = TRUE)
>> +      nms <- make.unique(nms)
>>      names(varlist) <- nms
>>    } else names(varlist) <- character(0)
>>
>>
>>
>> Of course I didn't check all of the downstream effects, but I don't really
>> see why anything should rely on syntactically correct names. In case
>> something does, the erratic check.names behavior certainly needs some
>> fixing; after all, it could just be a normal column name in the DataFrame.
>>
>> Thanks,
>> Florian
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319