[Bioc-devel] Syntactically correct names in DataFrames

Sat Jun 30 06:22:53 CEST 2012

On 06/29/2012 02:35 PM, Michael Lawrence wrote:
>
>
> On Fri, Jun 29, 2012 at 10:28 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     Hi Michael,
>
>     Here is a somewhat related issue with duplicated colnames (using
>     the latest IRanges devel):
>
>       > data.frame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
>         aa aa
>       1  2  B
>       2  3  C
>       3  4  D
>
>     OK.
>
>       > DataFrame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
>       DataFrame with 3 rows and 2 columns
>                aa          aa
>         <integer> <character>
>       1         2           B
>       2         3           C
>       3         4           D
>
>     OK.
>
>     But then:
>
>       > DF <- DataFrame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
>       > validObject(DF)
>       Error in validObject(DF) :
>         invalid class “DataFrame” object: duplicate column names
>       > DF[ , 2:1]
>       Error in validObject(.Object) :
>         invalid class “DataFrame” object: duplicate column names
>
>     Why?
>
>
> Because it's a bug. I added check.names last release at Florian's
> request and didn't test all of this. Thanks for finding these. In my
> book, an error should be thrown when there are duplicate names and
> isTRUE(check.names). Anyway, I checked in the fixes.

Thanks for the fix. I was worried that the validity check rejecting duplicated
colnames was intentional. Looks like I don't need to worry anymore.

Thanks again,
H.

>
> Michael
>
>       > data.frame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)
>         aa aa
>       1  2  B
>       2  3  C
>       3  4  D
>
>     OK.
>
>       > DataFrame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)
>       DataFrame with 3 rows and 2 columns
>                aa        aa.1
>         <integer> <character>
>       1         2           B
>       2         3           C
>       3         4           D
>
>     Not OK.
>
>     I also tend to think that automatic name mangling generally causes
>     more problems than it solves (if it solves any problem at all).
>     Same thing with automatic coercion from character to factor (which I'm
>     glad DataFrame() is not trying to mimic).
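
For reference, both behaviors mentioned above can be reproduced with base R
alone. This is only a sketch using data.frame(), not DataFrame(); the
object names (mangled, kept, coerced) are made up for illustration.

```r
# Base R only; data.frame() stands in for the behavior under discussion.
# The default check.names = TRUE silently renames the duplicated column.
mangled <- data.frame(aa = 2:4, aa = LETTERS[2:4])
names(mangled)

# check.names = FALSE keeps the names exactly as given.
kept <- data.frame(aa = 2:4, aa = LETTERS[2:4], check.names = FALSE)
names(kept)

# Character-to-factor coercion, requested explicitly here because the
# default for stringsAsFactors differs between R versions.
coerced <- data.frame(x = LETTERS[2:4], stringsAsFactors = TRUE)
class(coerced$x)
```
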
>
>     Cheers,
>     H.
>
>
>
>     On 06/28/2012 06:58 AM, Michael Lawrence wrote:
>
>         Hi Florian,
>
>         A guiding principle in the design of DataFrame was consistency with
>         data.frame, so that is why we check for syntactic validity of the
>         column names.  The underlying reasons for this are probably historic
>         and related to the rough equivalence between lists and environments.
>
>         As for the error you encountered below, that seems to be fixed in
>         devel.
>
>         Michael
>
>         On Thu, Jun 28, 2012 at 6:40 AM, Hahne, Florian
>         <florian.hahne at novartis.com> wrote:
>
>             Hi all,
>             I have been playing around with the DataFrame class a bit and
>             realized that it always enforces syntactically correct column
>             names. Since it is a generalization of the basic R data.frame, I
>             am not quite sure why that has to be the case.
>
>             Assuming I start with a regular data.frame with non-standard names:
>
>                 foo <- data.frame("1a"=1:3, b=4:6, check.names=FALSE)
>                 foo
>
>               1a b
>             1  1 4
>             2  2 5
>             3  3 6
>
>
>             Coercing this into a DataFrame forces a name change:
>
>                 DataFrame(foo)
>
>             DataFrame with 3 rows and 2 columns
>                     X1a         b
>               <integer> <integer>
>             1         1         4
>             2         2         5
>             3         3         6
>
>
>                 as(foo, "DataFrame")
>
>             DataFrame with 3 rows and 2 columns
>                     X1a         b
>               <integer> <integer>
>             1         1         4
>             2         2         5
>             3         3         6
>
>
>             My first intuition was to try this:
>
>                 DataFrame(foo, check.names=FALSE)
>
>             DataFrame with 3 rows and 3 columns
>             Error in matrix(unlist(lapply(object, function(x) paste("<", class(x),  :
>               length of 'dimnames' [2] not equal to array extent
>             In addition: Warning message:
>             In if (check.names) vnames <- make.names(vnames, unique = TRUE) :
>               the condition has length > 1 and only the first element will be used
>
>             Now apparently there are multiple things going on here. First of
>             all, check.names is recycled by the DataFrame constructor because
>             it thinks that it is just another variable to add to the DataFrame
>             later. The initializer method, however, seems to recognize it for
>             the coercion into a data.frame, but it complains because its
>             length is > 1. Also, the show method is broken because things
>             don't really match anymore. The DataTable show method in IRanges
>             seems to be the culprit here.
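
The "condition has length > 1" warning quoted above can be avoided with an
isTRUE() guard around the flag. The sketch below is base R only and is not
the IRanges code; it assumes the recycled check.names arrived at the name
check as a length-3 logical vector, which is a guess based on the "3 rows
and 3 columns" output.

```r
# Assumption: the recycled argument reached the name check as a
# length-3 logical vector rather than a single TRUE/FALSE.
check.names <- rep(FALSE, 3)

nms <- c("1a", "b")

# isTRUE() is TRUE only for a single, non-NA TRUE, so a vector-valued
# flag is treated as "do not mangle" instead of reaching if() directly.
if (isTRUE(check.names)) {
  nms <- make.names(nms, unique = TRUE)
}
nms
```

Note that recent R versions raise an error rather than a warning when if()
is given a condition of length greater than one, so a guard of this kind
matters more now than it did in 2012.
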
>
>             My simple question here is: why are syntactic names enforced at
>             all? And if that is a feature, couldn't there be a way to turn it
>             off?
>
>             A very simple fix would be this:
>             Index: DataFrame-class.R
>             ===================================================================
>             --- DataFrame-class.R   (revision 67116)
>             +++ DataFrame-class.R   (working copy)
>             @@ -183,7 +183,7 @@
>                  varlist <- unlist(varlist, recursive = FALSE, use.names = FALSE)
>                  nms <- unlist(varnames[ncols > 0L])
>                  if (check.names)
>             -      nms <- make.names(nms, unique = TRUE)
>             +      nms <- make.unique(nms)
>                  names(varlist) <- nms
>                } else names(varlist) <- character(0)
>
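
To make the effect of the one-line patch concrete, here is a small base R
comparison of the two calls it swaps. The input vector is made up for
illustration; make.names() and make.unique() are the only functions used.

```r
nms <- c("1a", "b", "b")

# Before the patch: names are made syntactic and unique.
syntactic <- make.names(nms, unique = TRUE)
syntactic    # "X1a" "b" "b.1"

# After the patch: names are only made unique.
unique_only <- make.unique(nms)
unique_only  # "1a" "b" "b.1"
```

So the patch keeps non-syntactic names such as "1a" but still renames
duplicates, which is a separate question from the duplicated-colnames issue
discussed at the top of this message.
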
>
>
>             Of course I didn't check all of the downstream effects, but I
>             don't really see why anything should rely on syntactically
>             correct names. In case something does, the erratic check.names
>             behavior certainly needs some fixing; after all, check.names
>             could just be a normal column name in the DataFrame.
>
>             Thanks,
>             Florian
>
>             _______________________________________________
>             Bioc-devel at r-project.org mailing list
>             https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319