[Bioc-devel] Syntactically correct names in DataFrames
Hervé Pagès
hpages at fhcrc.org
Sat Jun 30 06:22:53 CEST 2012
On 06/29/2012 02:35 PM, Michael Lawrence wrote:
>
>
> On Fri, Jun 29, 2012 at 10:28 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Michael,
>
> Here is a somewhat related issue with duplicated colnames (using
> the latest IRanges devel):
>
> > data.frame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
>   aa aa
> 1  2  B
> 2  3  C
> 3  4  D
>
> OK.
>
> > DataFrame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
> DataFrame with 3 rows and 2 columns
>          aa          aa
>   <integer> <character>
> 1         2           B
> 2         3           C
> 3         4           D
>
> OK.
>
> But then:
>
> > DF <- DataFrame(aa=2:4, aa=LETTERS[2:4], check.names=FALSE)
> > validObject(DF)
> Error in validObject(DF) :
> invalid class “DataFrame” object: duplicate column names
> > DF[ , 2:1]
> Error in validObject(.Object) :
> invalid class “DataFrame” object: duplicate column names
>
> Why?
>
>
> Because it's a bug. I added check.names last release at Florian's
> request and didn't test all of this. Thanks for finding these. In my
> book, an error should be thrown when there are duplicate names and
> isTRUE(check.names). Anyway, I checked in the fixes.
Thanks for the fix. I was worried that validation rejecting duplicated
colnames would be intentional. Looks like I don't need to worry anymore.
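
For anyone following along without the devel IRanges, the base R side of the comparison can be reproduced on its own. This is only a minimal sketch using data.frame(); the object names df_keep and df_fix are made up for illustration and nothing here touches DataFrame itself:

```r
# Base R only; no Bioconductor packages are needed for this sketch.
# df_keep and df_fix are illustrative names, not from the thread.
df_keep <- data.frame(aa = 2:4, aa = LETTERS[2:4], check.names = FALSE)
df_fix  <- data.frame(aa = 2:4, aa = LETTERS[2:4], check.names = TRUE)

names(df_keep)  # duplicated names are kept exactly as supplied
names(df_fix)   # names have been made syntactic and unique

# TRUE when at least one column name is repeated
has_dups <- anyDuplicated(names(df_keep)) > 0
```

With check.names=FALSE the second column keeps the name "aa", which is the behaviour DataFrame() is expected to mirror.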
Thanks again,
H.
>
> Michael
>
> > data.frame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)
>   aa aa
> 1  2  B
> 2  3  C
> 3  4  D
>
> OK.
>
> > DataFrame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)
> DataFrame with 3 rows and 2 columns
>          aa        aa.1
>   <integer> <character>
> 1         2           B
> 2         3           C
> 3         4           D
>
> Not OK.
>
> I also tend to think that automatic name mangling generally causes
> more problems than it solves (if it solves any problem at all).
> Same thing with automatic coercion from character to factor (which I'm
> glad DataFrame() is not trying to mimic).
>
> Cheers,
> H.
>
>
>
> On 06/28/2012 06:58 AM, Michael Lawrence wrote:
>
> Hi Florian,
>
> A guiding principle in the design of DataFrame was consistency with
> data.frame, so that is why we check for syntactic validity of the column
> names. The underlying reasons for this are probably historic and related
> to the rough equivalence between lists and environments.
>
> As for the error you encountered below, that seems to be fixed in devel.
>
> Michael
>
> On Thu, Jun 28, 2012 at 6:40 AM, Hahne, Florian
> <florian.hahne at novartis.com> wrote:
>
> Hi all,
> I have been playing around with the DataFrame class a bit and realized
> that it always enforces syntactically correct column names. Since it is a
> generalization of the basic R data.frames I am not quite sure why that has
> to be the case.
>
> Assuming I start with a regular data.frame with non-standard names:
>
> foo <- data.frame("1a"=1:3, b=4:6, check.names=FALSE)
> foo
>
>   1a b
> 1  1 4
> 2  2 5
> 3  3 6
>
>
> Coercing this into a DataFrame forces a name change:
>
> DataFrame(foo)
>
> DataFrame with 3 rows and 2 columns
>         X1a         b
>   <integer> <integer>
> 1         1         4
> 2         2         5
> 3         3         6
>
>
> as(foo, "DataFrame")
>
> DataFrame with 3 rows and 2 columns
>         X1a         b
>   <integer> <integer>
> 1         1         4
> 2         2         5
> 3         3         6
>
>
> My first intuition was to try this:
>
> DataFrame(foo, check.names=FALSE)
>
> DataFrame with 3 rows and 3 columns
> Error in matrix(unlist(lapply(object, function(x) paste("<", class(x), :
>   length of 'dimnames' [2] not equal to array extent
> In addition: Warning message:
> In if (check.names) vnames <- make.names(vnames, unique = TRUE) :
>   the condition has length > 1 and only the first element will be used
>
> Now apparently there are multiple things going on here. First of all,
> check.names is recycled by the DataFrame constructor because it thinks
> that it is just another variable to add to the DataFrame later. The
> initializer method however seems to recognize it for the coercion into a
> data.frame, but it complains because its length is >1. Also the show
> method is broken because things don't really match anymore. The DataTable
> show method in IRanges seems to be the culprit here.
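
The first point above, an argument being swallowed as "just another variable", is easy to reproduce in plain R. The two functions below are hypothetical stand-ins written for illustration only; they are not the IRanges constructor and only mimic how a named argument ends up in '...' when it is not a formal argument:

```r
# Hypothetical constructors, for illustration only (not IRanges code).

# check.names is NOT a formal argument here, so it is captured by '...'
# and treated like any other column.
cols_without_formal <- function(...) {
  cols <- list(...)
  names(cols)
}

# check.names IS a formal argument here, so it is matched by name and
# never reaches '...'.
cols_with_formal <- function(..., check.names = TRUE) {
  cols <- list(...)
  names(cols)
}

swallowed <- cols_without_formal(a = 1:3, check.names = FALSE)
matched   <- cols_with_formal(a = 1:3, check.names = FALSE)

swallowed  # "a" "check.names": the flag became a column
matched    # "a"
```

In the first case the flag would then be recycled to the number of rows like any other short column, which is consistent with the "3 columns" reported in the output above.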
>
> My simple question here is: why are syntactic names enforced at all? And
> if that is a feature, couldn't there be a way to turn this off?
>
> A very simple fix would be this:
> Index: DataFrame-class.R
> ===================================================================
> --- DataFrame-class.R (revision 67116)
> +++ DataFrame-class.R (working copy)
> @@ -183,7 +183,7 @@
> varlist <- unlist(varlist, recursive = FALSE, use.names = FALSE)
> nms <- unlist(varnames[ncols > 0L])
> if (check.names)
> - nms <- make.names(nms, unique = TRUE)
> + nms <- make.unique(nms)
> names(varlist) <- nms
> } else names(varlist) <- character(0)
>
>
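
The practical difference between the two functions swapped in the patch above can be seen with base R alone. A small sketch; the vector nms is made-up example input, not taken from IRanges:

```r
# Made-up column names: one starts with a digit, one is duplicated,
# one contains a space.
nms <- c("1a", "b", "b", "my col")

# make.names() enforces syntactic validity and, with unique = TRUE,
# also de-duplicates.
syntactic <- make.names(nms, unique = TRUE)

# make.unique() only de-duplicates and leaves everything else alone.
unique_only <- make.unique(nms)

syntactic    # "X1a" "b" "b.1" "my.col"
unique_only  # "1a"  "b" "b.1" "my col"
```

So with the patched line, check.names=TRUE would still rename duplicates but would no longer rewrite names such as "1a".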
>
> Of course I didn't check all of the downstream effects, but I don't really
> see why anything should rely on syntactically correct names. In case
> anything does, the erratic check.names behavior certainly needs some
> fixing; after all, it could just be a normal column name in the DataFrame.
>
> Thanks,
> Florian
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319