[Rd] subscripting a data.frame (without changing row order) changes internal row.names

Joshua Ulrich josh.m.ulrich at gmail.com
Mon Nov 10 21:05:34 CET 2014


On Mon, Nov 10, 2014 at 12:35 PM, Dr Gregory Jefferis
<jefferis at mrc-lmb.cam.ac.uk> wrote:
> Dear R-devel,
>
> Can anyone help me to understand this? It seems that subscripting the rows
> of a data.frame without actually changing their order, somehow changes an
> internal representation of row.names that is revealed by e.g.
> dput/dump/serialize
>
> I have read the docs and inspected the (R) code for data.frame, rownames,
> row.names and dput without enlightenment.
>
Look at ?.row_names_info (which is mentioned in the See Also section
of ?row.names) and its type argument.  Also see the discussion here:
http://stackoverflow.com/q/26468746/271616
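
As a minimal sketch of what the type argument exposes, the snippet below
rebuilds the two data frames from your example and queries each one with
type = 0L, 1L and 2L. The variable names (raw_df, n_df, and so on) are only
illustrative, and the exact sign reported for df2 may depend on the R version,
so no output is shown.

```r
# Rebuild the two data frames from the original post
df  <- data.frame(a = 1:10, b = 1)
df2 <- df[1:nrow(df), ]

# type = 0L: the internal "row.names" attribute exactly as stored
raw_df  <- .row_names_info(df,  type = 0L)
raw_df2 <- .row_names_info(df2, type = 0L)

# type = 1L (the default): number of rows, negated for "automatic" row names
n_df  <- .row_names_info(df,  type = 1L)
n_df2 <- .row_names_info(df2, type = 1L)

# type = 2L: number of rows implied by the attribute, never negative
rows_df  <- .row_names_info(df,  type = 2L)
rows_df2 <- .row_names_info(df2, type = 2L)

# The user-visible row names agree even when the stored form does not
same_visible <- identical(row.names(df), row.names(df2))

print(list(raw = list(df = raw_df, df2 = raw_df2),
           signed_n = c(df = n_df, df2 = n_df2),
           same_visible = same_visible))
```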

> df=data.frame(a=1:10, b=1)
> dput(df)
> df2=df[1:nrow(df), ]
> # R thinks they are equal (so do I!)
> all.equal(df, df2)
> dput(df2)
>
> Looking at the output of the dputs
>
>> dput(df)
>
> structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
> .Names = c("a", "b"), row.names = c(NA, -10L), class = "data.frame")
>>
>> dput(df2)
>
> structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)),
> .Names = c("a", "b"), row.names = c(NA, 10L), class = "data.frame")
>
> we have row.names = c(NA, -10L) in the first case and row.names = c(NA, 10L)
> in the second, so somehow these objects have a different representation
>
> Can anyone explain why? This has come up because
>
The first are "automatic" row names.  The second are a compact form of
the integer row names 1:10, as mentioned in ?row.names.  I'm not certain
of the root cause/reason, but the second object will not have
"automatic" row names because you have subsetted it with a non-missing 'i'.
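
If the row names carry no information, one way to make the two internal
representations agree before serializing or hashing is to reset them to
automatic row names with row.names(x) <- NULL. This is only a sketch:
reset_row_names() is a hypothetical helper of my own, not part of base R or
digest, and it throws away any meaningful row names. I have not checked it
against digest() itself, so the comparison below uses identical().

```r
# Hypothetical helper (not from base R or digest): force automatic row names
reset_row_names <- function(x) {
  stopifnot(is.data.frame(x))
  row.names(x) <- NULL  # assigning NULL restores automatic row names
  x
}

# The two data frames from the original post
df  <- data.frame(a = 1:10, b = 1)
df2 <- df[1:nrow(df), ]

# Strict comparison before and after normalising the row names
same_before <- identical(df, df2)
same_after  <- identical(reset_row_names(df), reset_row_names(df2))

print(c(before = same_before, after = same_after))
```

Apply the same normalisation to every object you intend to hash, so that
digests are always computed on one representation.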

>> library(digest)
>> digest(df)==digest(df2)
>
> [1] FALSE
>
> digest uses serialize under the hood, but serialize, dput and dump all show
> the same effect (I've pasted an example below using dump, md5sum from base
> R).
>
> Many thanks for any enlightenment! More generally is there any way to
> calculate a digest of a data.frame that could get round this issue or is
> that not possible?
>
> Best wishes,
>
> Greg.
>
>
> A digest using base R:
>
> library(tools)
> td=tempfile()
> dir.create(td)
> tempfiles=file.path(td,c("df", "df2"))
> dump("df",tempfiles[1])
> dump("df2",tempfiles[2])
> md5sum(tempfiles)
>
> # different md5sum
>
>> sessionInfo() # for my laptop but also observed on R 3.1.2
>
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-apple-darwin13.1.0 (64-bit)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] tools     stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] nat_1.5.14      nat.utils_0.4.2 digest_0.6.4    Rvcg_0.9
> devtools_1.6.1  igraph_0.7.1
> [7] testthat_0.9.1  rgl_0.93.1098
>
> loaded via a namespace (and not attached):
>  [1] codetools_0.2-9   filehash_2.2-2    nabor_0.4.3       parallel_3.1.1
> plyr_1.8.1
>  [6] Rcpp_0.11.3       rstudio_0.98.1062 rstudioapi_0.1    XML_3.98-1.1
> yaml_2.1.13
>
> --
> Gregory Jefferis, PhD
> Division of Neurobiology
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue
> Cambridge Biomedical Campus
> Cambridge, CB2 OQH, UK
>
> http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis
> http://jefferislab.org
> http://flybrain.stanford.edu
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



-- 
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com


