[R] dplyr/summarize does not create a true data frame
John Posner
john.posner at MJBIOSTAT.COM
Sun Nov 23 17:42:58 CET 2014
Thanks to John Kane for an off-list consultation. As the following annotated transcript shows, it's the group_by() function that transforms a data frame into something else: a "grouped_df" object that *looks* identical to the original data frame (e.g. the rows are in the original order -- *not* grouped, as arrange() would do), but does not always act like a data frame.
> library(dplyr)
> # set up data frame, and show its structure [ see below for clean copy of dput() code ]
>
> frm = structure(list(Id = structure(1:10, .Label = c("P01", "P02",
+ "P03", "P04", "P05", "P06", "P07", "P08", "P09", "P10"), class = "factor"),
+ Sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("Female",
+ "Male"), class = "factor"), Height = structure(c(1L, 1L,
+ 3L, 2L, 1L, 3L, 1L, 2L, 1L, 1L), .Label = c("Short", "Medium",
+ "Tall"), class = "factor"), Value = c(69.47, 64.61, 74.77,
+ 73.31, 64.76, 72.78, 64.64, 55.96, 60.45, 51.11)), .Names = c("Id",
+ "Sex", "Height", "Value"), row.names = c(NA, -10L), class = "data.frame")
>
> str(frm)
'data.frame': 10 obs. of 4 variables:
$ Id : Factor w/ 10 levels "P01","P02","P03",..: 1 2 3 4 5 6 7 8 9 10
$ Sex : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 2 1 2 2 1
$ Height: Factor w/ 3 levels "Short","Medium",..: 1 1 3 2 1 3 1 2 1 1
$ Value : num 69.5 64.6 74.8 73.3 64.8 ...
> # run group_by() on data frame, and show resulting structure
>
> after.group_by = frm %>% group_by(Sex, Height)
> str(after.group_by)
Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 10 obs. of 4 variables:
$ Id : Factor w/ 10 levels "P01","P02","P03",..: 1 2 3 4 5 6 7 8 9 10
$ Sex : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 2 1 2 2 1
$ Height: Factor w/ 3 levels "Short","Medium",..: 1 1 3 2 1 3 1 2 1 1
$ Value : num 69.5 64.6 74.8 73.3 64.8 ...
- attr(*, "vars")=List of 2
..$ : symbol Sex
..$ : symbol Height
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 5
..$ : int 1 6 9
..$ : int 2
..$ : int 0 4 8
..$ : int 3 7
..$ : int 5
- attr(*, "group_sizes")= int 3 1 3 2 1
- attr(*, "biggest_group_size")= int 3
- attr(*, "labels")='data.frame': 5 obs. of 2 variables:
..$ Sex : Factor w/ 2 levels "Female","Male": 1 1 2 2 2
..$ Height: Factor w/ 3 levels "Short","Medium",..: 1 3 1 2 3
..- attr(*, "vars")=List of 2
.. ..$ : symbol Sex
.. ..$ : symbol Height
> # the two data structures *seem* to be the same ...
> frm == after.group_by
Id Sex Height Value
[1,] TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE
...etc.
> # ... but they're not
> frm[4]
Value
1 69.47
2 64.61
...etc.
> after.group_by[4]
Error in eval(expr, envir, enclos) : index out of bounds
> # fortunately, we can convert back to a true data frame
> as.data.frame(after.group_by)[4]
Value
1 69.47
2 64.61
...etc.
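
The check-and-convert step above can be sketched as a short script. This is only an illustration under stated assumptions: the small `frm` below is a hypothetical stand-in, not the data from this thread, and the variable names (`grouped`, `plain`, `ungrouped`, `value_col`) are made up for the example. Whether `[` fails on a grouped object may depend on the installed dplyr version, so the sketch only inspects classes and converts before indexing by position.

```r
library(dplyr)

# Hypothetical stand-in data (not the data from the thread)
frm <- data.frame(
  Sex    = c("Male", "Female", "Female", "Male"),
  Height = c("Short", "Short", "Tall", "Tall"),
  Value  = c(69.47, 64.61, 74.77, 73.31)
)

grouped <- frm %>% group_by(Sex, Height)

# The grouped object still inherits from data.frame,
# but it also carries the extra "grouped_df" class
is_df      <- inherits(grouped, "data.frame")
is_grouped <- inherits(grouped, "grouped_df")

# Two ways to drop the grouping before indexing by position
plain     <- as.data.frame(grouped)  # back to a base data.frame
ungrouped <- ungroup(grouped)        # still a tbl, but no longer grouped

# Positional indexing on the plain data frame: one-column data frame (Value)
value_col <- plain[3]
```

Checking `inherits(x, "grouped_df")` before positional indexing is one way to decide whether a conversion is needed; `as.data.frame()` is the route used in the transcript, and `ungroup()` is an alternative when the tbl classes should be kept.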
################################## dput() code below
structure(list(Id = structure(1:10, .Label = c("P01", "P02",
"P03", "P04", "P05", "P06", "P07", "P08", "P09", "P10"), class = "factor"),
Sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("Female",
"Male"), class = "factor"), Height = structure(c(1L, 1L,
3L, 2L, 1L, 3L, 1L, 2L, 1L, 1L), .Label = c("Short", "Medium",
"Tall"), class = "factor"), Value = c(69.47, 64.61, 74.77,
73.31, 64.76, 72.78, 64.64, 55.96, 60.45, 51.11)), .Names = c("Id",
"Sex", "Height", "Value"), row.names = c(NA, -10L), class = "data.frame")
> -----Original Message-----
> From: John Kane [mailto:jrkrideau at inbox.com]
> Sent: Friday, November 21, 2014 12:33 PM
> To: John Posner; 'r-help at r-project.org'
> Subject: RE: [R] dplyr/summarize does not create a true data frame
>
> Your code in creating 'frm' is not working for me and it is complicated enough
> that I don't want to work it out. See ?dput for a better way to supply data.
> Also see:
> https://github.com/hadley/devtools/wiki/Reproducibility
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>
> That said, I don't see why 'my.output[4]' is not working. Try something like
> str(frm) to see what you have there and/or resubmit the data in dput format
>
> See simple example below:
>
> dat1 <- data.frame(aa = sample(1:20, 100, replace = TRUE), bb = 1:100 )
> dat1[2]
>
> John Kane
> Kingston ON Canada
>
>
> > -----Original Message-----
> > From: john.posner at mjbiostat.com
> > Sent: Fri, 21 Nov 2014 17:10:16 +0000
> > To: r-help at r-project.org
> > Subject: [R] dplyr/summarize does not create a true data frame
> >
> > I got an error when trying to extract a 1-column subset of a data
> > frame (called "my.output") created by dplyr/summarize. The ncol()
> > function says that my.output has 4 columns, but "my.output[4]" fails.
> > Note that converting my.output using as.data.frame() makes for a happy ending.
> >
> > Is this the intended behavior of dplyr?
> >
> > Tx,
> > John
> >
> >> library(dplyr)
> >
> >> # set up data frame
> >> rows = 100
> >> repcnt = 50
> >> sexes = c("Female", "Male")
> >> heights = c("Med", "Short", "Tall")
> >
> >> frm = data.frame(
> > + Id = paste("P", sprintf("%04d", 1:rows), sep=""),
> > + Sex = sample(rep(sexes, repcnt), rows, replace=T),
> > + Height = sample(rep(heights, repcnt), rows, replace=T),
> > + V1 = round(runif(rows)*25, 2) + 50,
> > + V2 = round(runif(rows)*1000, 2) + 50,
> > + V3 = round(runif(rows)*350, 2) - 175
> > + )
> >>
> >> # use dplyr/summarize to create data frame
> >> my.output = frm %>%
> > + group_by(Sex, Height) %>%
> > + summarize(V1sum=sum(V1), V2sum=sum(V2))
> >
> >> # work with columns in the output data frame
> >> ncol(my.output)
> > [1] 4
> >
> >> my.output[1]
> > Source: local data frame [6 x 1]
> > Groups: Sex
> >
> > Sex
> > 1 Female
> > 2 Female
> > 3 Female
> > 4 Male
> > 5 Male
> > 6 Male
> >
> >> my.output[4]
> > Error in eval(expr, envir, enclos) : index out of bounds ######## ERROR HERE
> >
> >> as.data.frame(my.output)[4]
> > V2sum
> > 1 12427.97
> > 2 8449.82
> > 3 8610.97
> > 4 7249.20
> > 5 12616.91
> > 6 10372.15
> >>
> >