[R] Odd behaviour of mean() with a numeric column in a tibble

Chris Evans chrishold at psyctc.org
Sat Dec 10 22:57:57 CET 2016


Thanks to both Jeff and Ista for your inputs some days back.  I confess I was _indeed_ too tired to be thinking well and laterally, and even to be copying things into Emails successfully.

I have since had more sleep (!) and I have read ?`[[`, gone back to the pertinent parts of "Introduction to R" and generally pondered all this.  I confess I had always avoided [[ and only ever used it for lists that were not data frames.  I can now see just how badly I was misguessing its behaviour: apologies, I should have realised that I needed to go right back to basics.

I _can_ see that there are things in the behaviour of data frames that are not that obvious but I had become very used to them.  I can see value in converting to using tibbles instead of data frames and may try to do that.

However, I think the documentation for tibble would be improved for people like myself if it started with something that made it even clearer that tibbles are lists, just as data frames are, but that whereas a data frame has a single class(df) of "data.frame", class(tibble) is:
c("tbl_df","tbl","data.frame").
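A minimal sketch of that class difference, assuming the tibble package is installed (the requireNamespace() guard simply skips the tibble part if it is not). The object names df and tb are only illustrative:

```r
# A plain data frame: a list with the single class "data.frame"
df <- data.frame(ID = letters, num = 1:26)
class(df)     # "data.frame"
is.list(df)   # TRUE

# The same data as a tibble (only run if tibble is available)
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::tibble(ID = letters, num = 1:26)
  print(class(tb))              # "tbl_df" "tbl" "data.frame"
  print(is.list(tb))            # TRUE: still a list underneath
  print(is.data.frame(tb))      # TRUE: so this check cannot tell the two apart
  print(inherits(tb, "tbl_df")) # TRUE only for tibbles
}
```

Here inherits(x, "tbl_df") is used as a check that does not depend on which tibble version is installed.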

I can now see that what I get from ?tibble, i.e. "tibble is a trimmed down version of data.frame" is probably technically true though I'd describe it as a rationalised or even a beefed up version of data.frame.  I can also now see that what I find in https://cran.r-project.org/web/packages/tibble/tibble.pdf:

"[ Never simplifies (drops), so always returns data.frame"

is true, but only in the sense that any tibble is still a data.frame: "data.frame" is the third of the tibble's classes, whereas it would be the first and only class of a pure data.frame.  I can also see now that that is not really inconsistent with what I get in https://github.com/tidyverse/tibble:

"Tibbles also clearly delineate [ and [[: [ always returns another tibble, [[ always returns a vector. No more drop = FALSE!"

However, I think it would be better if the tibble.pdf document said:

"[ Never simplifies (drops), so always returns tibble" even though "[ Never simplifies (drops), so always returns data.frame" is technically true, up to and including the result passing an is.data.frame() check.

Finally, I think I can see that if I want the various functions I have written to keep working, functions that worked fine on data frames but depended on indexing or subsetting those data frames using [,i] or sometimes [,i:j] to select vectors or matrices, then I will have to modify them so they test whether the input is a simple data frame or a data frame that is also a tibble.  I guess that I could have trapped things had my functions (where appropriate) had an is.numeric() input check ... and that I have to use an is.tibble() check, not an is.data.frame() check, to distinguish the two!
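One way to avoid testing for the input type at all might be to extract with [[ and an explicit as.matrix(), since both behave the same way for data frames and tibbles. This is only a sketch: col_as_vector() and cols_as_matrix() are hypothetical helper names of my own, not functions from base R or tibble, and the tibble part only runs if that package is installed.

```r
# Hypothetical helper: return column j as a plain vector.
# `[[` with a single index extracts the column for data frames and tibbles alike.
col_as_vector <- function(x, j) {
  stopifnot(is.data.frame(x), length(j) == 1L)
  x[[j]]
}

# Hypothetical helper: return columns j as a matrix.
# drop = FALSE keeps a data frame (or tibble) even when j has length 1,
# and as.matrix() then converts either one.
cols_as_matrix <- function(x, j) {
  stopifnot(is.data.frame(x))
  as.matrix(x[, j, drop = FALSE])
}

df <- data.frame(ID = letters, num = 1:26, num2 = (1:26) / 10)
mean(col_as_vector(df, 2))     # 13.5
dim(cols_as_matrix(df, 2:3))   # 26 2

if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(mean(col_as_vector(tb, 2)))     # 13.5, same as for the data frame
  print(dim(cols_as_matrix(tb, 2:3)))   # 26 2
}
```

An is.numeric() check on the extracted vector inside such a function would also have trapped the original mean(tmpTibble[,2]) problem early.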

Ah well, even after years of part-time use of R, I guess it's been good for my soul and my deeper and wider understanding of R to go right back to the basics.

Thanks again to you both.  I am posting here to convey thanks and in case this is useful to anyone like myself who benefits from a bit more narrative than is usually offered by R definitions and help entries.

Chris


----- Original Message -----
> From: "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us>
> To: "Chris Evans" <chrishold at psyctc.org>, "r-helpr-project.org" <r-help at r-project.org>
> Sent: Tuesday, 6 December, 2016 23:23:28
> Subject: Re: [R] Odd behaviour of mean() with a numeric column in a tibble

> You really need sleep. Then you need to read
> 
> ?`[[`
> 
> and in particular read about the second argument to the `[[` function, since you
> don't seem to understand what it is for. Maybe reread the Introduction to R
> document that comes with R.
> 
> The simplest solution is to treat `[[` as supporting one index and `[` as
> supporting either one or two.
> 
> As for expecting any form of row indexing of data frames or tibbles to return a
> vector, that is hopeless because each column can have a different type.  dta[
> 1, ] returns exactly what it has to return to avoid losing fidelity. If you
> really need row indexing to return a vector you should be using a matrix.
> --
> Sent from my phone. Please excuse my brevity.
> 
> On December 6, 2016 2:10:15 PM PST, Chris Evans <chrishold at psyctc.org> wrote:
>>{{SIGH}}
>>
>>You are absolutely right.
>>
>>I wonder if I am losing some cognitive capacities that are needed to be
>>part of the evolving R community. It seems to me that if a tibble is
>>designed to be an enhanced replacement for a dataframe then it
>>shouldn't quite so radically change things.
>>
>>I notice that the documentation on tibble says "[ Never simplifies
>>(drops), so always returns data.frame"
>>That is much less explicit than I would have liked and actually doesn't
>>seem to be true. In fact, as you rightly say, it generally, but not
>>quite always, returns a tibble. In fact it can be fooled into a vector
>>of length 1.
>>
>>> tmpTibble[[1,]]
>>Error in `[[.data.frame`(tmpTibble, 1, ) :
>>argument "..2" is missing, with no default
>>
>>> tmpTibble[1]
>># A tibble: 26 × 1
>>ID
>><chr>
>>1 a
>>2 b
>>3 c
>>4 d
>>5 e
>>6 f
>>7 g
>>8 h
>>9 i
>>10 j
>># ... with 16 more rows
>>> tmpTibble[,1]
>># A tibble: 26 × 1
>>ID
>><chr>
>>1 a
>>2 b
>>3 c
>>4 d
>>5 e
>>6 f
>>7 g
>>8 h
>>9 i
>>10 j
>># ... with 16 more rows
>>> tmpTibble[1,]
>>Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a",
>>:
>>replacement element 3 is a matrix/data frame of 26 rows, need 1
>>In addition: Warning messages:
>>1: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
>>replacement element 1 has 26 rows to replace 1 rows
>>2: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
>>replacement element 2 has 26 rows to replace 1 rows
>>> tmpTibble[1,1:26]
>>Error: Invalid column indexes: 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
>>15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
>>> tmpTibble[[1,2]]
>>[1] 1
>>> str(tmpTibble[[1,2]])
>>int 1
>>> str(tmpTibble[[1:2,2]])
>>Error in col[[i, exact = exact]] :
>>attempt to select more than one element in vectorIndex
>>> 
>>> tmpTibble[[1,1:2]]
>>[1] "b"
>>> 
>>
>>So [[a,b]] works if a and b are legal with the dimensions of the tibble
>>and if a is of length 1 but returns NOT a tibble but a vector of length
>>1 (I think), I can see that's logical but not what it says in the
>>documentation.
>>
>>[[a]] and [[,a]] return the same result, that seems excessively
>>tolerant to me.
>>
>>[[a,b:c]] actually returns [[a,c]] and again as a single value, NOT a
>>tibble.
>>
>>And row subsetting/indexing has gone.
>>
>>Why create replacement for a dataframe that has no row indexing and so
>>radically redefines column indexing, in fact redefines the whole of
>>indexing and subsetting?
>>
>>OK. I will go to sleep now and hope to feel less dumb(ed) when I wake.
>>Perhaps Prof. Wickham or someone can spell out a bit less tersely, and
>>I think incompletely, than the tibble documentation does, why all this
>>is good.
>>
>>Thanks anyway Ista, you certainly hit the issue!
>>
>>Very best all,
>>
>>Chris
>>
>>> From: "Ista Zahn" <istazahn at gmail.com>
>>> To: "Chris Evans" <chrishold at psyctc.org>
>>> Cc: "r-helpr-project.org" <r-help at r-project.org>
>>> Sent: Tuesday, 6 December, 2016 21:40:41
>>> Subject: Re: [R] Odd behaviour of mean() with a numeric column in a tibble
>>
>>> Not at a computer to check right now, but I believe single bracket indexing a
>>> tibble always returns a tibble. To extract a vector use [[
>>
>>> On Dec 6, 2016 4:28 PM, "Chris Evans" < chrishold at psyctc.org > wrote:
>>
>>> > I hope I am obeying the list rules here. I am using a raw R IDE for this and
>>> > running 3.3.2 (2016-10-31) on x86_64-w64-mingw32/x64 (64-bit)
>>
>>> > Here is a reproducible example. Code only first
>>
>>> > require(tibble)
>>> > tmpTibble <- tibble(ID=letters,num=1:26)
>>> > min(tmpTibble[,2]) # fine
>>> > max(tmpTibble[,2]) # fine
>>> > median(tmpTibble[,2]) # not fine
>>> > mean(tmpTibble[,2]) # not fine
>>
>>> I think you want
>>
>>> mean(tmpTibble[[2]])
>>
>>> > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
>>> > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
>>> > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
>>> > newMedianFun(tmpTibble[,2]) # ditto
>>> > str(tmpTibble[,2])
>>
>>> > ### then I tried this to make sure it wasn't about having fed in integers
>>
>>> > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
>>> > tmpTibble2
>>> > mean(tmpTibble2[,3]) # not fine, not about integers!
>>
>>
>>> > ### before I just created tmpTibble2 I found myself trying to add a column to tmpTibble
>>> > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
>>> > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
>>> > ### and oddly enough ...
>>> > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
>>
>>> > Now here it is with the output:
>>
>>> > > require(tibble)
>>> > Loading required package: tibble
>>> > > tmpTibble <- tibble(ID=letters,num=1:26)
>>> > > min(tmpTibble[,2]) # fine
>>> > [1] 1
>>> > > max(tmpTibble[,2]) # fine
>>> > [1] 26
>>> > > median(tmpTibble[,2]) # not fine
>>> > Error in median.default(tmpTibble[, 2]) : need numeric data
>>> > > mean(tmpTibble[,2]) # not fine
>>> > [1] NA
>>> > Warning message:
>>> > In mean.default(tmpTibble[, 2]) :
>>> > argument is not numeric or logical: returning NA
>>> > > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
>>> > > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
>>> > [1] 13.5
>>> > > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
>>> > > newMedianFun(tmpTibble[,2]) # ditto
>>> > [1] 13.5
>>> > > str(tmpTibble[,2])
>>> > Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 26 obs. of 1 variable:
>>> > $ num: int 1 2 3 4 5 6 7 8 9 10 ...
>>
>>> > > ### then I tried this to make sure it wasn't about having fed in integers
>>
>>> > > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
>>> > > tmpTibble2
>>> > # A tibble: 26 × 3
>>> > ID num num2
>>> > <chr> <int> <dbl>
>>> > 1 a 1 0.1
>>> > 2 b 2 0.2
>>> > 3 c 3 0.3
>>> > 4 d 4 0.4
>>> > 5 e 5 0.5
>>> > 6 f 6 0.6
>>> > 7 g 7 0.7
>>> > 8 h 8 0.8
>>> > 9 i 9 0.9
>>> > 10 j 10 1.0
>>> > # ... with 16 more rows
>>> > > mean(tmpTibble2[,3]) # not fine, not about integers!
>>> > [1] NA
>>> > Warning message:
>>> > In mean.default(tmpTibble2[, 3]) :
>>> > argument is not numeric or logical: returning NA
>>
>>
>>> > > ### before I just created tmpTibble2 I found myself trying to add a column to tmpTibble
>>> > > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
>>> > > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
>>> > > ### and oddly enough ...
>>> > > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
>>> > Error: Each variable must be a 1d atomic vector or list.
>>> > Problem variables: 'newNum'
>>
>>
>>
>>> > I discovered this when I hit odd behaviour after using read_spss() from the
>>> > haven package for the first time as it seemed to be offering a step forward
>>> > over good old read.spss() from the excellent foreign package. I am reporting it
>>> > here not directly to Prof. Wickham as the issues seem rather general though I'm
>>> > guessing that it needs to be fixed with a fix to tibble. Or perhaps I've
>>> > completely missed something.
>>
>>> > TIA,
>>
>>> > Chris
>>
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>


