[R] the difference between "-" and "!" between base and data.table package
David Winsemius
dwinsemius at comcast.net
Sun Apr 16 10:00:09 CEST 2017
> On Apr 15, 2017, at 5:18 PM, Carl Sutton via R-help <r-help at r-project.org> wrote:
>
> Hi
>
>
> I normally use package data.table but today was doing some base R coding. Had a problem for a bit which I finally resolved. I was attempting to separate a data frame between train and test sets, and in base R was using the "!" to exclude training set indices from the data frame. All I was getting was zero observations. Changed to using "-" and it worked. I recalled that in data.table the "!" function worked, so created this little bit of code.
>
> # Base R Functions
> str(mtcars)
> train_indices <- sample(nrow(mtcars), round(0.75*nrow(mtcars)))
> train <- mtcars[train_indices,]
> mode(train_indices); class(train_indices)
> test <- mtcars[!train_indices,] # the "!" function returning 0 observations
The arguments you are supplying:
> table( !train_indices )
FALSE
24
> test_1 <- mtcars[-train_indices,]
> identical(test, test_1)
>
> # Using data.table package
> library(data.table)
> dt1 <- data.table(mtcars)
> train_indices <- sample(nrow(dt1), round(0.75*nrow(dt1)))
> train <- dt1[train_indices,]
The data.table "[" function has very different syntax and evaluation rules than does the data.frame "[" function, but I guess you know that.
> mode(train_indices); class(train_indices)
> test <- dt1[!train_indices,] # the "!" function
> test_1 <- dt1[-train_indices,]
> identical(test, test_1)
> The documentation appears to me to accept "!" in base, so do I have some kind of ridiculous error or ..??
Not sure about "ridiculous" and you have not actually said what it was that _you_ were questioning.
If it is the lack of any return from `test <- mtcars[!train_indices,]`, then it could be argued that was a ridiculous expectation, at least according to the rules of vector evaluation in row selection that I thought I understood. Giving a vector of FALSE values to `[.data.frame` would not reasonably be expected to return anything. That giving a vector of only FALSEs to `[.data.table` actually gets something back does seem kind of unexpected to me, but clearly it didn't seem ridiculous to Matt Dowle. Clearly the recycling rules for `[.data.table` are different from those of `[.data.frame`. Data.tables don't use rownames.
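As a minimal base R sketch of that rule (the seed and the variable names are only for illustration): `!` coerces a numeric vector to logical before negating it, so every nonzero index becomes TRUE and its negation FALSE, whereas `-` keeps the values numeric and drops the listed rows.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
train_indices <- sample(nrow(mtcars), round(0.75 * nrow(mtcars)))

# "!" coerces the integers to logical first: nonzero -> TRUE, so the
# negation is FALSE for every element
neg <- !train_indices
any(neg)       # no TRUE values at all
length(neg)    # one value per sampled index, not one per row

# A logical row index that is all FALSE selects no rows
test_bang <- mtcars[!train_indices, ]
nrow(test_bang)

# Negative integers drop the listed rows instead
test_minus <- mtcars[-train_indices, ]
nrow(test_minus)

# A logical way to say "rows not in the training set"
test_in <- mtcars[!(seq_len(nrow(mtcars)) %in% train_indices), ]
identical(rownames(test_minus), rownames(test_in))
```

The last form negates a full-length logical vector, so it never relies on recycling.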
The results from:
> dt1[rep(FALSE,24), ]
Error in `[.data.table`(dt1, rep(FALSE, 24), ) :
i evaluates to a logical vector length 24 but there are 32 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
... are different from the result of
> dt1[!train_indices, ] # gets 8 rows
To me that doesn't make sense.
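A possible explanation, going by the description of `i` in ?data.table: a `!` written directly in front of `i` is treated as a "not-select" (or not-join) of whatever follows it, rather than being evaluated as a logical negation first. The sketch below assumes that reading and that the data.table package is installed. The seed and the name `neg` are only for illustration.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt1 <- data.table(mtcars)
  set.seed(42)  # arbitrary seed, only so the sketch is reproducible
  train_indices <- sample(nrow(dt1), round(0.75 * nrow(dt1)))

  # "!" directly in front of i: all rows except the listed ones
  test_bang <- dt1[!train_indices, ]
  # Negative integers: also all rows except the listed ones
  test_minus <- dt1[-train_indices, ]
  nrow(test_bang)
  nrow(test_minus)

  # Negating beforehand gives an ordinary logical vector of length 24,
  # which is then subject to the length check on logical i
  neg <- !train_indices
  res <- tryCatch(dt1[neg, ], error = function(e) conditionMessage(e))
  print(res)
}
```

Under that reading the two calls differ only in whether `[.data.table` sees the `!` in the unevaluated expression for `i` or receives an already-computed logical vector.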
I generally use %in% for row selection. But many people would also find this pair of results "ridiculous":
> mtcars[ which( train_indices %in% 50:100), ]
[1] mpg cyl disp hp drat wt qsec vs am gear carb
<0 rows> (or 0-length row.names)
> mtcars[ -which( train_indices %in% 50:100), ] # bad idea to use minus before which()
[1] mpg cyl disp hp drat wt qsec vs am gear carb
<0 rows> (or 0-length row.names)
Yes, I know that some people think the `which` is not needed. I'm not one of them.
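For anyone puzzled by that pair, a small sketch of what happens (the three index values are hypothetical, chosen so that none falls in 50:100): `which()` on an all-FALSE vector returns `integer(0)`, and the negative of `integer(0)` is still `integer(0)`, so "drop nothing" silently becomes "select nothing".

```r
train_indices <- c(3L, 7L, 12L)  # hypothetical indices, none in 50:100

hits <- which(train_indices %in% 50:100)
length(hits)             # no matches, so hits is integer(0)
identical(-hits, hits)   # negating an empty vector changes nothing

nrow(mtcars[hits, ])     # selects nothing
nrow(mtcars[-hits, ])    # also selects nothing, rather than everything

# Negating a full-length logical vector keeps every row when nothing matches
keep <- !(seq_len(nrow(mtcars)) %in% hits)
nrow(mtcars[keep, ])
```

The logical form is the safer way to express "everything except the matches", because it does not depend on the match vector being non-empty.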
--
David.
> Carl Sutton
David Winsemius
Alameda, CA, USA