May 15, 2018 @ eRum Budapest
… Since mid-1997 there has been a core group with write access to the R source, currently consisting of
Douglas Bates | John Chambers | Peter Dalgaard | Robert Gentleman | Kurt Hornik |
Ross Ihaka | Tomas Kalibera | Michael Lawrence | Friedrich Leisch | Uwe Ligges |
Thomas Lumley | Martin Maechler | Martin Morgan | Paul Murrell | Martyn Plummer |
Brian Ripley | Deepayan Sarkar | Duncan Temple Lang | Luke Tierney | Simon Urbanek |
plus Heiner Schwarte up to October 1999, Guido Masarotto up to June 2003, Stefano Iacus up to July 2014, Seth Falcon up to August 2015, and Duncan Murdoch up to September 2017.
R became "viral" thanks to CRAN's package system (Kurt Hornik and the CRAN team):
Instead of just "googling" and installing the package: Learn to "Use the source!" (a small effort, yes…): see Jenny Bryan's https://github.com/jennybc/access-r-source
Do read package sources, even before you write your first package. Hence, start to love files such as <mypkg>_<n.m>.tar.gz
May 7, 2018 on the "I'm Programmer" page: https://www.improgrammer.net/stackoverflow-keyboard/
From Wikipedia, the free encyclopedia
Use <- for assignment instead of =, which has many other meanings in R. (R is a functional language: assignments should stand out.)
Type [Alt]- to produce " <- " (4 chars, incl. 2 spaces). ([1]. Martin Mächler; Hadley Wickham in http://adv-r.had.co.nz/Style.html)
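A minimal sketch of one of the "other meanings" of `=`: inside a function call it names an argument, whereas `<-` performs an assignment in the calling frame. The function name `demo_assign` and the variable names are hypothetical and chosen only for illustration.

```r
# Hypothetical demo function; everything happens in its own frame,
# so the user's workspace is not touched.
demo_assign <- function() {
    m1 <- mean(x = 1:10)   # '=' matches the argument 'x' of mean(); no variable 'x' is created
    x_created <- exists("x", envir = environment(), inherits = FALSE)

    m2 <- mean(y <- 1:10)  # '<-' assigns 'y' in this frame, then passes its value on
    y_created <- exists("y", envir = environment(), inherits = FALSE)

    list(m1 = m1, m2 = m2, x_created = x_created, y_created = y_created)
}

res <- demo_assign()
str(res)
```

Both calls compute the same mean, but only the second one leaves a variable behind.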
Do comment your R code: You ("your future self") will be glad already in 1 month
Use spaces in your R code .. and be better than pretty printers (e.g. from R's source):
droplevels.data.frame <- function(x, except = NULL, exclude, ...)
{
    ix <- vapply(x, is.factor, NA)
    if (!is.null(except)) ix[except] <- FALSE
    x[ix] <- if(missing(exclude)) lapply(x[ix], droplevels)
             else lapply(x[ix], droplevels, exclude=exclude)
    x
}
John Chambers: In R,
lapply(X, FUN, ...)
do.call(<FUN>, args, ...)
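As a small illustration of the two call patterns above: functions are ordinary R objects, so they can be stored in a list, passed to `lapply()`, or invoked through `do.call()` with a list of arguments. The object names (`funs`, `vals`, and so on) are made up for this sketch.

```r
# Functions stored in a list like any other object
funs <- list(min = min, max = max, mean = mean)
vals <- c(3, 1, 4, 1, 5)

# lapply() applies FUN to each element of X; here each element is itself a function
summ <- lapply(funs, function(f) f(vals))
str(summ)

# do.call() constructs and evaluates a call from a function and a list of arguments
combined <- do.call(rbind, list(a = 1:3, b = 4:6))   # same as rbind(a = 1:3, b = 4:6)
pasted   <- do.call("paste", list("a", "b", sep = "-"))  # the function may be given by name
```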
Can use
if( <problem> ) stop(.....)
Nicer, as "assertion":
stopifnot( <must_be_1> , <must_be_2> , ...)
or alternatively, since R 3.5.0,
stopifnot(exprs = {
    <must_be_1>
    <must_be_2>
    ...
})
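A short sketch of assertions in practice. The helper `safe_log_ratio()` is hypothetical and exists only to show `stopifnot()` guarding a function's inputs; the `exprs = { }` form at the end assumes R 3.5.0 or later.

```r
# Hypothetical helper, for illustration only
safe_log_ratio <- function(a, b) {
    stopifnot(is.numeric(a), is.numeric(b),
              length(a) == length(b),
              all(a > 0), all(b > 0))
    log(a / b)
}

ok <- safe_log_ratio(c(1, 2), c(2, 4))

# A failed assertion signals an error whose message names the offending expression
msg <- tryCatch(safe_log_ratio(c(1, -2), c(2, 4)),
                error = function(e) conditionMessage(e))
msg

# Since R 3.5.0: several expressions, one per line
stopifnot(exprs = {
    is.numeric(pi)
    all.equal(pi, 3.14159265, tolerance = 1e-7)
})
```

Compared with `if (<problem>) stop(.....)`, the conditions read as statements of what must hold rather than what went wrong.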
(and yes, I know there are about a dozen packages for testing …)
…. (building R packages that are sustainable — another lecture) …
At Stanford in 2016, after useR!, Gabe Becker talked about his ideas on the potential of alternative object representations, and (R Core member) Luke Tierney agreed to form a working group, as some of this had been floating around for quite some time.
Provisional documentation: https://svn.r-project.org/R/branches/ALTREP/ALTREP.html (Orig: Nov. 2016; with updates up to Nov. 13, 2017)
Another talk by Luke (at Ross Ihaka's farewell conference, Dec. 2017): http://homepage.stat.uiowa.edu/~luke/talks/nzsa-2017.pdf
Adapted from https://svn.r-project.org/R/branches/ALTREP/ALTREP.html#sample_implementations :
n1:n2, seq_along(n) and seq_len(m): represented compactly in terms of their start and end values.
In R 3.4.4 (and older):
## > x <- 1:1e10
## Error: cannot allocate vector of size 74.5 Gb
whereas in R 3.5.0++ and the ALTREP branch it works (very fast):
x <- 1:1e10
In R 3.5.0++ (and the ALTREP branch), the following is instantaneous (a few ms), whereas in R 3.4.4 it needs about 1.5 sec:
system.time(for (i in 1:1e9) break)
##    user  system elapsed
##   0.013   0.000   0.013
The .Internal(inspect()) function shows that a compact representation has been used:
.Internal(inspect(x))
## @8805628 14 REALSXP g0c0 [MARK,NAM(3)] 1 : 10000000000 (compact)
sum(x) is very smart (using the alleged Gauss-as-first-grader formula \(n(n+1)/2\)):
system.time(print(sum(x)))
## [1] 5e+19
##    user  system elapsed
##       0       0       0
n <- length(x)
sum(x) == n*(n+1)/2
## [1] TRUE
and mean(x) has been adapted to extract (in blocks) and not expand x (but is still much too slow here).
The following R code is not even imaginable in regular R before R 3.5.0:
x <- 1:1e15
object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really
## 8000000000000048 bytes
is.unsorted(x) # FALSE : i.e., R *knows* it is sorted
## [1] FALSE
xs <- sort(x) # instantaneous !
.Internal(inspect(xs))
## @6dce478 14 REALSXP g0c0 [NAM(3)] 1 : 1000000000000000 (compact)
anyNA(x) # FALSE : i.e., R *knows* it contains no NA's
## [1] FALSE
## (This does *NOT* fit on the slide:
try( l4 <- x < 4 ) # no way [in principle *could* be compacted]
x <- 10:1e15
.Internal(inspect(x))
## @7a4ae88 14 REALSXP g0c0 [NAM(3)] 10 : 1000000000000000 (compact)
as.character(1 : nrow)
In ALTREP, the C-internal coerce() function returns a deferred string coercion. Examples:
i8 <- 1:1e8
ce8 <- as.character(i8)
.Internal(inspect(ce8))
## @6d17440 16 STRSXP g0c0 [NAM(3)] <deferred string conversion>
##   @6e202a8 13 INTSXP g0c0 [NAM(3)] 1 : 100000000 (compact)
system.time(print(c35 <- ce8[1e7 + 3:5])) # very fast
## [1] "10000003" "10000004" "10000005"
##    user  system elapsed
##       0       0       0
.Internal(inspect(ce8)) # unchanged
## @6d17440 16 STRSXP g1c0 [MARK,NAM(3)] <deferred string conversion>
##   @6e202a8 13 INTSXP g1c0 [MARK,NAM(3)] 1 : 100000000 (compact)
head(ce8)
## [1] "1" "2" "3" "4" "5" "6"
.Internal(inspect(ce8)) # unchanged
## @6d17440 16 STRSXP g1c0 [MARK,NAM(3)] <deferred string conversion>
##   @6e202a8 13 INTSXP g1c0 [MARK,NAM(3)] 1 : 100000000 (compact)
.Internal(inspect(ce6 <- as.character(1 : 1e6))) # shorter version
## @6f9b938 16 STRSXP g0c0 [NAM(1)] <deferred string conversion>
##   @6f9b890 13 INTSXP g0c0 [NAM(3)] 1 : 1000000 (compact)
system.time( ce6[1] <- "a" ) # 350 ms
##    user  system elapsed
##   0.320   0.016   0.337
.Internal(inspect(ce6)) # now <expanded string conversion>
## @7f306b7fc010 16 STRSXP g0c7 [NAM(1)] (len=1000000, tl=0)
##   @1f809a0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
##   @21b8d50 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "2"
##   @1f6fc68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "3"
##   @3d61c28 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "4"
##   @212ce68 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "5"
##   ...
lm()
Benefit: significant speedup in lm() for fitting a model with large \(n\) (R 3.5.0++):
n <- 1e7 ; x <- rnorm(n) ; y <- rnorm(n)
system.time(lm(y ~ x))
##    user  system elapsed
##   1.860   0.507   2.383
system.time(lm(y ~ x))
##    user  system elapsed
##   1.351   0.460   1.820
The elapsed timings were 11.3 and 8.8 in R 3.4.4 (on the same laptop).
Speedup: due entirely to not creating the row labels for the design matrix.
Yes, there are faster ways: .lm.fit(cbind(1, x), y) is \(3 \times\) faster.
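A minimal sketch showing that the bare-bones `.lm.fit()` gives the same coefficients as `lm()`. The sample size is kept small so it runs instantly, and the simulated relationship between `x` and `y` (intercept 1, slope 2) is an assumption made here for illustration, not the data from the timings above.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)   # assumed toy model, for illustration only

fit_full <- lm(y ~ x)                  # formula interface: model frame, labels, etc.
fit_bare <- .lm.fit(cbind(1, x), y)    # design matrix built by hand, intercept column first

# Compare coefficients, ignoring names
same_coef <- all.equal(unname(coef(fit_full)),
                       as.vector(fit_bare$coefficients))
same_coef
```

The price of the speed is convenience: `.lm.fit()` returns a plain list, so methods such as `summary()` or `predict()` for "lm" objects do not apply to its result.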