Dec 5, 2018 @ Zurich R Meetup

R and Me: How I got involved

  • History of R has been well described on Wikipedia
  • notably its ref. [19] :
  • I became involved in 1993, notably in this crucial step:
  • 1995: R became Free Software (licensed under the GPL; FOSS := Free and Open-Source Software),
  • … have been part of the R Core team since its formation in 1997.
  • Nice short video What’s R? (July 2013): https://bit.ly/whats_R

R is made by R Core Team


https://www.r-project.org/contributors.html

… Since mid-1997 there has been a core group with write access to the R source, currently consisting of

Douglas Bates John Chambers Peter Dalgaard Robert Gentleman Kurt Hornik
Ross Ihaka Tomas Kalibera Michael Lawrence Friedrich Leisch Uwe Ligges
Thomas Lumley Martin Maechler Martin Morgan Paul Murrell Martyn Plummer
Brian Ripley Deepayan Sarkar Duncan Temple Lang Luke Tierney Simon Urbanek

plus Heiner Schwarte up to October 1999, Guido Masarotto up to June 2003, Stefano Iacus up to July 2014, Seth Falcon up to August 2015, and Duncan Murdoch up to September 2017.

R == “base” + packages

R became “viral” thanks to CRAN’s package system (Kurt Hornik and the CRAN team).

R Programming

1. The ART of R Programming

Learn from the Masters

Obi-Wan Kenobi to Luke Skywalker: “Use the Force, Luke!” (‘Star Wars’)

Use the Source!

Read R Source code (to learn from the Masters)

  • Download the R source, and package sources instead of just installing them.
  • CRAN (and Bioconductor) should be your primary source,
  • everything else (GitHub, …) only if you know that the authors and the website are trustworthy
  • Do read package sources, even before you write your first package. Hence, start to love files such as <mypkg>_<n.m>.tar.gz
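Fetching a source tarball instead of installing can be sketched as follows. This is a minimal sketch, not a recommended workflow: the helper name fetch_source() and the choice of the CRAN mirror URL are assumptions made for illustration; download.packages() and untar() are the standard functions from the utils package.

```r
## Hypothetical helper: download <pkg>_<version>.tar.gz from CRAN and unpack it
fetch_source <- function(pkg, destdir = tempdir(),
                         repos = "https://cloud.r-project.org") {
    got <- utils::download.packages(pkg, destdir = destdir,
                                    repos = repos, type = "source")
    if (nrow(got) < 1L)
        stop("no source package found for ", pkg)
    tarball <- got[1, 2]                  # path of the downloaded .tar.gz
    utils::untar(tarball, exdir = destdir)
    file.path(destdir, pkg)               # directory with R/, man/, src/, ...
}

## Not run here (needs network access); the package name is only an example:
## src <- fetch_source("sfsmisc")
## list.files(file.path(src, "R"))
```

The R/ subdirectory of the unpacked tree holds the code as the authors wrote it, including comments that are lost in an installed package.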


2. R Programming with STYLE

The Stack Overflow Keyboard

DRY, not WET !

From Wikipedia, the free encyclopedia

  • DRY: Don’t Repeat Yourself
  • DRY principle := “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system”.
  • From the book The Pragmatic Programmer by Andy Hunt and Dave Thomas: when the DRY principle is applied successfully, a modification of any single element of a system does not require a change in other logically unrelated elements. Additionally, elements that are logically related all change predictably and uniformly, and are thus kept in sync.
  • Application: do not program by copy-paste (i.e., do not use the SO keyboard)

DRY vs WET solutions

  • Violations of DRY : typically referred to as WET solutions
  • “WET” stands for either
    1. “write everything twice”,
    2. “we enjoy typing” , or
    3. “waste everyone’s time”.
  • Copy-Paste Programming is bad
  • Question: do useRs and progRammers prefer WET solutions ??
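The contrast between a WET and a DRY solution can be sketched in a few lines of R. The data frame d and the helper rescale01() are made up for this illustration; they are not from the talk.

```r
d <- data.frame(a = c(1, 5, 9), b = c(10, 20, 40), c = c(-1, 0, 3))

## WET: the same rescaling is copy-pasted for every column
wet <- d
wet$a <- (d$a - min(d$a)) / (max(d$a) - min(d$a))
wet$b <- (d$b - min(d$b)) / (max(d$b) - min(d$b))
wet$c <- (d$c - min(d$c)) / (max(d$c) - min(d$c))

## DRY: the rescaling rule is written once, then applied to all columns
rescale01 <- function(x) {
    r <- range(x)
    (x - r[1]) / (r[2] - r[1])
}
dry <- d
dry[] <- lapply(d, rescale01)

stopifnot(all.equal(wet, dry))  # same result, one place to change the rule
```

In the WET version a typo in one of the three lines (say, d$a left over in the line for b) goes unnoticed easily; in the DRY version there is only one formula to get right.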

Few remarks about style

  • Please [1] use the “left arrow” for assignment: the nice and expressive <- instead of =, which has many other meanings in R. (R is a functional language: assignments should stand out.)
    • RStudio and ESS both have the shortcut [Alt]- to produce ␣<-␣ (4 chars, incl. 2 spaces)

[1] Martin Mächler; Hadley Wickham in http://adv-r.had.co.nz/Style.html

  • Do comment your R code: you (“your future self”) will be glad you did, as early as one month from now

  • Use spaces in your R code, and do better than pretty printers (e.g., from R’s source):

droplevels.data.frame <- function(x, except = NULL, exclude, ...)
  {
    ix <- vapply(x, is.factor, NA)
    if (!is.null(except)) ix[except] <- FALSE
    x[ix] <- if(missing(exclude))
                  lapply(x[ix], droplevels)
             else lapply(x[ix], droplevels, exclude=exclude)
    x
  }
  • Apart from data frames, there are also matrices; they are sometimes much more efficient
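One case behind the matrix remark can be tried out as follows. This is only a sketch on made-up random data, and the sizes are arbitrary: both objects hold the same numbers, but extracting rows in a loop is typically far cheaper for the matrix. Timings depend on the machine, so none are quoted here.

```r
set.seed(1)
m  <- matrix(rnorm(2e5), nrow = 2000)  # 2000 x 100 numeric matrix
df <- as.data.frame(m)                 # same numbers as a data frame

## Same content, same row sums
stopifnot(all.equal(rowSums(m), unname(rowSums(df))))

## Extract every row once from each representation and time it
t_m  <- system.time(for (i in seq_len(nrow(m)))  m[i, ])
t_df <- system.time(for (i in seq_len(nrow(df))) df[i, ])
rbind(matrix = t_m, data.frame = t_df)[, "elapsed"]
```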


3. Functions, Testing, Packages

Everything that happens in R …

John Chambers: In R,

  • Everything that exists is an object.
  • Everything that happens is a function call.
  • You call functions all the time
  • ==> do write your own functions all the time, e.g., inside other functions
  • Do learn about
lapply(X, FUN, ...)
do.call(<FUN>, args, ...)
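A small sketch of these points, using made-up data: an operator is itself a function, lapply() applies a one-line function to each element, and do.call() calls a function with its arguments taken from a list.

```r
## Operators are functions too: these two calls are the same
stopifnot(identical(1 + 2, `+`(1, 2)))

## lapply(): apply a (one-line) function to each element; returns a list
squares <- lapply(1:3, function(i) data.frame(id = i, sq = i^2))

## do.call(): here equivalent to rbind(squares[[1]], squares[[2]], squares[[3]])
tab <- do.call(rbind, squares)
tab
```

The do.call() form keeps working unchanged when the list has 3000 elements instead of 3, which is the point of building the argument list programmatically.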

Test (in) your functions

  • Can use
 if( <problem> )  stop(.....)
  • Nicer, as “assertion”:
    stopifnot( <must_be_1> , <must_be_2> ,  ...)
  • or alternatively, since R 3.5.0 (April 2018),
    feature added by the author of stopifnot() [= MMä @ Zürich] :
stopifnot(exprs = {
    <must_be_1>
    <must_be_2>
     ...
})

(and yes, I know there are about a dozen packages for testing …)
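Both forms of stopifnot() can be sketched inside small functions. The functions geo_mean() and check_xy() are hypothetical examples, not from the talk; the exprs = form needs R 3.5.0 or later, as stated above.

```r
## Assertions on the argument, classic form
geo_mean <- function(x) {
    stopifnot(is.numeric(x), length(x) >= 1, all(x > 0))
    exp(mean(log(x)))
}

## Same idea with the 'exprs = { ... }' form (R >= 3.5.0)
check_xy <- function(x, y) {
    stopifnot(exprs = {
        is.numeric(x)
        is.numeric(y)
        length(x) == length(y)
    })
    invisible(TRUE)
}

geo_mean(c(1, 4, 16))   # geometric mean of 1, 4, 16
check_xy(1:3, c(2, 4, 6))

## A violated assertion signals an error naming the failed expression
res <- tryCatch(geo_mean(c(1, -2)), error = function(e) conditionMessage(e))
res
```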

Put your functions in a package

  • and test them (even more …) thoroughly
  • write nice help pages including examples

  • write vignettes (using R Markdown or \(\LaTeX\)-based Sweave/knitr) …

… (building R packages that are sustainable: another lecture) …
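As a first step towards a package, utils::package.skeleton() can generate the directory layout from existing functions. The sketch below is only a starting point under stated assumptions: the package name "mypkg" and the function rescale01() are hypothetical, and everything is written to a temporary directory.

```r
## Hypothetical one-function package, created in a temporary directory
env <- new.env()
env$rescale01 <- function(x) {
    r <- range(x)
    (x - r[1]) / (r[2] - r[1])
}

pkg_dir <- file.path(tempdir(), "mypkg")
utils::package.skeleton(name = "mypkg", list = "rescale01",
                        environment = env, path = tempdir(), force = TRUE)

list.files(pkg_dir)  # DESCRIPTION, NAMESPACE, R/, man/, ...
```

The generated help page stubs in man/ still have to be filled in by hand, and R scripts placed in a tests/ subdirectory (for instance ones built on stopifnot()) are run by R CMD check.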

4. New in R 3.5.0

and more in future R : ALTREP

NEWS - from new releases of R :

ALTREP

ALTernative object REPresentations

ALTREP – as already in R 3.5.0

Adapted from https://svn.r-project.org/R/branches/ALTREP/ALTREP.html#sample_implementations :

Compact Integer Vectors

  • Vectors n1:n2 , seq_along(n) and seq_len(m): represented compactly in terms of their start and end values.

In R 3.4.4 (and older):

## > x <- 1:1e10
## Error: cannot allocate vector of size 74.5 Gb

whereas in R 3.5.0++ and the ALTREP branch it works (very fast):

x <- 1:1e10

Compact Integer Vectors (2)

  • Implication: a for loop over a large range of integers that stops early no longer needs to allocate and fill in the full vector. In R 3.5.0++ (and the ALTREP branch), this is instantaneous (a few ms), whereas in R 3.4.4 it takes about 1.5 sec:
system.time(for (i in 1:1e9) break)
##    user  system elapsed 
##   0.009   0.000   0.010

The .Internal(inspect()) function shows that a compact representation has been used:

.Internal(inspect(x))
## @8568c60 14 REALSXP g0c0 [MARK,NAM(3)]  1 : 10000000000 (compact)

Compact Integer Vectors (3)

sum(x) is very smart (using the formula \(n(n+1)/2\) that Gauss allegedly found as a first-grader):

system.time(print(sum(x)))
## [1] 5e+19
##    user  system elapsed 
##       0       0       0
n <- length(x)
sum(x) == n*(n+1)/2
## [1] TRUE

and mean(x) has been adapted to extract values in blocks rather than expand x (but it is still much too slow here)

Compact Integer Vectors (4)

The following R code is not even imaginable in regular R before R 3.5.0:

x <- 1:1e15
object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really
## 8000000000000048 bytes
is.unsorted(x) # FALSE : i.e., R *knows* it is sorted
## [1] FALSE
xs <- sort(x)  # instantaneous !
.Internal(inspect(xs))
## @6992550 14 REALSXP g0c0 [NAM(3)]  1 : 1000000000000000 (compact)
anyNA(x)       # FALSE : i.e., R *knows* it contains no NAs
## [1] FALSE
## (This does *NOT* fit on the slide:
try( l4 <- x < 4 ) # no way [in principle *could* be compacted]
x <- 10:1e15
.Internal(inspect(x))
## @7906580 14 REALSXP g0c0 [NAM(3)]  10 : 1000000000000000 (compact)

Deferred String Conversions

  • Conversion of numbers to strings is expensive
  • It happens with the (rarely needed!) default row labels on design matrices, which are created as
as.character(1 : nrow)

In ALTREP, the C-internal coerce() function returns a deferred string coercion. Examples:

i8 <- 1:1e8
ce8 <- as.character(i8)
.Internal(inspect(ce8))
## @687ee60 16 STRSXP g0c0 [NAM(3)]   <deferred string conversion>
##   @69e1988 13 INTSXP g0c0 [NAM(3)]  1 : 100000000 (compact)

Deferred String Conversions -2-

system.time(print(c35 <- ce8[1e7 + 3:5])) # very fast
## [1] "10000003" "10000004" "10000005"
##    user  system elapsed 
##       0       0       0
.Internal(inspect(ce8)) # unchanged
## @687ee60 16 STRSXP g1c0 [MARK,NAM(3)]   <deferred string conversion>
##   @69e1988 13 INTSXP g1c0 [MARK,NAM(3)]  1 : 100000000 (compact)
head(ce8)
## [1] "1" "2" "3" "4" "5" "6"
.Internal(inspect(ce8)) # unchanged
## @687ee60 16 STRSXP g1c0 [MARK,NAM(3)]   <deferred string conversion>
##   @69e1988 13 INTSXP g1c0 [MARK,NAM(3)]  1 : 100000000 (compact)

Deferred String Conversions -3-

.Internal(inspect(ce6 <- as.character(1 : 1e6))) # shorter version
## @6b426e8 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
##   @6b42640 13 INTSXP g0c0 [NAM(3)]  1 : 1000000 (compact)
system.time( ce6[1] <- "a" ) # 350 ms
##    user  system elapsed 
##   0.259   0.006   0.265
.Internal(inspect(ce6)) # now <expanded string conversion>
## @7f64160fd010 16 STRSXP g0c7 [NAM(1)] (len=1000000, tl=0)
##   @19f05a0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
##   @1c2d4f0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "2"
##   @19df868 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "3"
##   @37fffd0 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "4"
##   @1ba1488 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "5"
##   ...

Speedup for lm()

Benefit: significant speedup in lm() for fitting a model with large \(n\) (R 3.5.0++):

n <- 1e7 ;  x <- rnorm(n) ;  y <- rnorm(n)
system.time(lm(y ~ x))
##    user  system elapsed 
##   1.920   0.517   2.449
system.time(lm(y ~ x))
##    user  system elapsed 
##   1.428   0.506   1.942

The corresponding elapsed timings were 11.3 and 8.8 seconds in R 3.4.4 (on the same laptop).

Speedup: due entirely to not creating the row labels for the design matrix. Yes, there are faster ways: .lm.fit(cbind(1, x), y) is \(3 \times\) faster.
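That the bare-bones .lm.fit() gives the same coefficients as lm() can be checked with a short sketch. The smaller n and the seed are choices made here so that it runs quickly; no timings are claimed.

```r
set.seed(42)
n <- 1e5
x <- rnorm(n) ; y <- rnorm(n)

fit_lm  <- lm(y ~ x)                 # full model object: formula, model frame, ...
fit_raw <- .lm.fit(cbind(1, x), y)   # only the QR-based least squares fit

## Intercept and slope agree (lm() names them, .lm.fit() does not)
stopifnot(all.equal(unname(coef(fit_lm)), as.vector(fit_raw$coefficients)))

system.time(lm(y ~ x))
system.time(.lm.fit(cbind(1, x), y))
```

The price of the speed is convenience: the result of .lm.fit() is a plain list, so summary(), predict() and friends are not available for it.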

More ALTREP ideas & development – Summary

Summary:

  • Martin M. \(\in\) R Core — R Core makes “base R”
  • The ART of R Programming : Use the Source, Luke!
  • R Programming with STYLE : Aim for DRY, not WET!
  • Functions, Testing, Packages : Write more functions, incl. 1-liners
  • New in R 3.5.0 (w/ more impact in the future): ALTernative REPresentations

     

  • That’s all Folks!
  • Questions, Remarks ?