May 15, 2018 @ eRum Budapest

R and Me: How I got involved

  • History of R has been well described on Wikipedia
  • notably its ref. [19] :
  • I became involved in 1993;
  • 1995: R became Free Software (licensed under the GPL; FOSS := Free and Open-Source Software),
  • … have been part of the R Core team since its formation in 1997.

R is made by R Core Team

https://www.r-project.org/contributors.html

… Since mid-1997 there has been a core group with write access to the R source, currently consisting of

Douglas Bates John Chambers Peter Dalgaard Robert Gentleman Kurt Hornik
Ross Ihaka Tomas Kalibera Michael Lawrence Friedrich Leisch Uwe Ligges
Thomas Lumley Martin Maechler Martin Morgan Paul Murrell Martyn Plummer
Brian Ripley Deepayan Sarkar Duncan Temple Lang Luke Tierney Simon Urbanek

plus Heiner Schwarte up to October 1999, Guido Masarotto up to June 2003, Stefano Iacus up to July 2014, Seth Falcon up to August 2015, and Duncan Murdoch up to September 2017.

R == "base" + packages

R became "viral" thanks to CRAN's package system (Kurt Hornik and the CRAN team).

R Programming

1. The ART of R Programming

Learn from the Masters

Obi-Wan Kenobi to Luke Skywalker: "Use the Force, Luke!" ('Star Wars')

Use the Source!

Read R Source code (to learn from the Masters)

  1. Instead of just "googling" and installing the package: Learn to "Use the source!" (a small effort, yes…): see Jenny Bryan's https://github.com/jennybc/access-r-source

  2. Download the R source, and package sources instead of just installing them.
    • CRAN (and Bioconductor) should be your primary source,
    • everything else (GitHub, …) only if you know that the authors and the website are trustworthy
  3. Do read package sources, even before you write your first package. Hence, start to love files such as <mypkg>_<n.m>.tar.gz
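Reading source can also start inside a running R session. The following is a minimal sketch, not a replacement for the tarballs: the choice of sd() and droplevels() is arbitrary, and the package named in the commented-out download call is only an example.

```r
## Typing a function's name without parentheses prints its definition:
stats::sd

## S3 methods are fetched explicitly by generic and class:
m <- getS3method("droplevels", "data.frame")
m

## getAnywhere() also finds objects that a namespace does not export:
getAnywhere("droplevels.data.frame")

## For the full sources (with comments, tests, C code) get the tarball
## instead of installing; needs network access, hence commented out here:
## download.packages("Matrix", destdir = tempdir(), type = "source")
```

Functions printed this way may have lost their comments, which is one more reason to open the <mypkg>_<n.m>.tar.gz file itself.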

2. R Programming with STYLE

The Stack Overflow Keyboard

DRY, not WET !

From Wikipedia, the free encyclopedia

  • DRY: Don't Repeat Yourself
  • DRY principle := "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system".
  • From the book The Pragmatic Programmer by Andy Hunt and Dave Thomas
  • When the DRY principle is applied successfully, a modification of any single element of a system does not require a change in other logically unrelated elements. Additionally, elements that are logically related all change predictably and uniformly, and are thus kept in sync.

DRY vs WET solutions

  • Violations of DRY : typically referred to as WET solutions
  • "WET" stands for either
    1. "write everything twice",
    2. "we enjoy typing" , or
    3. "waste everyone's time".
  • Copy-Paste Programming is bad
  • Question: do useRs and progRammers prefer WET solutions ??
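A small sketch of the difference in R. The toy data frame d and the helper rescale01() are made up for this illustration; they do not come from the talk.

```r
## Toy data, invented for this example
d <- data.frame(a = c(1, 5, 9), b = c(2, 4, 10), c = c(0, 3, 6))

## WET: the same rescaling logic is pasted once per column
wet <- d
wet$a <- (wet$a - min(wet$a)) / (max(wet$a) - min(wet$a))
wet$b <- (wet$b - min(wet$b)) / (max(wet$b) - min(wet$b))
wet$c <- (wet$c - min(wet$c)) / (max(wet$c) - min(wet$c))

## DRY: the knowledge "how to rescale" lives in exactly one place
rescale01 <- function(x) {
    r <- range(x)
    (x - r[1]) / (r[2] - r[1])
}
dry <- d
dry[] <- lapply(d, rescale01)   # 'dry[] <-' keeps the data frame structure

## Both versions agree; only one of them is easy to change later
stopifnot(all.equal(wet, dry))
```

If the rescaling rule ever changes, the DRY version needs one edit and the WET version needs three, each a chance for a copy-paste slip.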

A few remarks about style

  • Please[1] use the "left arrow": the nice and expressive <- instead of =, which has many other meanings in R. (R is a functional language: assignments should stand out.)
    • RStudio and ESS both have the shortcut [Alt]- to produce " <- " (4 chars, incl. 2 spaces)

([1] Martin Mächler; Hadley Wickham in http://adv-r.had.co.nz/Style.html)
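One of the "other meanings" of = can be sketched as follows. The scratch environment e is an assumption of this example, used only so that it does not touch your workspace.

```r
e <- new.env()   # scratch environment, used only for this demonstration

## '=' inside a call matches the argument named 'x' of mean();
## no variable is created
local(mean(x = c(1, 2, 3)), envir = e)
before <- exists("x", envir = e, inherits = FALSE)
before   # FALSE

## '<-' inside a call assigns 'x' and then passes the value on
local(mean(x <- c(1, 2, 3)), envir = e)
after <- exists("x", envir = e, inherits = FALSE)
after    # TRUE
```

With <- the assignment is visible at a glance, whereas = has to be read in context to tell whether it names an argument or assigns.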

  • Do comment your R code: You ("your future self") will be glad already in 1 month

  • Use spaces in your R code .. and be better than pretty printers (e.g. from R's source):

droplevels.data.frame <- function(x, except = NULL, exclude, ...)
  {
    ix <- vapply(x, is.factor, NA)
    if (!is.null(except)) ix[except] <- FALSE
    x[ix] <- if(missing(exclude))
                  lapply(x[ix], droplevels)
             else lapply(x[ix], droplevels, exclude=exclude)
    x
  }
  • Apart from data frames, there are also matrices; they are sometimes much more efficient (e.g., Colin Gillespie's "Efficient R Programming")
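A rough way to see the difference on your own machine. The sizes are chosen arbitrarily, and timings vary by machine and R version, so none are shown.

```r
## The same numbers, once as a matrix and once as a data frame
set.seed(1)
m  <- matrix(rnorm(1e5), ncol = 10)
df <- as.data.frame(m)

## Row-wise indexing: each df[i, ] builds a new one-row data frame,
## whereas m[i, ] just extracts a numeric vector
system.time(for (i in 1:1000) df[i, ])
system.time(for (i in 1:1000) m[i, ])

## The extracted values agree
same <- all.equal(unlist(df[3, ], use.names = FALSE), m[3, ])
same
```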

3. Functions, Testing, Packages

Everything that happens in R …

John Chambers: In R,

  • Everything that exists
    • is an Object
  • Everything that happens
    • is a function call
  • You call functions all the time
  • ==> do write your own functions all the time, e.g. inside other functions
  • Do learn about
lapply(X, FUN, ...)
do.call(<FUN>, args, ...)
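A short sketch of both functions. The objects samples and pieces are toy data invented for this example.

```r
## lapply(): call the same function on every element; always returns a list
samples <- list(a = 1:3, b = c(2, 4, 6, 8), c = 10)
means <- lapply(samples, mean)
str(means)

## Further arguments after FUN are passed on to each call:
lapply(samples, quantile, probs = 0.5)

## A small function written on the spot, used inside lapply():
pieces <- lapply(1:3, function(i) data.frame(id = i, square = i^2))

## do.call(): call a function with its arguments supplied as a list;
## here equivalent to rbind(pieces[[1]], pieces[[2]], pieces[[3]])
tab <- do.call(rbind, pieces)
tab
```

do.call() is what lets the number of arguments be decided at run time, which a hand-written rbind(...) call cannot do.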

Test (in) your functions

  • Can use

    if( <problem> )  stop(.....)
  • Nicer, as "assertion":

    stopifnot( <must_be_1> , <must_be_2> ,  ...)

or alternatively, since R 3.5.0,

    stopifnot(exprs = {
      <must_be_1>
      <must_be_2>
      ...
    })

(and yes, I know there are about a dozen packages for testing …)
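As an illustration, here are assertions at the top of a function. geo_mean() is a hypothetical helper written for this example; it uses the classic stopifnot() form, so it also runs in R versions before 3.5.0.

```r
## Hypothetical helper: geometric mean with argument checks
geo_mean <- function(x) {
    stopifnot(is.numeric(x), length(x) >= 1, all(x > 0))
    exp(mean(log(x)))
}

geo_mean(c(1, 10, 100))      # about 10

## A violated assertion signals an error whose message names the
## failing expression; caught here so that the script continues
res <- tryCatch(geo_mean(c(1, -2)),
                error = function(e) conditionMessage(e))
res
```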

Put your functions in a package

  • and test them even more thoroughly
  • write nice help pages including examples
  • vignettes, …

…. (building R packages that are sustainable — another lecture) …

4. New in R 3.5.0

and more in future R : ALTREP

NEWS - from new releases of R :

ALTREP

ALTernative object REPresentations

ALTREP – as already in R 3.5.0

Adapted from https://svn.r-project.org/R/branches/ALTREP/ALTREP.html#sample_implementations :

Compact Integer Vectors

  • Vectors n1:n2 , seq_along(n) and seq_len(m): represented compactly in terms of their start and end values.

In R 3.4.4 (and older):

## > x <- 1:1e10
## Error: cannot allocate vector of size 74.5 Gb

whereas in R 3.5.0++ and the ALTREP branch it works (very fast):

x <- 1:1e10

Compact Integer Vectors (2)

  • Implication: a for loop over a large range of integers that stops early no longer needs to allocate and fill in the full vector. In R 3.5.0++ (and the ALTREP branch), this is instantaneous (a few ms), whereas in R 3.4.4 it needs about 1.5 sec:
system.time(for (i in 1:1e9) break)
##    user  system elapsed 
##   0.013   0.000   0.013

The .Internal(inspect()) function shows that a compact representation has been used:

.Internal(inspect(x))
## @8805628 14 REALSXP g0c0 [MARK,NAM(3)]  1 : 10000000000 (compact)

Compact Integer Vectors (3)

sum(x) is very smart (using alleged Gauss-as-first-grader formula \(n(n+1)/2\)):

system.time(print(sum(x)))
## [1] 5e+19
##    user  system elapsed 
##       0       0       0
n <- length(x)
sum(x) == n*(n+1)/2
## [1] TRUE
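For reference, the closed form evaluated for the length used here, \(n = 10^{10}\):

```latex
\[
  \sum_{i=1}^{n} i \;=\; \frac{n(n+1)}{2}
  \;=\; \frac{10^{10}\,(10^{10}+1)}{2}
  \;=\; 5 \times 10^{19} + 5 \times 10^{9}
\]
```

At R's default of 7 significant digits this prints as 5e+19, matching the output above.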

and mean(x) has been adapted to extract values in blocks rather than expand x (but it is still much too slow here)

Compact Integer Vectors (4)

The following R code is not even imaginable in regular R before R 3.5.0:

x <- 1:1e15
object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really
## 8000000000000048 bytes
is.unsorted(x) # FALSE : i.e., R *knows* it is sorted
## [1] FALSE
xs <- sort(x)  # instantaneous !
.Internal(inspect(xs))
## @6dce478 14 REALSXP g0c0 [NAM(3)]  1 : 1000000000000000 (compact)
anyNA(x)       # FALSE : i.e., R *knows* it contains no NA's
## [1] FALSE
## (This does *NOT* fit on the slide:
try( l4 <- x < 4 ) # no way [in principle *could* be compacted]
x <- 10:1e15
.Internal(inspect(x))
## @7a4ae88 14 REALSXP g0c0 [NAM(3)]  10 : 1000000000000000 (compact)

Deferred String Conversions

  • Conversion of numbers to strings is expensive
  • Happens for the (rarely needed!) default row labels on design matrices, which are created as
as.character(1 : nrow)

In ALTREP, the C-internal coerce() function returns a deferred string coercion. Examples:

i8 <- 1:1e8
ce8 <- as.character(i8)
.Internal(inspect(ce8))
## @6d17440 16 STRSXP g0c0 [NAM(3)]   <deferred string conversion>
##   @6e202a8 13 INTSXP g0c0 [NAM(3)]  1 : 100000000 (compact)

Deferred String Conversions -2-

system.time(print(c35 <- ce8[1e7 + 3:5])) # very fast
## [1] "10000003" "10000004" "10000005"
##    user  system elapsed 
##       0       0       0
.Internal(inspect(ce8)) # unchanged
## @6d17440 16 STRSXP g1c0 [MARK,NAM(3)]   <deferred string conversion>
##   @6e202a8 13 INTSXP g1c0 [MARK,NAM(3)]  1 : 100000000 (compact)
head(ce8)
## [1] "1" "2" "3" "4" "5" "6"
.Internal(inspect(ce8)) # unchanged
## @6d17440 16 STRSXP g1c0 [MARK,NAM(3)]   <deferred string conversion>
##   @6e202a8 13 INTSXP g1c0 [MARK,NAM(3)]  1 : 100000000 (compact)

Deferred String Conversions -3-

.Internal(inspect(ce6 <- as.character(1 : 1e6))) # shorter version
## @6f9b938 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
##   @6f9b890 13 INTSXP g0c0 [NAM(3)]  1 : 1000000 (compact)
system.time( ce6[1] <- "a" ) # 350 ms
##    user  system elapsed 
##   0.320   0.016   0.337
.Internal(inspect(ce6)) # now <expanded string conversion>
## @7f306b7fc010 16 STRSXP g0c7 [NAM(1)] (len=1000000, tl=0)
##   @1f809a0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
##   @21b8d50 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "2"
##   @1f6fc68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "3"
##   @3d61c28 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "4"
##   @212ce68 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "5"
##   ...

Speedup for lm()

Benefit: significant speedup in lm() for fitting a model with large \(n\) (R 3.5.0++):

n <- 1e7 ;  x <- rnorm(n) ;  y <- rnorm(n)
system.time(lm(y ~ x))
##    user  system elapsed 
##   1.860   0.507   2.383
system.time(lm(y ~ x))
##    user  system elapsed 
##   1.351   0.460   1.820

The elapsed timings were 11.3 and 8.8 seconds in R 3.4.4 (on the same laptop).

Speedup: due entirely to not creating the row labels for the design matrix.
Yes, there are faster ways: .lm.fit(cbind(1, x), y) is \(3 \times\) faster.
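A quick check that the bare fitter gives the same estimates as lm(). This sketch uses a smaller n than the slide so that it runs quickly anywhere; the seed is arbitrary.

```r
set.seed(42)
n <- 1e5 ;  x <- rnorm(n) ;  y <- rnorm(n)

fit  <- lm(y ~ x)                  # full model object: formula, model frame, ...
ffit <- .lm.fit(cbind(1, x), y)    # bare QR fit on a ready-made design matrix

## Same coefficients (intercept, slope), up to numerical noise
same_coef <- all.equal(unname(coef(fit)), unname(ffit$coefficients))
same_coef
```

The price of the speed is convenience: .lm.fit() returns a plain list, so summary(), predict() and friends are not available for its result.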

Summary:

  • Martin M. \(\in\) R Core — R Core makes "base R"
  • The ART of R Programming : Use the Source, Luke!
  • R Programming with STYLE : Aim for DRY, not WET!
  • Functions, Testing, Packages : Write more functions, incl. 1-liners
  • New in R 3.5.0 (w/ more impact in the future): ALTernative REPresentations

 

  • That's all Folks!
  • Questions, Remarks ?