What I find important when R Programming

May 15, 2018 @ eRum Budapest

R and Me: How I got involved

The history of R is well described on Wikipedia,
notably in its ref. [19]:
I became involved in 1993;
1995: R became Free Software (licensed under the GPL; FOSS := Free and Open-Source Software),
… and I have been part of the R Core team since its formation in 1997.

R is made by R Core Team

https://www.r-project.org/contributors.html

… Since mid-1997 there has been a core group with write access to the R source, currently consisting of

Douglas Bates	John Chambers	Peter Dalgaard	Robert Gentleman	Kurt Hornik
Ross Ihaka	Tomas Kalibera	Michael Lawrence	Friedrich Leisch	Uwe Ligges
Thomas Lumley	Martin Maechler	Martin Morgan	Paul Murrell	Martyn Plummer
Brian Ripley	Deepayan Sarkar	Duncan Temple Lang	Luke Tierney	Simon Urbanek

plus Heiner Schwarte up to October 1999, Guido Masarotto up to June 2003, Stefano Iacus up to July 2014, Seth Falcon up to August 2015, and Duncan Murdoch up to September 2017.

R == "base" + packages

R became "viral" thanks to CRAN's package system (Kurt Hornik and the CRAN team).

R Programming

1. The ART of R Programming

Learn from the Masters

Obi-Wan Kenobi to Luke Skywalker: "Use the Force, Luke!" ('Star Wars')

Use the Source!

Read R Source code (to learn from the Masters)

Instead of just "googling" and installing the package: Learn to "Use the source!" (a small effort, yes…): see Jenny Bryan's https://github.com/jennybc/access-r-source
Download the R source and the package sources instead of just installing them.
- CRAN (and Bioconductor) should be your primary source,
- everything else (GitHub, …) only if you know that the authors and the website are trustworthy
Do read package sources, even before you write your first package. Hence, start to love files such as <mypkg>_<n.m>.tar.gz
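
A minimal sketch of reading source from within a running R session, as a first step before downloading tarballs. It uses droplevels only because that function appears later in these slides; the download lines are left commented out since they need a network connection and a real package name.

```r
## Typing a function's name without parentheses prints its R source
droplevels

## For S3 generics: list the methods, then fetch one explicitly
methods(droplevels)
f <- getS3method("droplevels", "data.frame")
f

## Getting a source tarball instead of installing (not run here):
## download.packages("<mypkg>", destdir = tempdir(), type = "source")
## untar("<mypkg>_<n.m>.tar.gz")
```

Functions implemented in C only show a .Internal() or .Primitive() call this way; for those, the downloaded R sources are the place to look.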

R Programming

2. R Programming with STYLE

The Stackoverflow Keyboard

May 7, 2018 on the "I'm Programmer" page: https://www.improgrammer.net/stackoverflow-keyboard/

DRY, not WET !

From Wikipedia, the free encyclopedia

DRY: Don't Repeat Yourself

DRY principle := "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system".

From Andy Hunt and Dave Thomas's book The Pragmatic Programmer.
When the DRY principle is applied successfully, a modification of any single element of a system does not require a change in other logically unrelated elements. Additionally, elements that are logically related all change predictably and uniformly, and are thus kept in sync.

DRY vs WET solutions

Violations of DRY : typically referred to as WET solutions

"WET" stands for either
1. "write everything twice",
2. "we enjoy typing" , or
3. "waste everyone's time".

Copy-Paste Programming is bad
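
A small illustration of the difference, as a sketch only: the data frame scores and the helper standardize() are made up for this example. The WET version pastes the same formula once per column; the DRY version states it once and applies it everywhere.

```r
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 8))

## WET: the standardization formula is copy-pasted per column
wet <- scores
wet$a <- (wet$a - mean(wet$a)) / sd(wet$a)
wet$b <- (wet$b - mean(wet$b)) / sd(wet$b)
wet$c <- (wet$c - mean(wet$c)) / sd(wet$c)

## DRY: the formula has one authoritative representation
standardize <- function(v) (v - mean(v)) / sd(v)
dry <- scores
dry[] <- lapply(scores, standardize)

all.equal(wet, dry)  # both approaches give the same numbers
```

If the definition of "standardize" ever changes (say, to use the median), the DRY version needs one edit, the WET version three.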

Question: do useRs and progRammers prefer WET solutions ??

A few remarks about style

Please¹ use the "left arrow": the nice and expressive <- instead of the =, which has many other meanings in R. (R is a functional language: assignments should stand out.)
- RStudio and ESS both have the shortcut [Alt]- to produce " <- " (4 chars, incl. 2 spaces)

(¹ Martin Mächler; Hadley Wickham in http://adv-r.had.co.nz/Style.html)
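
One of the "other meanings" of = is naming an argument in a function call. The following sketch (the wrapper demo_assign() is hypothetical, used only to keep the example self-contained) shows that = inside a call creates no variable, whereas <- does:

```r
demo_assign <- function() {
  m1 <- median(x = 1:10)   # '=' only names the argument 'x' of median()
  has_x <- exists("x", envir = environment(), inherits = FALSE)

  m2 <- median(y <- 1:10)  # '<-' assigns y here, then passes its value
  has_y <- exists("y", envir = environment(), inherits = FALSE)

  c(has_x = has_x, has_y = has_y)
}

demo_assign()  # has_x is FALSE, has_y is TRUE
```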

Do comment your R code: You ("your future self") will be glad within a month
Use spaces in your R code .. and be better than pretty printers (e.g. from R's source):

droplevels.data.frame <- function(x, except = NULL, exclude, ...)
  {
    ix <- vapply(x, is.factor, NA)
    if (!is.null(except)) ix[except] <- FALSE
    x[ix] <- if(missing(exclude))
                  lapply(x[ix], droplevels)
             else lapply(x[ix], droplevels, exclude=exclude)
    x
  }

Apart from data frames, there are also matrices; they are sometimes much more efficient (e.g., Colin Gillespie's "Efficient R Programming")
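
A rough sketch of how one might check this on one's own machine, assuming purely numeric data (where a matrix is a natural fit). The sizes and the seed are arbitrary, and the timings vary by machine, so none are quoted here.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
m  <- matrix(rnorm(1e4), nrow = 100)
df <- as.data.frame(m)

## Same numbers, two containers
rs_m  <- rowSums(m)
rs_df <- rowSums(df)
all.equal(unname(rs_m), unname(rs_df))

## Row extraction, repeated; compare the two elapsed times
system.time(for (i in 1:100) m[i, ])
system.time(for (i in 1:100) df[i, ])
```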

R Programming

3. Functions, Testing, Packages

Everything that happens in R …

John Chambers: In R,

Everything that exists
- is an Object

Everything that happens
- is a function call

You call functions all the time
==> do write your own functions all the time, e.g. inside other functions

Do learn about

lapply(X, FUN, ...)
do.call(<FUN>, args, ...)
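
A minimal sketch of the two working together: lapply() calls a (here anonymous, one-line) function once per element and returns a list, and do.call() then passes that whole list as the arguments of a single call. The toy data are made up for illustration.

```r
## One small data frame per input value
parts <- lapply(1:3, function(i) data.frame(id = i, sq = i^2))

## Equivalent to rbind(parts[[1]], parts[[2]], parts[[3]]),
## but works for a list of any length
combined <- do.call(rbind, parts)
combined
```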

Test (in) your functions

Can use
```
if( <problem> )  stop(.....)
```

Nicer, as "assertion":

stopifnot( <must_be_1> , <must_be_2> ,  ...)

or alternatively, since R 3.5.0,

    stopifnot(exprs = {
      <must_be_1>
      <must_be_2>
      ...
    })

(and yes, I know there are about a dozen packages for testing …)
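
A short sketch of such assertions inside a function. The helpers geo_mean() and geo_mean2() are hypothetical, written only for this illustration; the second uses the exprs form and therefore needs R 3.5.0 or later.

```r
## Geometric mean with argument checks up front
geo_mean <- function(x) {
  stopifnot(is.numeric(x), length(x) >= 1L, all(x > 0))
  exp(mean(log(x)))
}

## The same checks in the 'exprs' form (R >= 3.5.0)
geo_mean2 <- function(x) {
  stopifnot(exprs = {
    is.numeric(x)
    length(x) >= 1L
    all(x > 0)
  })
  exp(mean(log(x)))
}

geo_mean(c(1, 10, 100))                        # close to 10
res <- try(geo_mean(c(1, -1)), silent = TRUE)  # fails the all(x > 0) check
inherits(res, "try-error")                     # TRUE
```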

Put your functions in a package

and test them even more thoroughly
write nice help pages including examples
vignettes, …

…. (building R packages that are sustainable — another lecture) …

4. New in R 3.5.0

and more in future R : ALTREP

NEWS - from new releases of R :

Click "News" on R webpage: https://www.r-project.org/news.html
- linking to https://cran.r-project.org/doc/manuals/r-release/NEWS.html
- Many new features , notably
ALTREP (not prominent in NEWS) is remarkable and may have a large impact. It is documented in R's source under branches/ALTREP (or, with extra exploration features, in its GitHub mirror).

ALTREP

ALTernative object REPresentations

At Stanford in 2016, after useR!, Gabe Becker talked about his ideas on the potential of alternative object representations, and (R Core member) Luke Tierney agreed to form a working group, as some of this had been floating around for quite some time.
Provisional documentation: https://svn.r-project.org/R/branches/ALTREP/ALTREP.html (Orig: Nov. 2016; with updates up to Nov. 13, 2017)
Another talk by Luke (at Ross Ihaka's farewell conference, Dec. 2017): http://homepage.stat.uiowa.edu/~luke/talks/nzsa-2017.pdf

ALTREP – as already in R 3.5.0

Adapted from https://svn.r-project.org/R/branches/ALTREP/ALTREP.html#sample_implementations :

Compact Integer Vectors

Vectors n1:n2 , seq_along(n) and seq_len(m): represented compactly in terms of their start and end values.

In R 3.4.4 (and older):

## > x <- 1:1e10
## Error: cannot allocate vector of size 74.5 Gb

where in R 3.5.0++ and the ALTREP branch it works (very fast):

x <- 1:1e10

Compact Integer Vectors (2)

Implication: for loop over a large range of integers that stops early no longer needs to allocate and fill in the full vector: In R 3.5.0++ (and the ALTREP branch), this is instantaneous (a few ms), where in R 3.4.4 it needs about 1.5 sec:

system.time(for (i in 1:1e9) break)

##    user  system elapsed 
##   0.013   0.000   0.013

The .Internal(inspect()) function shows that a compact representation has been used:

.Internal(inspect(x))

## @8805628 14 REALSXP g0c0 [MARK,NAM(3)]  1 : 10000000000 (compact)

Compact Integer Vectors (3)

sum(x) is very smart (using alleged Gauss-as-first-grader formula $n(n+1)/2$):

system.time(print(sum(x)))

## [1] 5e+19

##    user  system elapsed 
##       0       0       0

n <- length(x)
sum(x) == n*(n+1)/2

## [1] TRUE
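
The printed value can be checked by hand. A short worked calculation, assuming x is still 1:1e10 from the earlier slides, so that $n = 10^{10}$:

$$
\sum_{i=1}^{n} i \;=\; \frac{n(n+1)}{2}
\;=\; \frac{10^{10}\,(10^{10}+1)}{2}
\;=\; 5\cdot 10^{19} + 5\cdot 10^{9}
\;\approx\; 5.0000000005\times 10^{19},
$$

which R prints as 5e+19 at its default 7 significant digits.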

and mean(x) has been adapted to extract (in blocks) and not expand x (but still is much too slow here)

Compact Integer Vectors (4)

The following R code is not even imaginable in regular R before R 3.5.0:

x <- 1:1e15
object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really

## 8000000000000048 bytes

is.unsorted(x) # FALSE : i.e., R *knows* it is sorted

## [1] FALSE

xs <- sort(x)  # instantaneous !
.Internal(inspect(xs))

## @6dce478 14 REALSXP g0c0 [NAM(3)]  1 : 1000000000000000 (compact)

anyNA(x)       # FALSE : i.e., R *knows* it contains no NA's

## [1] FALSE

## (This does *NOT* fit on the slide):
try( l4 <- x < 4 ) # no way [in principle *could* be compacted]
x <- 10:1e15
.Internal(inspect(x))

## @7a4ae88 14 REALSXP g0c0 [NAM(3)]  10 : 1000000000000000 (compact)

Deferred String Conversions

Conversion of numbers to strings is expensive.
It happens for the default (rarely needed!) row labels on design matrices, which are created as

as.character(1 : nrow)

In ALTREP, the C-internal coerce() function returns a deferred string coercion. Examples:

i8 <- 1:1e8
ce8 <- as.character(i8)
.Internal(inspect(ce8))

## @6d17440 16 STRSXP g0c0 [NAM(3)]   <deferred string conversion>
##   @6e202a8 13 INTSXP g0c0 [NAM(3)]  1 : 100000000 (compact)

Deferred String Conversions -2-

system.time(print(c35 <- ce8[1e7 + 3:5])) # very fast

## [1] "10000003" "10000004" "10000005"

##    user  system elapsed 
##       0       0       0

.Internal(inspect(ce8)) # unchanged

## @6d17440 16 STRSXP g1c0 [MARK,NAM(3)]   <deferred string conversion>
##   @6e202a8 13 INTSXP g1c0 [MARK,NAM(3)]  1 : 100000000 (compact)

head(ce8)

## [1] "1" "2" "3" "4" "5" "6"

.Internal(inspect(ce8)) # unchanged

## @6d17440 16 STRSXP g1c0 [MARK,NAM(3)]   <deferred string conversion>
##   @6e202a8 13 INTSXP g1c0 [MARK,NAM(3)]  1 : 100000000 (compact)

Deferred String Conversions -3-

.Internal(inspect(ce6 <- as.character(1 : 1e6))) # shorter version

## @6f9b938 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
##   @6f9b890 13 INTSXP g0c0 [NAM(3)]  1 : 1000000 (compact)

system.time( ce6[1] <- "a" ) # 350 ms

##    user  system elapsed 
##   0.320   0.016   0.337

.Internal(inspect(ce6)) # now <expanded string conversion>

## @7f306b7fc010 16 STRSXP g0c7 [NAM(1)] (len=1000000, tl=0)
##   @1f809a0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
##   @21b8d50 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "2"
##   @1f6fc68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "3"
##   @3d61c28 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "4"
##   @212ce68 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "5"
##   ...

Speedup for `lm()`

Benefit: significant speedup in lm() for fitting a model with large $n$ (R 3.5.0++):

n <- 1e7 ;  x <- rnorm(n) ;  y <- rnorm(n)
system.time(lm(y ~ x))

##    user  system elapsed 
##   1.860   0.507   2.383

system.time(lm(y ~ x))

##    user  system elapsed 
##   1.351   0.460   1.820

The elapsed timings were 11.3 and 8.8 seconds in R 3.4.4 (on the same laptop).

Speedup: due entirely to not creating the row labels for the design matrix.
Yes, there are faster ways: .lm.fit(cbind(1, x), y) is $3 \times$ faster.
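
A small sketch checking that the two routes agree, on a much smaller n than above so that it runs quickly. The seed and sample size are arbitrary choices for this illustration, and no timings are claimed.

```r
set.seed(1)                        # arbitrary seed
n <- 1e4
x <- rnorm(n) ; y <- rnorm(n)

fit_lm   <- lm(y ~ x)              # full model object, with formula handling
fit_fast <- .lm.fit(cbind(1, x), y)  # bare least-squares fit on a design matrix

## Intercept and slope agree up to numerical tolerance
all.equal(unname(coef(fit_lm)), as.vector(fit_fast$coefficients))
```

The price of the faster route is convenience: .lm.fit() takes no formula and returns a plain list, so summary(), predict() and friends are not available on its result.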

Summary:

Martin M. $\in$ R Core — R Core makes "base R"
The ART of R Programming : Use the Source, Luke!
R Programming with STYLE : Aim for DRY, not WET!
Functions, Testing, Packages : Write more functions, incl. 1-liners
New in R 3.5.0 (w/ more impact in the future): ALTernative REPresentations

That's all Folks!

Questions, Remarks ?
