%\VignetteIndexEntry{Introducing the float package: 32-Bit Floats for R} \documentclass[]{article} \input{./include/settings} \mytitle{Introducing the {float} package: 32-Bit Floats for {R}} \mysubtitle{} \myversion{0.2-4} \myauthor{ \centering Drew Schmidt \\ \texttt{wrathematics@gmail.com} } \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} \begin{document} \makefirstfew \section{Introduction}\label{introduction} R has a "numeric" type for vectors and matrices. This type must be either integer or double precision. As such, R has no real ability to work with 32-bit floats. However, sometimes single precision (or less!) is more than enough for a particular task. The \pkg{float} package~\cite{float} extends R's linear algebra facilities to include single precision (float) data. Float vectors/matrices have half the precision of their "numeric"-type counterparts, for a performance vs accuracy trade-off. The internal representation is an S4 class, which allows us to keep the syntax identical to that of base R's. Interaction between base types for binary operators is generally possible. In these cases, type promotion always defaults to the higher precision (more on this in Section~\ref{typepromotion}). The package ships with copies of the single precision 'BLAS' and 'LAPACK', which are automatically built in the event they are not available on the system. \subsection{Installation}\label{installation} You can install the stable version from CRAN using the usual \texttt{install.packages()}: \begin{lstlisting}[language=rr] install.packages("float") \end{lstlisting} The development version is maintained on GitHub. You can install this version using any of the well-known installer packages available to R: \begin{lstlisting}[language=rr] remotes::install_github("wrathematics/float") \end{lstlisting} Note that for best performance, you will need to build the package from source, either from the GitHub repository or from the CRAN source distribution. See Section~\ref{blasnlapack} for more details as to why a source installation is recommended. \subsection{BLAS and LAPACK Libraries}\label{blasnlapack} The linear algebra operations in the \pkg{float} package are handled by the BLAS and LAPACK~\cite{lawson1979basic,anderson1999lapack}. By default, R will not ship with the single precision versions of these functions, so we include a source code distribution within the package. This is the "reference" or NetLib implementation, which is not particularly efficient. Additionally, compiling these can take a very long time. To take advantage of the enhanced run-time performance and reduced compilation times of using tuned BLAS/LAPACK with \pkg{float}, you will need to choose an implementation and install it. Typical implementations include \pkg{OpenBLAS}~\cite{OpenBLAS}, Intel \pkg{MKL}~\cite{mkl}, AMD \pkg{ACML}~\cite{acml}, \pkg{Atlas}~\cite{atlas}, and Apple's \pkg{Accelerate}~\cite{accelerate}. You can read more about using different BLAS implementations with R in the R Installation and Administration manual~\cite{Rblas}. Once you switch BLAS implementations with R, you will need to rebuild the \pkg{float} package from source. \section{For Users} \subsection{Basics} R does not have a 32-bit float type (hence the package). You can cast your data from integer/numeric to float using \code{fl()} (you can also cast a float to a numeric via \code{dbl()}): \begin{lstlisting}[language=rr] library(float) x = matrix(1:9, 3) x ## [,1] [,2] [,3] ## [1,] 1 4 7 ## [2,] 2 5 8 ## [3,] 3 6 9 s = fl(x) s ## # A float32 matrix: 3x3 ## [,1] [,2] [,3] ## [1,] 1 4 7 ## [2,] 2 5 8 ## [3,] 3 6 9 \end{lstlisting} This will of course require 1.5x the memory of the input matrix (storing it as both a double and as a float). For workflows requiring many operations, the memory savings will still be substantial. At this time, we do not have a reader, so casting is the best way to go. However, once you cast the matrix to a float, you can serialize it as usual with \code{save()} and/or \code{saveRDS()}. For testing or other cases where random matrices are needed (e.g., PCA via random normal projections~\cite{halko2011finding}), we include several random generators. The functions \code{flrunif()} and \code{flrnorm()} are somewhat like R's \code{runif()} and \code{rnorm()} in that they produce vectors (but also matrices) of floating point random uniform/normal values: \begin{lstlisting}[language=rr] set.seed(1234) flrunif(5) ## # A float32 vector: 5 ## [1] 0.1137034 0.6222994 0.6092747 0.6233795 0.8609154 flrunif(2, 3) ## # A float32 matrix: 2x3 ## [,1] [,2] [,3] ## [1,] 0.640310585 0.2325505 0.5142511 ## [2,] 0.009495757 0.6660838 0.6935913 flrunif(5, min=10, max=20) ## # A float32 vector: 5 ## [1] 15.44975 12.82734 19.23433 12.92316 18.37296 \end{lstlisting} Arbitrary generators can be used with the \code{flrand()} interface. It behaves more like R's \code{runif()}, \code{rnorm()}, etc., except that it accepts a generator function for its first argument. For example: \begin{lstlisting}[language=rr] flrand(generator=rexp, n=5, rate=.1) ## # A float32 vector: 5 ## [1] 8.624105 6.745913 8.380404 7.604303 18.800766 flrand(function(n) sample(5, size=n, replace=TRUE), 5) ## # A float32 vector: 5 ## [1] 2 2 2 1 1 \end{lstlisting} This is conceptually similar to first generating \code{n} random values and then casting them over to floats, but more memory efficient. The function processes the generator data in 4KiB chunks (for double precision generators). \subsection{Arithmetic and Type Promotion}\label{typepromotion} Perhaps a mistake in hindsight, but floats and numeric vectors/matrices will interoperate with each other in binary arithmetic operations. So you can multiply \code{2L}, \code{2.0}, and \code{fl(2)} in any binary combination you like. But the output will be determined by the highest precision; in fact, the arithmetic itself will be carried out in the highest possible precision. So adding a float matrix with a double matrix is really just adding 2 double matrices together after casting the float up (which uses quite a bit of additional memory) and returning a double matrix. This even works for more complicated functions like \code{\%*\%} (matrix multiplication). For example: \begin{lstlisting}[language=rr] set.seed(1234) x = matrix(1:4, 2) y = flrunif(2, 2) x ## [,1] [,2] ## [1,] 1 3 ## [2,] 2 4 y ## # A float32 matrix: 2x2 ## [,1] [,2] ## [1,] 0.1137034 0.6092747 ## [2,] 0.6222994 0.6233795 x %*% y ## # A float32 matrix: 2x2 ## [,1] [,2] ## [1,] 1.980602 2.479413 ## [2,] 2.716604 3.712067 storage.mode(x) = "double" x %*% y ## [,1] [,2] ## [1,] 1.980602 2.479413 ## [2,] 2.716605 3.712067 \end{lstlisting} Long story short, be careful when mixing types. \section{For Developers} \subsection{Basics} A \code{float32} matrix/vector is really a very simple S4 class. It has one slot, \code{@Data}, which should be an ordinary R integer vector or matrix. The values of that integer matrix will be interpreted as floats in the provided methods. If you wish to create your own method, say using C kernels or Rcpp~\cite{Rcpp}, then you will have to play the same game. More on that later. To create a \code{float32} object, use \code{float::float32()}: \begin{lstlisting}[language=rr] Data = 1:3 x = float32(Data) x ## # A float32 vector: 3 ## [1] 1.401298e-45 2.802597e-45 4.203895e-45 \end{lstlisting} To access the integer data of a \code{float32}, just grab the \code{@Data} slot: \begin{lstlisting}[language=rr] > x@Data ## [1] 1 2 3 \end{lstlisting} In general there's no relationship between the integer vs float interpretations of the values residing in the same block of memory, with the exception of 0: \begin{lstlisting}[language=rr] x = fl(0:3) x@Data ## [1] 0 1065353216 1073741824 1077936128 dbl(x+x) ## [1] 0 2 4 6 (x+x)@Data ## [1] 0 1073741824 1082130432 1086324736 x@Data + x@Data ## [1] 0 2130706432 NA NA ## Warning message: ## In x@Data + x@Data : NAs produced by integer overflow \end{lstlisting} So when creating new functionality not provided by existing \pkg{float} package methods, you will probably have to move to compiled code. \subsection{Compiled Code} Using 32-bit floats from \pkg{float} in compiled code is not terribly difficult, but maybe a bit annoying. The general way to proceed for a 32-bit float \code{x} is: \begin{itemize} \item Pass \code{x@Data} (an integer) to \code{.Call()} \item Inside the C/C++ function, use a \code{float} pointer to the integer data. \item Return from \code{.Call()} an integer vector/matrix. \item Put the return from \code{.Call()} (say \code{ret}) in the \code{float} S4 class: \code{float32(ret)}. \end{itemize} One can access the data with the \code{FLOAT()} macro. If writing an R package, add \pkg{float} to the \code{LinkingTo} list in the package DESCRIPTION file. Then add \code{\#include } and the macro will be available. If you are working outside the construct of a package (not recommended), then you can define the macro as follows: \begin{lstlisting}[language=cc] #define FLOAT(x) ((float*) INTEGER(x)) \end{lstlisting} This is the "\code{DATAPTR}" way of doing things, similar to \code{REAL()} for double precision and \code{INTEGER} for ints. There is no Rcpp-like idiom for floats similar to \code{NumericVector} and \code{NumericMatrix} at this time. Here's a basic example of how one would create a new function \code{add1()} (ignoring that we could just do \code{x+1}) using C. We will do this outside of a package framework for simplicity of demonstration, but again, it is recommended that you use the \code{LinkingTo} way mentioned above. \begin{lstlisting}[language=cc, title=add1.c] #include #include #define FLOAT(x) ((float*) INTEGER(x)) SEXP R_add1(SEXP x_) { SEXP ret; PROTECT(ret = allocVector(INTSXP, 1)); float *x = FLOAT(x_); FLOAT(ret)[0] = x[0] + 1.0f; UNPROTECT(1); return ret; } \end{lstlisting} Note that using \code{INTEGER(ret)[0]} instead of \code{FLOAT(ret)[0]} on line 12 above is not correct. That would first cast the value to an integer before storing the data. Then back at the R level, once put in the \code{float32} class, that integer value would be treated as though it were a \code{float}. If that explanation doesn't make sense, try modifying the above to the wrong thing and see what happens. We can build that function with \code{R CMD SHLIB add1.c}, and then call it via: \begin{lstlisting}[language=rr, title=add1.r] dyn.load("add1.so") library(float) add1 = function(x) { ret = .Call("R_add1", x@Data) float32(ret) } add1(fl(1)) ## # A float32 vector: 1 ## [1] 2 add1(fl(pi)) ## # A float32 vector: 1 ## [1] 4.141593 \end{lstlisting} Like I said, not really difficult, but annoying. \subsection{Linking and Additional Functions} \label{sec:linking} If you are writing C/C++ code on single precision vectors and matrices, there is a good chance that you will need to link with the \pkg{float} package. For sure if you want to efficiently do linear algebra (say via BLAS/LAPACK or you are using the float interface from Armadillo via \pkg{RcppArmadillo}~\cite{RcppArmadillo}), you will need to do this for CRAN safety\footnote{If you are just working on your own machine and linking with high-performance BLAS/LAPACK, then no linking is necessary. For portability, you need to link.}. To do this, you will need to set the \code{LDFLAGS} line of your \code{src/Makevars} file to include something like this: \begin{lstlisting} FLOAT_LIBS = `${R_HOME}/bin${R_ARCH_BIN}/Rscript -e "float:::ldflags()"` PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(FLOAT_LIBS) \end{lstlisting} By default, \code{float:::ldflags()} will try to dynamically link on Linux and Mac, but you can force static linking via \code{float:::ldflags(static=TRUE)}. In my opinion, dynamic linking is preferential, but you are free to make up your own mind about that. However, dynamic linking to an R package shared library is (I think?) impossible on Windows. So Windows will always statically link. In addition to BLAS and LAPACK symbols, there are a few helpers available. First, we include float values \code{NA_FLOAT} and \code{R_NaNf}, which are 32-bit analogues to \code{NA_REAL} and \code{R_NaN}. \begin{lstlisting}[language=c] int ISNAf(const float x); int ISNANf(const float x); \end{lstlisting} which you can find in the \code{float/float32.h} header. Finally, we also provide \code{float:::cppflags()} for the \code{PKG_CPPFLAGS} include flags. But using the \code{LinkingTo} field should usually be sufficient (i.e., you don't need it most of the time). One notable exception is if you are doing something goofy with a different compiler, like \code{nvcc} where you may need to explicitly pass the include flags. \section{Some Benchmarks} We will be examining two common applications from statistics which are dominated by linear algebra computations: covariance and principal components analysis. The setup for each of these benchmarks is: \begin{lstlisting}[language=rr] library(float) library(rbenchmark) set.seed(1234) reps = 5 cols = c("test", "replications", "elapsed", "relative") m = 7500 n = 500 x = matrix(rnorm(m*n), m, n) s = fl(x) \end{lstlisting} All benchmarks were performed using 2 cores of on an Intel Core i5-5200U (2.20GHz CPU) laptop running Linux and: \begin{itemize} \item gcc 7.2.0 \item R version 3.4.2 \item libopenblas 0.2.20 \end{itemize} Note that the benchmarks are highly dependent on the choice of BLAS library and hardware used. The cache sizes for this machine are: \begin{lstlisting}[language=rr] memuse::Sys.cachesize() ## L1I: 32.000 KiB ## L1D: 32.000 KiB ## L2: 256.000 KiB ## L3: 3.000 MiB \end{lstlisting} \subsection{Covariance} Since covariance is just the crossproducts matrix $x^Tx$ on mean-centered data, we can very easily create a custom covariance function: \begin{lstlisting}[language=rr] custcov = function(x) { s = scale(x, TRUE, FALSE) crossprod(s) / max(1L, nrow(x)-1) } \end{lstlisting} This function will work for numeric inputs as well as 32-bit floats. We can compare these two cases against R's internal covariance function: \begin{lstlisting}[language=rr] benchmark(custcov(x), custcov(s), cov(x), replications=reps, columns=cols) ## test replications elapsed relative ## 3 cov(x) 5 8.113 43.385 ## 2 custcov(s) 5 0.187 1.000 ## 1 custcov(x) 5 0.719 3.845 \end{lstlisting} R's \code{cov()} is clearly not designed with performance in mind. The performance difference between the \code{custcov(s)} (float) and \code{custcov(x)} (double) calls should only be about 2x. The higher performance we see is likely due to the fact that our implementation of \code{scale()} is better than R's. We can compare this to a highly optimized implementation of covariance, namely \code{covar()} from the \pkg{coop} package~\cite{coop}: \begin{lstlisting}[language=rr] benchmark(custcov(s), coop::covar(x), replications=reps, columns=cols) ## test replications elapsed relative ## 2 coop::covar(x) 5 0.358 1.817 ## 1 custcov(s) 5 0.197 1.000 \end{lstlisting} This looks more in line with what we would expect moving from double to single precision. \subsection{Principal Components Analysis} PCA is just SVD with some statistical window dressing: \begin{lstlisting}[language=rr] pca = function(x) { p = svd(scale(x, TRUE, FALSE), nu=0) p$d = p$d / max(1, sqrt(nrow(x) - 1)) names(p) = c("sdev", "rotation") p } \end{lstlisting} Once again, our function will work for both numeric inputs as well as 32-bit floats. We again compare the performance of these two cases against R's internal function (in this case, \code{prcomp()}): \begin{lstlisting}[language=rr] benchmark(pca(x), pca(s), prcomp(x), replications=reps, columns=cols) ## test replications elapsed relative ## 2 pca(s) 5 1.592 1.000 ## 3 pca(x) 5 3.663 2.301 ## 1 prcomp(x) 5 4.293 2.697 \end{lstlisting} Again, our improved \code{scale()} implementation is giving an edge (and possibly because \code{prcomp()} is doing more \emph{useful} work, as opposed to \code{cov()}\dots), although it is much less pronounced here since the SVD calculation is dominating. Indeed, the overall run time is roughly 10x higher here for the single precision PCA case compared to the single precision covariance calculation. \addcontentsline{toc}{section}{References} \bibliography{./include/float} \bibliographystyle{plain} \end{document}