\documentclass[12pt]{article} \usepackage{xspace,hyperref} \newcommand{\R}{\textbf{\textsf{R}}\xspace} \setlength{\parindent}{0pt} \setlength{\parskip}{6pt} \title{Stored Object Caches for \R} \author{Bill Venables,\\ CSIRO Mathematics, Informatics and Statistics\\ Brisbane, Australia} \date{December, 2013} % \VignetteIndexEntry{Stored Object Caches for R} \SweaveOpts{keep.source=true,strip.white=true} \begin{document} \maketitle \newpage \tableofcontents \section{Introduction} \label{sec:intro} When an object is created in \R, by default it remains in memory. If the objects are large, or numerous, memory management can become an issue even on a modern computer with large central memory. At the end of a session the objects in the global environment are usually kept in a single binary file in the working directory called \texttt{.RData}. When a new \R session begins with this as the initial working directory, the objects are loaded back into memory. If the \texttt{.RData} file is large startup may be slow, and the memory can be under pressure again right from the start of the session. Objects need not be always held in memory. The function \texttt{save} may be used to save objects on the disc in a file, typically with an \texttt{.RData} extension. The objects may then be removed from memory and later recalled explicitly with the \texttt{load} function. The \texttt{SOAR}\footnote{\texttt{SOAR} is an acronym for the four main functions of the package, \texttt{Store}, \texttt{Objects}, \texttt{Attach}, and \texttt{Remove}} package provides simple way to store objects on the disc, but in such a way that they remain visible on the search path as \emph{promises}, that is, if and when an object is needed again it is automatically loaded into memory. It uses the same \emph{lazy loading} mechanism as packages, but the functionality provided here is more dynamic and flexible. The \texttt{SOAR} package is based on an earlier package of David~Brahm called \texttt{g.data}. This earlier package was briefly described in \emph{R News}; see \cite{Rnews:Brahm:2002}. % @ % <>= % require("SOAR") % @ %def \section{Local stored object caches} \label{sec:localsoc} The \emph{working directory} for any \R session is the directory where by default \R will look for data sets or deposit text or graphical output. It can be found from within the session using the function \texttt{getwd()}. By a \emph{local stored object cache} we mean a directory, usually a sub-directory of the working directory, which will be used by \R to contain saved object \texttt{.RData} files. Users of \textbf{S-PLUS}, (a programming environment not unlike \R), will be familiar with the \texttt{.Data} sub-directory of the working directory which acts as a local stored object cache in precisely this sense. These caches are created and used by \R itself, not the user directly. To specify the way a cache works it is helpful to give an example. @ <>= ## attach the package, checking that it is installed stopifnot(require("SOAR")) ## create some dummy data X <- matrix(rnorm(1000*50), 1000, 50) S <- var(X) Xb <- colMeans(X) ## and store part of it in the default cache Store(X) @ %def At this point, if the initial workspace is empty, the global environment contains the two objects \texttt{S} and \texttt{Xb}, and the matrix is stored on the disc in a local stored object cache. @ <>= objects() ## or ls() find("X") @ %def Several things have happened. \begin{enumerate} \item A special sub-directory, called \verb|.R_Cache|, of the working directory has been created, if necessary, \item The object \texttt{X} has been saved in the cache as an \texttt{.RData} file, (with an encoded name, see later), and \emph{removed} from the global environment, \item An image of the cache has been placed at position~2 of the search path, also under the name ``\verb|.R_Cache|'', \item An object called ``\texttt{X}'' has been placed in position~2 of the search path which is really a promise to load the real \texttt{X} into position~2, as needed. \end{enumerate} The \texttt{SOAR} package provides a slightly enhanced version of the \texttt{search} function, for inspecting the search path. As well as the entries, it shows the enclosing directories, where applicable. Such an enclosing directory is called the \texttt{lib.loc} for packages and similar entries. We now continue the example:\footnote{Note that during the construction of a package vignette, such as this one, the newly formed package is installed in a temporary directory. This explains the unusual-looking \texttt{lib.loc} for \texttt{package:SOAR} below.} @ <<>>= Store(Xb, S) Search() Ls() # short alias for Objects() @ %def The \texttt{SOAR} function \texttt{Objects}\footnote{From version 0.99-9, \texttt{Ls} is an allowable alias for \texttt{Objects}, for the convenience of the typing-challenged.} is like the standard function \texttt{objects} (or equivalently \texttt{ls}), but applies only to stored object caches. In addition to listing objects, it locates the cache on the search path beforehand using both its name and \texttt{lib.loc}. As the position of the cache on the search path may wander as packages are added, and as there may be several caches on the search path under the same name, this is a useful feature. \subsection{Memory issues} \label{sec:memory} An important (but not the only) reason to use saved object caches is to release memory as much as possible. It is useful to keep track of memory usage as an object is stored and retrieved by various operations in \R, which we now do with another small, artificial example. Memory usage is largely tracked by looking at the number of ``\texttt{Vcells}'' currently in use. This is an element of the matrix which forms the output message of the garbage collector: @ <<>>= vc <- gc(); vc ; v0 <- vc["Vcells", "used"] @ %def At this point there are \Sexpr{vc[2,1]} \texttt{Vcells} in use. Now create a large \R object and manipulate it, keeping track of what happens to the \texttt{Vcells} in use, as a difference from this base level. We begin by defining a tiny function to pick out the relevant number from the \texttt{gc} output. @ <<>>= Vcells <- function() c(Vcells = gc()["Vcells", "used"]) ; Vcells()-v0 bigX <- matrix(rnorm(1000^2), 1000, 1000) ; Vcells()-v0 Store(bigX) ; Vcells()-v0 d <- dim(bigX) ; Vcells()-v0 bigX[1,1] <- 0 ; Vcells()-v0 Attach() ; Vcells()-v0 Store(bigX) ; Vcells()-v0 @ %def It is worth looking at this in some detail. \begin{itemize} \item Creating \texttt{bigX} in the global environment increases memory usage by about $10^6$ cells, as expected. \item Storing the object reduces the cells in use to around the base level again. \item Finding the dimension of \texttt{bigX} requires it to be loaded into the cache, honoring the promise to do so. Cell usage increases by about $10^6$ cells again. \item Modifying \texttt{bigX}, in this case replacing its first entry by zero, causes a \emph{second copy} of the object to be made in the global environment. Cell usage increases by a further $10^6$ cells. \item Using the \texttt{SOAR} function \texttt{Attach} re-establishes the cache on the search path. The object \texttt{bigX} at position~2 reverts to being a promise to re-load rather than the object itself, resulting in a \emph{reduction} in memory usage of about $10^6$ cells. \item Finally, storing the modified \texttt{bigX} takes it out of memory and places the new version in the cache. The promise is now to re-load the modified version and the original version is lost. \end{itemize} We can clear the cache on the disc using: @ <<>>= Remove(bigX) ; (Vcells()-v0) @ %def The effect on memory is insignificant as only the promise object has been removed. The main effect is to remove an \texttt{.RData} file from the disc. \subsection{Specifying objects for storage or removal} \label{sec:specifying} The functions \texttt{Store} and \texttt{Remove} take as their main arguments a specification of objects to be stored or removed respectively. There are four ways to do this, namely: \begin{enumerate} \item As \emph{unquoted names} of objects, such as \texttt{X}, \texttt{bigX} and the like. \item As \emph{explicit character strings} using any of the three quotation styles allowed in \R, namely single, double or backtick quotes, \item As \emph{an expression evaluating to a character string vector}, which gives the names of objects to be removed, e.g.\ \verb|objects(pattern = "^X")|, which would specify all objects whose name started with \texttt{X}. \item As a \emph{character string vector} value for the argument \texttt{list}. \end{enumerate} Hence @ <>= Store(objects()) @ %def when issued at the command line would store all objects in the global environment whose names did not begin with a period. This is a common idiom. An almost equivalent way to do this would be @ <>= objs <- objects() Store(list = objs) @ %def In this case, however, the object \texttt{objs} would remain in the global environment. Specification styles can be mixed, so a fully equivalent way would be @ <>= objs <- ls() Store(objs, list = objs) @ %def Also @ <>= Remove(Objects()) @ %def would remove from the cache all objects whose names did not begin with a period. This is seldom necessary but holds a certain appeal for some obsessively tidy minds. \subsection{\texttt{lib.loc} and \texttt{lib}} \label{sec:lib.loc.lib} All four functions, \texttt{Store}, \texttt{Objects}, \texttt{Attach} and \texttt{Remove} need to know where the cache is located on the disc. This is done in two parts, similar to the way that package locations are specified, namely by giving the \emph{enclosing directory}, known as the \texttt{lib.loc} and the \emph{cache name}, or \texttt{lib}. For local stored object caches, the default \texttt{lib} is usually \verb|.R_Cache|. More precisely, the default is either the value of the environment variable \verb|R_LOCAL_CACHE|, or \verb|.R_Cache| if this is unset. Environment variables for \R can be set in the \R session using \texttt{Sys.setenv}, or more conveniently (if the setting is intended to be generally made) by placing an entry in a \texttt{.Renviron} file in the user's home directory. Thus the following step from within \R itself @ <>= cat("\nR_LOCAL_CACHE=.R_Store\n", file = "~/.Renviron", append=TRUE) @ %def will add a line to the \texttt{.Renviron} in the user's home directory (or create one if none currently exists) which will change the default local cache name from ``\verb|.R_Cache|'' to ``\verb|.R_Store|'' permanently and generally for all four functions. Of course, \texttt{lib} names can be specified as additional named arguments in each of the functions, but some care needs to be given. For convenience, the \texttt{lib} argument may be given either as an \emph{unquoted name} or as an \emph{explicitly quoted} character string. Hence @ <>= Attach(lib = ".R_Store") Attach(lib = .R_Store) @ %def are equivalent, but @ <>= lib <- ".R_Store" Store(X, Y, Z, lib = lib) @ %def would store three objects in a local cache called literally ``\texttt{lib}''.\footnote{There is no particular reason to have the cache name begin with a period, but since they are NOT to be accessed directly by the user, doing so makes them conveniently out of sight in many contexts.} For local caches the \texttt{lib.loc} would normally be the current working directory, as given by \texttt{getwd()}, but users are free to vary this. The actual default is the value of the environment variable \verb|R_LOCAL_LIB_LOC| or \texttt{getwd()} if this is unset. Again the user may change the default for all four functions by an entry in the file \verb|~/.Renviron|, but this would be unusual. Since directory names are usually \emph{not} syntactic \R names, the option of specifying them in the argument list as unquoted names is not available. The main reason to change the \texttt{lib.loc} for a local cache is to add the objects from some other working directory cache to the present session. For example @ <>= Attach(lib.loc = "..") @ %def would attach the \verb|.R_Cache| directory from the parent of the current working directory, making those stored objects accessible to the present session as well. \section{Centrally stored object caches} \label{sec:central} In some cases objects can usefully be made available for multiple projects. One way to do this is to make the collection of objects into a package. Prior to making a formal package, though, an intermediate possibility is to store the objects in a cache directory and make it available in some easy way from any working directory on the same machine (or local area network). We term such caches as \emph{central} stored object caches, though they do not differ from local caches in any way other than in their preferred location. \subsection{\texttt{Data} and \texttt{Utils} variants} \label{sec:dataUtils} Each of the four primary functions as two variants, namely one with ``\texttt{Data}'' and the other with ``\texttt{Utils}'' added to the name. These are purely convenience functions which differ from the primary counterparts only in the default value for their \texttt{lib} and \texttt{lib.loc}. The default values for these are as follows: \begin{description} \item[\texttt{lib.loc}] For all variants, the default \texttt{lib.loc} is the value of the environment variable \verb|R_CENTRAL_LIB_LOC| or the user's home directory if this is not set. The user's home directory is found using \verb|path.expand("~")|. \item[\texttt{lib}] This differs for the two variant kinds. \begin{itemize} \item For \texttt{Data} variants the default \texttt{lib} is the value of the environment variable \verb|R_CENTRAL_DATA|, or \verb|.R_Data| if this is unset, and \item For \texttt{Utuls} variants the default \texttt{lib} is the value of the environment variable \verb|R_CENTRAL_UTILS|, or \verb|.R_Utils| if this is unset. \end{itemize} \end{description} The motivation for providing these variants is to give the user a convenient way of saving objects and making them generally visible across their \R session. We envisage that the \texttt{Data} forms will be used for data sets and the \texttt{Utils} forms for utility functions. One function we may wish to store and make generally available is the \texttt{Vcells} memory checking function we used above. A simple but useful function to have available is the converse of the binary operator \verb|%in%|, to identify which elements are \emph{not} members of the set. One way to make such an operator is: @ <>= `%ni%` <- Negate(`%in%`) @ %def If for some reason the user preferred to use \texttt{lsCache} instead of \texttt{Objects} a simple way to do this without butchering the source package\footnote{Users are free to butcher the source package, of course, but if you do so, please \emph{do not re-distribute it}, there's a sport.} would be @ <>= lsCache <- Objects @ %def These little utility functions may now be stored in the central utilities cache using: @ <>= StoreUtils(Vcells, `%ni%`, lsCache) @ %def Then in any future \R session @ <>= AttachUtils() @ %def would make \texttt{Vcells}, \verb|%ni%| and \texttt{lsCache}, in particular, available on demand.\footnote{As previously noted, \texttt{Ls} is now an allowable alias of \texttt{Objects}, and \texttt{LsUtils}, \texttt{LsData} for the variants.} Similarly @ <>= AttachData() @ %def would make the central data object cache visible and available on demand as well. Central data and utility object stores can be quite large without having an appreciable effect on memory, unless, of course, many large objects are required simultaneously. The motivation for using environment variables to specify the default values is purely one of convenience. The user's home directory may be an appropriate \texttt{lib.loc} for the centrally stored object caches, but many users would already have a reserved directory for \R related resources, in particular the add-on packages (as opposed to those which come with the release of \R itself). A suitable place for the centrally stored object caches might be alongside this package directory. Thus a typical \verb|~/.Renviron| might include entries such as \begin{verbatim} R_LIBS_USER=~/R/lib/library R_CENTRAL_LIB_LOC=~/R/lib/cache \end{verbatim} \subsection{Tricks with \texttt{.Rprofile}} \label{sec:rprofile} In addition to \texttt{.Renviron} if there is a file \texttt{.Rprofile} in the user's home directory it contains \R commands that are performed at startup for every \R session.\footnote{Unless there is a \texttt{.Rprofile} in the current working directory, which will override one in the home directory.} The \texttt{.Rprofile} is intended to customize the working \R environment, but this should be done with some care, particularly if the user is working as a member of a team and has to share \R scripts. It is very easy to make \R scripts that work in some customized contexts but fail in puzzling ways elsewhere where the customizations are different. For users who will want to use \texttt{SOAR} in most sessions it is inconvenient to have to remember to put \texttt{library(SOAR)} at the head of every script. One way round this is to add the line \begin{verbatim} options(defaultPackages = c(getOption("defaultPackages"), "SOAR")) \end{verbatim} to \verb|~/.Rprofile|. This will mean that \texttt{SOAR} is included in the list of packages to be loaded at startup. A more conditional way to add \texttt{SOAR} automatically to the search path is to use \R autoloads. If we add lines such as \begin{verbatim} autoload("Store", "SOAR") autoload("Objects", "SOAR") autoload("Ls", "SOAR") autoload("Attach", "SOAR") autoload("Remove", "SOAR") autoload("Search", "SOAR") \end{verbatim} to \verb|~/.Rprofile|, then if any of these five functions is used, the \texttt{SOAR} package is automatically attached to the search path, no questions asked. This is done by placing dummy \texttt{autoload} objects in the \texttt{Autoload} entry on the search path. These are promises like delayed assignments, but rather than loading objects into memory on demand, they attach the specified package on to the search path when the function in question is invoked. We can add further lines to \verb|~/.Rprofile| such as: \begin{verbatim} SOAR::AttachUtils() \end{verbatim} which will ensure that the central utilities cache is part of the search path at startup, \emph{without} attaching the \texttt{SOAR} package itself (though it is ``loaded'' rather than being attached). We need the double colon construction here to let the interpreter know where to fine the function \texttt{AttachUtils}. Note that once a cache has been attached to the search path it is not necessary for the \texttt{SOAR} package also to be attached for it to work. With the autoloads in place, however, an invocation of any of the five functions will cause the package to be attached. For the more intrepid user it is easy to make and add a full list of possible autoloads for any package and add it to \verb|~/.Rprofile| from within \R itself using, for example: @ <>= if(require("SOAR")) { lst <- paste('autoload("', objects("package:SOAR"), '", "SOAR")\n', sep="") cat("\n", lst, sep="", file = "~/.Rprofile", append = TRUE) } @ %def Now using \emph{any} (exported) function from \texttt{SOAR} would cause the package to be loaded automatically, if not already. (This sort of dodge can rapidly cause \texttt{.Rprofile} to become very long and unwieldy, of course!) \section{Tips, warnings and gratuitous advice} \label{sec:gratuitous} \subsection{Scripts or saved data objects?} \label{sec:scripts} While it is convenient to carry objects over from one session to another, particularly during the period where an analysis is being developed, it can be a mistake to rely on \R data objects gradually severing the link with the primary sources of the data. We would encourage users to make and keep scripts which construct all important data sets and analyses from primary sources and to be able to re-construct the entire process from them. Stored object caches are a convenience intended to provide a way of managing memory, primarily, but also for sharing objects between sessions, and between projects. Objects placed in the central utilities directory would usually include functions on test prior to collecting them into coherent groups and making packages from them. As yet there is no neat way to document the objects in stored object caches, which is one reason to aim for packages as a more satisfactory and permanent way of holding such information. \subsection{Limited lifetime of some \texttt{.RData}-saved objects} \label{sec:limit-lifet-some} We should also note that for some kinds of objects, \texttt{.RData} file versions may not remain viable indefinitely. This is particularly true for \emph{fitted model objects}. Frequently packages can be updated in a way that requires a change to the structure of fitted model objects. In this case, while scripts should remain workable, saved fitted model objects created under previous versions of the software will usually not. (The spineless option of never updating \texttt{\textbf{R}} or any of the add-on packages is also to be discouraged!) \subsection{\textsc{Windows} and other traps} \label{sec:traps} It is important to realise the \texttt{lib} and \texttt{lib.loc} both specify directories, or `folders' for the operating system. Some operating systems have rather arbitrary and sometimes arcane restrictions on the file names which are allowed. Consider the following artificial example, due to Nick Ellis: @ <>= x <- 1 Store(x, lib = ".A") x <- 2 Store(x, lib = ".a") @ %def On most operating systems this will, if necessary, create two local caches named \texttt{".A"} and \texttt{".a"} and store the value \texttt{1} for \texttt{x} in the first and the value \texttt{2} for \texttt{x} in the second. On the search path the second will sit ahead of the first, of course. On \textsc{Windows}, because file and folder names are \emph{case insensitive} the result will be quite different. If the local cache \texttt{".A"} is not already in existence the first \texttt{Store} will create it and cache the value \texttt{1} for \texttt{x} in it, as expected. The second \texttt{Store} will then attach a cache \texttt{".a"} to the search path and store the value \texttt{2} for \texttt{x}. Because of case insensitivity, however, the cache \texttt{".A"} and \texttt{".a"} will define \emph{\textbf{the same cache folder}}, and hence the second \texttt{Store} will replace the value for \texttt{x} stored in the first \texttt{Store} operation. There will be two entries on the search path labelled \texttt{".A"} and \texttt{".a"}, but they will effectively point to the same cache. One way round this might have been to encode the folder name for the cache on the disc, but as these cache names are used in code and are to be seen, occasionally, by the user, having a disparity between what the user sees in the code and what she or he sees in the file system will introduce another potential difficulty. Working with multiple local caches in the same working directory is a perfectly natural and often useful thing to do. Users working on systems \emph{other than \textsc{Windows}} can do so with \emph{relative} impunity, but \textsc{Windows} users need to be constantly aware of the limitations placed on file names in that operating system, and their consequences. The problem is not even entirely confined to \textsc{Windows}. For example on some, but not all, external memory devices and memory sticks a case insensitive file name system seems to be imposed, even under \textsc{Linux}.\footnote{On such devices, however, it is still possible to coin file names under \textsc{Linux}, for example, that are illegal names under \textsc{Windows}, such as the example we often cite here, ``\texttt{con}''. Such files then become inaccessible when the external memory device is used with a \textsc{Windows} operating system.} To minimise the possibility of this kind of adverse outcome, then, we strongly recommend that on \emph{all} operating systems users adopt some cache naming protocol that will guard against it. For example, the default names for caches are all ``capitalized'', such as \verb|.R_Cache| or \verb|.R_Utils|. If caches are \emph{always} named and referred to by such a scheme, the problem of case insensitivity will not arise. \newpage \appendix \section{Some technical details} \label{sec:technical} \paragraph*{Structure of the cache.} A cache directory consists of \texttt{.RData} files each corresponding to a single stored \R object. The name of the file is related to the name of the object itself, but is \emph{encoded} as some file systems have restrictions on file names. For example in \textsc{Windows} file names are case insensitive whereas in \R object names are case sensitive. Details of the encoding are given below. Users are strongly advised \emph{not} to access the files in the cache directory other than through \R. Manual changes to the cache, and in particular, extra files or sub-directories in the cache directory will almost certainly cause problems when the cache is used again by the \texttt{SOAR} package, from which recovery may be very difficult. \paragraph*{Operation of the cache.} When an existing cache is attached to the search path, as may be done explicitly by \texttt{Attach} or implicitly with \texttt{Store}, \texttt{Objects} or \texttt{Remove}, the following steps take place. \begin{itemize} \item The names of the files in the cache directory, apart from ``\texttt{.}'' and ``\texttt{..}'', are decoded into object names, \item An initially empty \R environment is attached to the search path, into which objects of the same name as those stored in the cache are placed. These objects are promises, created by calls to \texttt{delayedAssign}, to load the corresponding entire object into the environment on demand. \end{itemize} Any reference to an object in the cache thus precipitates a \texttt{load} operation, bringing the entire object into central memory and replacing the promise in the attached cache. Any change to an object in the cache causes a further copy of it to be loaded into the appropriate working environment to accept the changes. If, for example, the object is changed at the command line, this extra copy will be in the global environment. Using \texttt{Attach} to re-attach a cache will reinstate all objects as promises again, thus freeing any memory that has been taken up by automatic loading of entire objects. Storing an object with \texttt{Store} will (by default) remove it from the current working environment, store it in the cache directory and reinstate the object in the attached cache as a promise. Removing objects from a cache using \texttt{Remove} clears both the promise from the attached cache \emph{and} the corresponding \texttt{.RData} file from the cache directory. Thus the removal is permanent. \paragraph*{Birth and death of caches.} If reference is made by any of the four main functions to a cache that does not presently exist, the cache directory is created (or, more precisely \emph{will be created} when any object is stored there) and an empty cache is attached to the search path. If, however, \texttt{Attach} is used for this purpose a warning is issued that an empty cache has been attached. This is because it is never necessary to create a cache with \texttt{Attach}, so the user has most likely made a typo. If all objects are removed from a cache using, for example, @ <>= Remove(Objects()) @ %def the cache directory is \emph{not} removed. Removing empty cache directories, if need be, should be done using normal file system operations outside \R itself. \paragraph*{File name encoding.} The names for \texttt{.RData} files in the cache directory are encoded from the names of the objects themselves as follows: \begin{itemize} \item We assume that object names consist only of printable characters. \item Lower case letters are encoded as themselves. \item Upper case letters are encoded by preceding them with an \verb|@| character. \item There are 10 other characters which are known to be problematical if used in file names on some operating systems. These are encoded as \verb|@0|, \dots, \verb|@9|. The correspondence is as shown below in \R code output: @ <>= bad <- c(" ", "<", ">", ":", "\"", "/", "\\", "|", "?", "*") rpl <- paste("@", 0:9, sep = "") out <- rbind("Code:" = rpl, "Character:" = bad) colnames(out) <- rep(" ", 10) noquote(format(out, justify = "right")) rm(bad, rpl, out) @ %def \item The \verb|@| character itself is encoded as \verb|@@|. \item All other printable characters, including the digits, are encoded as themselves. \item Finally the extension tag ``\verb|@.RData|'' is added to the name without further \verb|@|-modification.\footnote{As well as being useful, this turns out to be necessary on \textsc{Windows} where some very simple file names, such as e.g.\ ``\texttt{con}'' are actually illegal, \emph{even if given an extension}.} \end{itemize} This rather simple encoding has proved to be adequate for all genuine cases, at least in UTF-8 locales. It has the virtue that most \R objects in the cache can be easily recognised from the file name at a glance, which was a distinct advantage during debugging. \section{Some packages with a similar functionality} \label{sec:similarPackages} As mentioned previously, David Brahm's \texttt{g.data} package (\cite{Rnews:Brahm:2002}) was antecedent to the present package. It offers effectively the same functionality as \texttt{SOAR}, but the usage is rather different. Users may wish to compare the two. \subsection{The \texttt{filehash} package} \label{sec:filehash-package} Roger Peng's package \texttt{filehash}, which provides \R with hash files also allows objects to be stored on the disc and recalled automatically as needed. It uses the \texttt{makeActiveBinding}\footnote{See \texttt{help("makeActiveBinding")} or \texttt{?makeActiveBinding} in an \R session for more information.} mechanism alongside the \texttt{delayedAssign} mechanism used by \texttt{SOAR} and \texttt{g.data}, which has some advantages, but at a slightly increased overhead time cost. There are strengths and weaknesses in both approaches and future versions of SOAR may offer a \texttt{makeActiveBinding} mechanism as an alternative to \texttt{delayedAssign} particularly for very large objects. For a discussion of the \texttt{filehash} package, see~\cite{Rnews:Peng:2006}. \subsection{The \texttt{mvbutils} package} \label{sec:mvbutils-package} Mark Bravington's package \texttt{mvbutils}, (See~\cite{bravington2011}), also offers a \texttt{makeActiveBinding} mechanism to cache objects through the function \texttt{mlazy}. Users should consult the help information for \texttt{mvbutils} for further details. This package has not yet been formally described in any published article, but his article on his \texttt{debug} package, originally part of \texttt{mvbutils}, can be found in \emph{R News}. See~\cite{Rnews:Bravington:2003}). \subsection{The packages \texttt{track} and \texttt{track.ff}} \label{sec:pack-track-track.ff} Tony Plate's packages \texttt{track} (on CRAN; see~\cite{plate2011}) and \texttt{track.ff} (at the time of writing still under development but available on R-Forge; see~\cite{plate2011a}) provide yet another protocol for holding objects on disc and automatically ``track''-ing them during computations. Like the \texttt{mvtutils} function \texttt{mlazy}, it uses the ``active binding'' mechanism to ensure that, when tracking is turned on, the objects in an environment are permanently held out of memory and not only retrieved when needed, as in the case of \texttt{SOAR}, but automatically returned to the after any modifications are made. In effect the objects are kept, as much as possible, \emph{permanently} on the disc as was the case in the original \textbf{S} computation model (and hence \textbf{S-PLUS}). The environment in which reference to them are mede, however, still retains the names of the objects as if present. The main differences in usage with \texttt{SOAR} are then \begin{itemize} \item With \texttt{SOAR} objects are manually nominated for out of memory storage and brought back into memory if an when needed. It is the responsibility of the user to return them to out of memory storage when no longer needed. With \texttt{track}, by default, all objects are effectively permanently stored out of memory when the mechanism is turned on. This is more convenient than the \texttt{SOAR} mechanism, but incurs a slight overhead in performance. \item With \texttt{SOAR}, out of memory stored objects are made visible typically at postion~2 of the search path. This convenient for allowing, for example, the global environment to remain relatively uncluttered while still working with large numbers of objects. With \texttt{track} the stored objects all remain visible within the same environment. Whether this is a positive or a negative feature is largely a matter of taste. \item With \texttt{SOAR}, if a stored object is modified, the modified version will be held in the working environment \emph{in addition to} the retrieved copy from the cache, (unless the cache is manually re-attached). This allows trial or temporary modifications to be made to an object, retaining a backup copy until the changed version is itself stored, but at the cost of possibly keeping two versions in memory. With \texttt{track} changes made to objects are permanent as the changed versions are automatically returned to the cache as soon as feasible\footnote{This difference is mostly a consequence of the previous two, of course.}. \end{itemize} \subsection{The \texttt{R.cache} packege} \label{sec:r.cache-packege} Henrik Bengtsson has kindly drawn to my attention the \texttt{R.cache} package, (see~\cite{Bengtsson2011}), which offers yet another related method of working with objects out of memory in collections called \emph{caches}, as do we. The usage of \texttt{R.cache} is rather different from that of \texttt{SOAR}, though. It is more specialised for making possible operations on very large objects, whereas \texttt{SOAR} has the management of large numbers of small objects as an important, if secondary design goal. Users may wish to compare. For the special problem of dealing with large, very large or even huge objects, there is a special section in the \R task view on high performance computing called \emph{Large memory and out-of-memory data} which focuses on the topic and lists many more key packages. (Task views can be found on the \textsc{CRAN} website \url{cran.r-project.org}.) % \section{Links with \texttt{ASOR}: help for old friends} % \label{sec:ASOR} % A precursor to the \texttt{SOAR} package was the package % \texttt{ASOR}, which was never released through CRAN, but has been in % fairly widespread trial use for some time. The name-change was made % for the officially released package to draw attention to the fact that % there are some important differences between it and \texttt{ASOR}, % though there is a large degree of backward % compatibility.\footnote{Another reason to change the name is that % speakers with a sufficiently broad Australian accent used to % pronounce the old package name as ``eyesore''.} % The main differences with \texttt{ASOR} are as follows. % \begin{itemize} % \item There has been an extension to the file name encoding, as % described in Appendix~\ref{sec:technical} above. This was needed to % overcome some deficiencies in the old encoding leading to failures. % For example objects such as \verb|con| and \verb|foo<-| can now be % cached on \textsc{Windows} which was not the case under the % \texttt{ASOR} encoding. % \textbf{IMPORTANT NOTE:} If a cache created under \texttt{ASOR} is % attached, directly or indirectly with \texttt{SOAR}, \emph{the file % names will be re-encoded under the new scheme}, with a warning % that this is taking place. At this point the cache will not be % readable with \texttt{ASOR} functions: there is no easy road back. % However this is a quick and simple operation and has proved to be % very reliable. Nevertheless users should take to heart the advice % given in sub-section~\ref{sec:scripts} on page~\pageref{sec:scripts} % and make sure that they have scripts available to re-create all % important \R objects rather than relying solely on stored object % caches. % \item There has been a change to the default local stored object cache % from \verb|.R_Store| to \verb|.R_Cache|. This is partly to reflect % the change in terminology, but also to make it easier for people to % operate with \texttt{ASOR} for a while longer while feeling their % way with \texttt{SOAR}, if they so wish. It could become very % confusing if both \texttt{ASOR} and \texttt{SOAR} packages were in % use in the same \R session. Users are warned against this. % There has been effectively no change to the default \texttt{lib} % names for the centrally stored object caches.\footnote{The % \texttt{Data} and \texttt{Utils} variants did exist in % \texttt{ASOR} but were largely unadvertized features and to the % author's knowledge, the author was the only person to use them.} % \item There has been a change to the way the default \texttt{lib} and % \texttt{lib.loc} names are specified, now using environment % variables. This is a more flexible system than the previous one and % offers a way for users to prescribe their own preferences in this % regard in a simple and global way. % \item The functions \texttt{Save}, \texttt{SaveData} and % \texttt{SaveUtils} have been removed. These were complete aliases % for the corresponding forms with \texttt{Store}. This is because % there is already a function \texttt{Save} in the \texttt{Hmisc} % package.\footnote{Originally \texttt{ASOR} only had the % \texttt{Save} forms, but the \texttt{Store} forms were added as a % preferable alternative when the author became aware of the clash % with \texttt{Hmisc}. The \texttt{Save} forms have now passed % effectively from deprecated to defunct.} % \item The function \texttt{Search} has been added, mainly to provide a % way for users to separate multiple stored object caches on the % search path with the same primary name, but with different % \texttt{lib.loc}s. % \end{itemize} \bibliographystyle{plain} \bibliography{refs} \end{document} % LocalWords: backtick customisation recognised