[R-pkgs] new package 'trackObjs' - mirror objects to files, provide summaries & modification times
Tony Plate
tplate at acm.org
Mon Sep 10 23:16:28 CEST 2007
From ?trackObjs:
Overview of trackObjs package
Description:
The trackObjs package sets up a link between R objects in memory
and files on disk so that objects are automatically resaved to
files when they are changed. R objects in files are read in on
demand and do not consume memory prior to being referenced. The
trackObjs package also tracks times when objects are created and
modified, and caches some basic characteristics of objects to
allow for fast summaries of objects.
Each object is stored in a separate RData file using the standard
format as used by 'save()', so that objects can be manually picked
out of or added to the trackObjs database if needed.
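Because each file is an ordinary 'save()' file, a single object can be pulled out with plain 'load()'. The following is a minimal sketch in base R only. The directory name and the 'y.rda' file name are assumptions for illustration; the package keeps its own mapping from object names to files, which may differ.

```r
# Sketch only: base R, no trackObjs functions are used.
# 'dir_path' and the file name "y.rda" are hypothetical.
dir_path <- file.path(tempdir(), "tmp1_sketch")
dir.create(dir_path, showWarnings = FALSE)

# Write one object to its own file in the format used by save()
y <- matrix(1:6, ncol = 2)
save(y, file = file.path(dir_path, "y.rda"))

# Pick the object back out into a scratch environment, so that
# nothing in the calling environment is overwritten by load()
picked <- new.env()
loaded_names <- load(file.path(dir_path, "y.rda"), envir = picked)
```

Loading into a scratch environment rather than the workspace avoids clobbering a variable of the same name.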
Tracking works by replacing a tracked variable by an
'activeBinding', which when accessed looks up information in an
associated 'tracking environment' and reads or writes the
corresponding RData file and/or gets or assigns the variable in
the tracking environment.
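The mechanism described above can be illustrated with base R's 'makeActiveBinding()'. This is a rough sketch under stated assumptions, not the trackObjs implementation: 'sketch_track', 'work_env' and 'track_dir' are hypothetical names, and the sketch always goes to disk (no caching, no summary bookkeeping).

```r
# Sketch only: a file-backed active binding built from base R.
track_dir <- file.path(tempdir(), "sketch_track")  # hypothetical directory
dir.create(track_dir, showWarnings = FALSE)
work_env <- new.env()  # stands in for the environment being tracked

sketch_track <- function(name, value, env = work_env) {
  obj_file <- file.path(track_dir, paste0(name, ".rda"))
  # Save 'v' under the variable's own name, as save() would store it
  write_obj <- function(v) {
    tmp <- new.env()
    assign(name, v, envir = tmp)
    save(list = name, envir = tmp, file = obj_file)
  }
  write_obj(value)
  # The binding function is called with no argument on read
  # and with the new value on assignment
  makeActiveBinding(name, function(v) {
    if (missing(v)) {
      tmp <- new.env()
      load(obj_file, envir = tmp)
      get(name, envir = tmp)
    } else {
      write_obj(v)
      invisible(v)
    }
  }, env)
}

sketch_track("x", 123)
first_read <- work_env$x  # reading triggers load() from x.rda
work_env$x <- 456         # assigning triggers save() to x.rda
```

Every access pays for a disk read in this sketch, which is the cost the caching option is meant to reduce.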
Details:
There are three main reasons to use the 'trackObjs' package:
* conveniently handle many moderately-large objects that would
collectively exhaust memory or be inconvenient to manage in
files by manually using 'save()' and 'load()'
* keep track of creation and modification times on objects
* get fast summaries of basic characteristics of objects -
class, size, dimension, etc.
There is an option to control whether tracked objects are cached
in memory as well as being stored on disk. By default, objects
are not cached. To save time when working with collections of
objects that will all fit in memory, turn on caching with
'track.options(cache=TRUE)', or start tracking with
'track.start(..., cache=TRUE)'.
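The trade-off behind the cache option can be seen in a small base R sketch. It is only an illustration of the idea: 'read_tracked', 'cache_env' and the read counter are hypothetical and are not part of the package.

```r
# Sketch only: count disk reads with and without an in-memory cache.
cache_env <- new.env()   # stands in for the tracking environment
disk_reads <- 0L
obj_file <- tempfile(fileext = ".rda")
big <- runif(1000)
save(big, file = obj_file)

read_tracked <- function(cache = FALSE) {
  # With caching on, reuse the in-memory copy if there is one
  if (cache && exists("big", envir = cache_env, inherits = FALSE)) {
    return(get("big", envir = cache_env))
  }
  tmp <- new.env()
  load(obj_file, envir = tmp)
  disk_reads <<- disk_reads + 1L
  if (cache) assign("big", tmp$big, envir = cache_env)
  tmp$big
}

for (i in 1:3) invisible(read_tracked(cache = FALSE))
uncached_reads <- disk_reads  # one read per access

disk_reads <- 0L
for (i in 1:3) invisible(read_tracked(cache = TRUE))
cached_reads <- disk_reads    # only the first access reads the file
```

Without caching, memory use stays low but each access reads the file again; with caching, the object stays in memory after its first use.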
Here is a brief example of tracking some variables in the global
environment:
> library(trackObjs)
> track.start("tmp1")
> x <- 123 # Not yet tracked
> track(x) # Variable 'x' is now tracked
> track(y <- matrix(1:6, ncol=2)) # 'y' is assigned & tracked
> z1 <- list("a", "b", "c")
> z2 <- Sys.time()
> track(list=c("z1", "z2")) # Track a bunch of variables
> track.summary(size=F) # See a summary of tracked vars
   class          mode    extent length modified            TA TW
x  numeric        numeric [1]    1      2007-09-07 08:50:58 0  1
y  matrix         numeric [3x2]  6      2007-09-07 08:50:58 0  1
z1 list           list    [[3]]  3      2007-09-07 08:50:58 0  1
z2 POSIXt,POSIXct numeric [1]    1      2007-09-07 08:50:58 0  1
> # (TA="total accesses", TW="total writes")
> ls(all=TRUE)
[1] "x" "y" "z1" "z2"
> track.stop() # Stop tracking
> ls(all=TRUE)
character(0)
>
> # Restart using the tracking dir -- the variables reappear
> track.start("tmp1") # Start using the tracking dir again
> ls(all=TRUE)
[1] "x" "y" "z1" "z2"
> track.summary(size=F)
   class          mode    extent length modified            TA TW
x  numeric        numeric [1]    1      2007-09-07 08:50:58 0  1
y  matrix         numeric [3x2]  6      2007-09-07 08:50:58 0  1
z1 list           list    [[3]]  3      2007-09-07 08:50:58 0  1
z2 POSIXt,POSIXct numeric [1]    1      2007-09-07 08:50:58 0  1
> track.stop()
>
> # the files in the tracking directory:
> list.files("tmp1", all=TRUE)
[1] "." ".."
[3] "filemap.txt" ".trackingSummary.rda"
[5] "x.rda" "y.rda"
[7] "z1.rda" "z2.rda"
>
There are several points to note:
* The global environment is the default environment for
tracking - it is possible to track variables in other
environments, but that environment must be supplied as an
argument to the track functions.
* Variables must be explicitly 'track()'ed - newly created objects
are not tracked. (This is not a "feature", but there is
currently no way of automatically tracking newly created
objects - this is on the wishlist.) Thus, it is possible
for variables in a tracked environment to be either tracked or
untracked.
* When tracking is stopped, all tracked variables are saved on
disk and will no longer be accessible until tracking is
started again.
* The objects are stored each in their own file in the
tracking dir, in the format used by 'save()'/'load()' (RData
files).
List of basic functions and common calling patterns:
Six functions cover the majority of common usage of the trackObjs
package:
* 'track.start(dir=...)': start tracking the global
environment, with files saved in 'dir'
* 'track.stop()': stop tracking (any unsaved tracked variables
are saved to disk and all tracked variables become
unavailable until tracking starts again)
* 'track(x)': start tracking 'x' - 'x' in the global
environment is replaced by an active binding and 'x' is
saved in its corresponding file in the tracking directory
and, if caching is on, in the tracking environment
* 'track(x <- value)': assign 'value' to 'x' and start tracking 'x'
* 'track(list=c('x', 'y'))': start tracking specified
variables
* 'track(all=TRUE)': start tracking all untracked variables in
the global environment
* 'untrack(x)': stop tracking variable 'x' - the R object 'x'
is put back as an ordinary object in the global environment
* 'untrack(all=TRUE)': stop tracking all variables in the
global environment (but tracking is still set up)
* 'untrack(list=...)': stop tracking specified variables
* 'track.summary()': print a summary of the basic
characteristics of tracked variables: name, class, extent,
and creation, modification and access times.
* 'track.remove(x)': completely remove all traces of 'x' from
the global environment, tracking environment and tracking
directory. Note that if variable 'x' in the global
environment is tracked, 'remove(x)' will make 'x' an
"orphaned" variable: 'remove(x)' will just remove the active
binding from the global environment, and leave 'x' in the
tracking environment and on file, and 'x' will reappear after
restarting tracking.
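The "orphaned" variable case can be reproduced with a plain active binding in base R. This is a sketch of the general behaviour only: 'store' and 'work' are hypothetical stand-ins for the tracking environment and the global environment, and no trackObjs functions are used.

```r
# Sketch only: removing an active binding does not touch its backing store.
store <- new.env()  # stands in for the tracking environment
work <- new.env()   # stands in for the global environment

assign("x", 123, envir = store)
makeActiveBinding("x", function(v) {
  if (missing(v)) get("x", envir = store) else assign("x", v, envir = store)
}, work)

# Equivalent in spirit to calling remove(x) on a tracked variable
rm("x", envir = work)

binding_left <- exists("x", envir = work, inherits = FALSE)
value_left <- exists("x", envir = store, inherits = FALSE)
```

After 'rm()', the binding is gone from 'work' but the value is still held in 'store', which is why a dedicated removal function is needed to clear every copy.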
Complete list of functions and common calling patterns:
The 'trackObjs' package provides many additional functions for
controlling how tracking is performed (e.g., whether or not
tracked variables are cached in memory), examining the state of
tracking (show which variables are tracked, untracked, orphaned,
masked, etc.) and repairing tracking environments and databases
that have become inconsistent or incomplete (this may result from
resource limitations, e.g., being unable to write a save file due
to lack of disk space, or from manual tinkering, e.g., dropping a
new save file into a tracking directory).
[truncated here -- see ?trackObjs]
-- Tony Plate
PS: to give credit where due, the end of ?trackObjs says:
References:
Roger D. Peng. Interacting with data using the filehash package. R
News, 6(4):19-24, October 2006.
'http://cran.r-project.org/doc/Rnews' and
'http://sandybox.typepad.com/software'
David E. Brahm. Delayed data packages. R News, 2(3):11-12,
December 2002. 'http://cran.r-project.org/doc/Rnews'
See Also:
[...]
Inspiration from the packages 'g.data' and 'filehash'.