[R-pkgs] new package 'trackObjs' - mirror objects to files, provide summaries & modification times

Tony Plate tplate at acm.org
Mon Sep 10 23:16:28 CEST 2007

 From ?trackObjs:

Overview of trackObjs package


      The trackObjs package sets up a link between R objects in memory
      and files on disk so that objects are automatically resaved to
      files when they are changed.  R objects in files are read in on
      demand and do not consume memory prior to being referenced.  The
      trackObjs package also tracks times when objects are created and
      modified, and caches some basic characteristics of objects to
      allow for fast summaries of objects.

      Each object is stored in a separate RData file using the standard
      format as used by 'save()', so that objects can be manually picked
      out of or added to the trackObjs database if needed.
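
      As a rough illustration of what "one object per RData file" allows,
      the sketch below writes a single object with 'save()' and reads it
      back with 'load()' using only base R.  The directory and file names
      are made up for the example and are not trackObjs internals.

```r
# Sketch: one object per RData file, written with save() and read with load().
# 'demo_db' and 'y.rda' are illustrative names, not trackObjs internals.
dir <- file.path(tempdir(), "demo_db")
dir.create(dir, showWarnings = FALSE)

y <- matrix(1:6, ncol = 2)
save(y, file = file.path(dir, "y.rda"))   # one object, one file
rm(y)

# Load into a scratch environment so nothing in the workspace is overwritten
scratch <- new.env()
loaded <- load(file.path(dir, "y.rda"), envir = scratch)
print(loaded)                             # names of the objects in the file
y_restored <- get(loaded[1], envir = scratch)
```

      Loading into a scratch environment is a cautious way to inspect a
      file by hand.  Note that dropping a new save file into a tracking
      directory can leave the database inconsistent until it is repaired
      (see the last section below).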

      Tracking works by replacing a tracked variable by an
      'activeBinding', which when accessed looks up information in an
      associated 'tracking environment' and reads or writes the
      corresponding RData file and/or gets or assigns the variable in
      the tracking environment.
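
      The mechanism can be sketched with base R's 'makeActiveBinding()'.
      This is a simplified illustration with no caching, not the
      package's actual implementation; 'track_env', 'x_binding',
      'demo_env' and the file name are assumptions made for the example.

```r
# Sketch of the mechanism only; the real package does considerably more.
track_env <- new.env()                    # stands in for the tracking environment
track_env$file <- file.path(tempdir(), "x_demo.rda")   # hypothetical file name

# An active binding function is called with no argument when the variable
# is read, and with one argument (the new value) when it is assigned.
x_binding <- function(value) {
  if (missing(value)) {
    tmp <- new.env()
    load(track_env$file, envir = tmp)     # read from disk on demand
    get("x", envir = tmp)
  } else {
    x <- value
    save(x, file = track_env$file)        # resave whenever the value changes
    invisible(value)
  }
}

demo_env <- new.env()
makeActiveBinding("x", x_binding, demo_env)

demo_env$x <- 123      # triggers the write branch
v <- demo_env$x        # triggers the read branch
```

      Because the value lives only in the file between accesses, it
      takes up no memory until it is referenced.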


      There are three main reasons to use the 'trackObjs' package:

         *  conveniently handle many moderately-large objects that would
            collectively exhaust memory or be inconvenient to manage in
            files by manually using 'save()' and 'load()'

         *  keep track of creation and modification times on objects

         *  get fast summaries of basic characteristics of objects -
            class, size, dimension, etc.
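
      To give a feel for the kind of per-object characteristics that can
      be recorded once and summarized cheaply later, here is a base R
      sketch.  'describe_object' is a hypothetical helper and the columns
      are chosen for illustration; this is not how 'track.summary()' is
      implemented.

```r
# Hypothetical helper: collect a few basic characteristics of one object.
describe_object <- function(name, obj) {
  data.frame(
    name   = name,
    class  = paste(class(obj), collapse = ","),
    mode   = mode(obj),
    length = length(obj),
    size   = as.numeric(object.size(obj)),   # size in bytes
    stringsAsFactors = FALSE
  )
}

objs <- list(x = 123, y = matrix(1:6, ncol = 2), z1 = list("a", "b", "c"))

# One row per object; the table can be consulted later without
# touching the objects themselves.
summary_table <- do.call(rbind, Map(describe_object, names(objs), objs))
print(summary_table[, c("class", "mode", "length")])
```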

      There is an option to control whether tracked objects are cached
      in memory as well as being stored on disk.  By default, objects
      are not cached.  To save time when working with collections of
      objects that will all fit in memory, turn on caching with
      'track.options(cache=TRUE)', or start tracking with
      'track.start(..., cache=TRUE)'.

      Here is a brief example of tracking some variables in the global
      environment:

      > library(trackObjs)
      > track.start("tmp1")
      > x <- 123                  # Not yet tracked
      > track(x)                  # Variable 'x' is now tracked
      > track(y <- matrix(1:6, ncol=2)) # 'y' is assigned & tracked
      > z1 <- list("a", "b", "c")
      > z2 <- Sys.time()
      > track(list=c("z1", "z2")) # Track a bunch of variables
      > track.summary(size=F)     # See a summary of tracked vars
                  class    mode extent length            modified TA TW
      x         numeric numeric    [1]      1 2007-09-07 08:50:58  0  1
      y          matrix numeric  [3x2]      6 2007-09-07 08:50:58  0  1
      z1           list    list  [[3]]      3 2007-09-07 08:50:58  0  1
      z2 POSIXt,POSIXct numeric    [1]      1 2007-09-07 08:50:58  0  1
      > # (TA="total accesses", TW="total writes")
      > ls(all=TRUE)
      [1] "x"  "y"  "z1" "z2"
      > track.stop()              # Stop tracking
      > ls(all=TRUE)
      character(0)
      > # Restart using the tracking dir -- the variables reappear
      > track.start("tmp1") # Start using the tracking dir again
      > ls(all=TRUE)
      [1] "x"  "y"  "z1" "z2"
      > track.summary(size=F)
                  class    mode extent length            modified TA TW
      x         numeric numeric    [1]      1 2007-09-07 08:50:58  0  1
      y          matrix numeric  [3x2]      6 2007-09-07 08:50:58  0  1
      z1           list    list  [[3]]      3 2007-09-07 08:50:58  0  1
      z2 POSIXt,POSIXct numeric    [1]      1 2007-09-07 08:50:58  0  1
      > track.stop()
      > # the files in the tracking directory:
      > list.files("tmp1", all=TRUE)
      [1] "."                    ".."
      [3] "filemap.txt"          ".trackingSummary.rda"
      [5] "x.rda"                "y.rda"
      [7] "z1.rda"               "z2.rda"

      There are several points to note:

         *  The global environment is the default environment for
            tracking - it is possible to track variables in other
            environments, but that environment must be supplied as an
            argument to the track functions.

         *  Vars must be explicitly 'track()'ed - newly created objects
            are not tracked.  (This is not a "feature", but there is
            currently no way of automatically tracking newly created
            objects - this is on the wishlist.)  Thus, it is possible
            for variables in a tracked environment to be either tracked or
            untracked.

         *  When tracking is stopped, all tracked variables are saved on
            disk and will no longer be accessible until tracking is
            started again.

         *  The objects are stored each in their own file in the
            tracking dir, in the format used by 'save()'/'load()' (RData
            files).

List of basic functions and common calling patterns:

      Six functions cover the majority of common usage of the trackObjs
      package:

         *  'track.start(dir=...)': start tracking the global
            environment, with files saved in 'dir'

         *  'track.stop()': stop tracking (any unsaved tracked variables
            are saved to disk and all tracked variables become
            unavailable until tracking starts again)

         *  'track(x)': start tracking 'x' - 'x' in the global
            environment is replaced by an active binding and 'x' is
            saved in its corresponding file in the tracking directory
            and, if caching is on, in the tracking environment

         *  'track(x <- value)': assign 'value' to 'x' and start tracking
            'x'

         *  'track(list=c('x', 'y'))': start tracking specified
            variables

         *  'track(all=TRUE)': start tracking all untracked variables in
            the global environment

         *  'untrack(x)': stop tracking variable 'x' - the R object 'x'
            is put back as an ordinary object in the global environment

         *  'untrack(all=TRUE)': stop tracking all variables in the
            global environment (but tracking is still set up)

         *  'untrack(list=...)': stop tracking specified variables

         *  'track.summary()': print a summary of the basic
            characteristics of tracked variables: name, class, extent,
            and creation, modification and access times.

         *  'track.remove(x)': completely remove all traces of 'x' from
            the global environment, tracking environment and tracking
            directory.   Note that if variable 'x' in the global
            environment is tracked, 'remove(x)' will make 'x' an
            "orphaned" variable: 'remove(x)' will just remove the active
            binding from the global environment, and leave 'x' in the
            tracked environment and on file, and 'x' will reappear after
            restarting tracking.
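
      The "orphaned" variable case can be reproduced in miniature with
      base R alone.  In this sketch 'global_like' and 'track_env' are
      stand-ins for the global and tracking environments; they are
      assumptions for illustration, not objects created by trackObjs.

```r
# Stand-ins for the global environment and the tracking environment
global_like <- new.env()
track_env   <- new.env()

assign("x", 123, envir = track_env)       # the stored value

# The visible 'x' is only an active binding that forwards to track_env
makeActiveBinding("x", function(value) {
  if (missing(value)) get("x", envir = track_env)
  else assign("x", value, envir = track_env)
}, global_like)

rm("x", envir = global_like)              # removes the binding only

visible_after_rm <- exists("x", envir = global_like, inherits = FALSE)
stored_after_rm  <- exists("x", envir = track_env, inherits = FALSE)
```

      After the 'rm()' call the variable is gone from the visible
      environment but its value is still held behind the scenes, which is
      why a dedicated removal function such as 'track.remove(x)' is
      needed to delete every trace.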

Complete list of functions and common calling patterns:

      The 'trackObjs' package provides many additional functions for
      controlling how tracking is performed (e.g., whether or not
      tracked variables are cached in memory), examining the state of
      tracking (show which variables are tracked, untracked, orphaned,
      masked, etc.) and repairing tracking environments and databases
      that have become inconsistent or incomplete (this may result from
      resource limitations, e.g., being unable to write a save file due
      to lack of disk space, or from manual tinkering, e.g., dropping a
      new save file into a tracking directory).

[truncated here -- see ?trackObjs]

-- Tony Plate

PS: to give credit where due, the end of ?trackObjs says:

      Roger D. Peng. Interacting with data using the filehash package. R
      News, 6(4):19-24, October 2006.
      'http://cran.r-project.org/doc/Rnews' and

      David E. Brahm. Delayed data packages. R News, 2(3):11-12,
      December 2002.  'http://cran.r-project.org/doc/Rnews'

See Also:
      Inspiration from the packages 'g.data' and 'filehash'.
