[Rd] Comments requested on "changedFiles" function

Scott Kostyshak skostysh at princeton.edu
Thu Sep 5 05:36:45 CEST 2013


On Wed, Sep 4, 2013 at 1:53 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> In a number of places internal to R, we need to know which files have
> changed (e.g. after building a vignette).  I've just written a general
> purpose function "changedFiles" that I'll probably commit to R-devel.
> Comments on the design (or bug reports) would be appreciated.
>
> The source for the function and the Rd page for it are inline below.

This looks like a useful function. Thanks for writing it. I have only
one (picky) comment below.

> ----- changedFiles.R:
> changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
> file.info = NULL,
>              md5sum = FALSE, full.names = FALSE, ...) {
>     dosnapshot <- function(args) {
>         fullnames <- do.call(list.files, c(full.names = TRUE, args))
>         names <- do.call(list.files, c(full.names = full.names, args))
>         if (isTRUE(file.info) || (is.character(file.info) &&
> length(file.info))) {
>             info <- file.info(fullnames)
>         rownames(info) <- names
>             if (isTRUE(file.info))
>                 file.info <- c("size", "isdir", "mode", "mtime")
>         } else
>             info <- data.frame(row.names=names)
>     if (md5sum)
>         info <- data.frame(info, md5sum = tools::md5sum(fullnames))
>     list(info = info, timestamp = timestamp, file.info = file.info,
>          md5sum = md5sum, full.names = full.names, args = args)
>     }
>     if (missing(snapshot) || !inherits(snapshot, "changedFilesSnapshot")) {
>         if (length(timestamp) == 1)
>             file.create(timestamp)
>         if (missing(snapshot)) snapshot <- "."
>         pre <- dosnapshot(list(path = snapshot, ...))
>         pre$pre <- pre$info
>         pre$info <- NULL
>         pre$wd <- getwd()
>         class(pre) <- "changedFilesSnapshot"
>         return(pre)
>     }
>
>     if (missing(timestamp)) timestamp <- snapshot$timestamp
>     if (missing(file.info) || isTRUE(file.info)) file.info <-
> snapshot$file.info
>     if (identical(file.info, FALSE)) file.info <- NULL
>     if (missing(md5sum))    md5sum <- snapshot$md5sum
>     if (missing(full.names)) full.names <- snapshot$full.names
>
>     pre <- snapshot$pre
>     savewd <- getwd()
>     on.exit(setwd(savewd))
>     setwd(snapshot$wd)
>
>     args <- snapshot$args
>     newargs <- list(...)
>     args[names(newargs)] <- newargs
>     post <- dosnapshot(args)$info
>     prenames <- rownames(pre)
>     postnames <- rownames(post)
>
>     added <- setdiff(postnames, prenames)
>     deleted <- setdiff(prenames, postnames)
>     common <- intersect(prenames, postnames)
>
>     if (length(file.info)) {
>         preinfo <- pre[common, file.info]
>         postinfo <- post[common, file.info]
>         changes <- preinfo != postinfo
>     }
>     else changes <- matrix(logical(0), nrow = length(common), ncol = 0,
>                            dimnames = list(common, character(0)))
>     if (length(timestamp))
>         changes <- cbind(changes, Newer = file_test("-nt", common,
> timestamp))
>     if (md5sum) {
>         premd5 <- pre[common, "md5sum"]
>         postmd5 <- post[common, "md5sum"]
>     changes <- cbind(changes, md5sum = premd5 != postmd5)
>     }
>     changes1 <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop = FALSE]
>     changed <- rownames(changes1)
>     structure(list(added = added, deleted = deleted, changed = changed,
>         unchanged = setdiff(common, changed), changes = changes), class =
> "changedFiles")
> }
>
> print.changedFilesSnapshot <- function(x, ...) {
>     cat("changedFiles snapshot:\n timestamp = \"", x$timestamp, "\"\n
> file.info = ",
>         if (length(x$file.info)) paste(paste0('"', x$file.info, '"'),
> collapse=","),
>         "\n md5sum = ", x$md5sum, "\n args = ", deparse(x$args, control =
> NULL), "\n", sep="")
>     x
> }
>
> print.changedFiles <- function(x, ...) {
>     if (length(x$added)) cat("Files added:\n",  paste0("  ", x$added,
> collapse="\n"), "\n", sep="")
>     if (length(x$deleted)) cat("Files deleted:\n",  paste0("  ", x$deleted,
> collapse="\n"), "\n", sep="")
>     changes <- x$changes
>     changes <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop=FALSE]
>     changes <- changes[, colSums(changes, na.rm = TRUE) > 0, drop=FALSE]
>     if (nrow(changes)) {
>         cat("Files changed:\n")
>         print(changes)
>     }
>     x
> }
> ----------------------
>
> --- changedFiles.Rd:
> \name{changedFiles}
> \alias{changedFiles}
> \alias{print.changedFiles}
> \alias{print.changedFilesSnapshot}
> \title{
> Detect which files have changed
> }
> \description{
> On the first call, \code{changedFiles} takes a snapshot of a selection of
> files.  In subsequent
> calls, it takes another snapshot, and returns an object containing data on
> the
> differences between the two snapshots.  The snapshots need not be the same
> directory;
> this could be used to compare two directories.
> }
> \usage{
> changedFiles(snapshot, timestamp = tempfile("timestamp"), file.info = NULL,
>              md5sum = FALSE, full.names = FALSE, ...)
> }
> \arguments{
>   \item{snapshot}{
> The path to record, or a previous snapshot.  See the Details.
> }
>   \item{timestamp}{
> The name of a file to write at the time the initial snapshot
> is taken.  In subsequent calls, modification times of files will be compared
> to
> this file, and newer files will be reported as changed.  Set to \code{NULL}
> to skip this test.
> }
>   \item{file.info}{
> A vector of columns from the result of the \code{file.info} function, or a
> logical value.  If
> \code{TRUE}, columns \code{c("size", "isdir", "mode", "mtime")} will be
> used.  Set to
> \code{FALSE} or \code{NULL} to skip this test.  See the Details.
> }
>   \item{md5sum}{
> A logical value indicating whether MD5 summaries should be taken as part of
> the snapshot.
> }
>   \item{full.names}{
> A logical value indicating whether full names (as in
> \code{\link{list.files}}) should be
> recorded.
> }
>   \item{\dots}{
> Additional parameters to pass to \code{\link{list.files}} to control the set
> of files
> in the snapshots.
> }
> }
> \details{
> This function works in two modes.  If the \code{snapshot} argument is
> missing or is
> not of S3 class \code{"changedFilesSnapshot"}, it is used as the \code{path}
> argument
> to \code{\link{list.files}} to obtain a list of files.  If it is of class
> \code{"changedFilesSnapshot"}, then it is taken to be the baseline file
> and a new snapshot is taken and compared with it.  In the latter case,
> missing
> arguments default to match those from the initial snapshot.
>
> If the \code{timestamp} argument is length 1, a file with that name is
> created
> in the current directory during the initial snapshot, and
> \code{\link{file_test}}
> is used to compare the age of all files to it during subsequent calls.
>
> If the \code{file.info} argument is \code{TRUE} or it contains a non-empty
> character vector, the indicated columns from the result of a call to
> \code{\link{file.info}} will be recorded and compared.
>
> If \code{md5sum} is \code{TRUE}, the \code{tools::\link{md5sum}} function
> will be called to record the 32 byte MD5 checksum for each file, and these
> values
> will be compared.
> }
> \value{
> In the initial snapshot phase, an object of class
> \code{"changedFilesSnapshot"} is returned.  This
> is a list containing the fields
> \item{pre}{a dataframe whose rownames are the filenames, and whose columns
> contain the
> requested snapshot data}
> \item{timestamp, file.info, md5sum, full.names}{a record of the arguments in
> the initial call}
> \item{args}{other arguments passed via \code{...} to
> \code{\link{list.files}}.}
>
> In the comparison phase, an object of class \code{"changedFiles"}. This is a
> list containing
> \item{added, deleted, changed, unchanged}{character vectors of filenames
> from the before
> and after snapshots, with obvious meanings}
> \item{changes}{a logical matrix with a row for each common file, and a
> column for each
> comparison test.  \code{TRUE} indicates a change in that test.}
>
> \code{\link{print}} methods are defined for each of these types. The
> \code{\link{print}} method for \code{"changedFilesSnapshot"} objects
> displays the arguments used to produce it, while the one for
> \code{"changedFiles"} displays the \code{added}, \code{deleted}
> and \code{changed} fields if non-empty, and a submatrix of the
> \code{changes}
> matrix containing all of the \code{TRUE} values.
> }
> \author{
> Duncan Murdoch
> }
> \seealso{
> \code{\link{file.info}}, \code{\link{file_test}}, \code{\link{md5sum}}.
> }
> \examples{
> # Create some files in a temporary directory
> dir <- tempfile()
> dir.create(dir)

Should a different name than 'dir' be used since 'dir' is a base function?

Further, if someone is not very familiar with R (or just not in "R
mode" at the time of reading), they might think that 'dir.create' is
calling the create member of the object named 'dir' that you just
made.

Scott

> writeBin(1, file.path(dir, "file1"))
> writeBin(2, file.path(dir, "file2"))
> dir.create(file.path(dir, "dir"))
>
> # Take a snapshot
> snapshot <- changedFiles(dir, file.info=TRUE, md5sum=TRUE)
>
> # Change one of the files
> writeBin(3, file.path(dir, "file2"))
>
> # Display the detected changes
> changedFiles(snapshot)
> changedFiles(snapshot)$changes
> }
> \keyword{utilities}
> \keyword{file}
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


--
Scott Kostyshak
Economics PhD Candidate
Princeton University



More information about the R-devel mailing list