[Rd] Comments requested on "changedFiles" function

Karl Millar kmillar at google.com
Fri Sep 6 08:46:36 CEST 2013


Comments inline:


On Wed, Sep 4, 2013 at 6:10 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>
> On 13-09-04 8:02 PM, Karl Millar wrote:
>>
>> Hi Duncan,
>>
>> I think this functionality would be much easier to use and understand if
>> you split it up the functionality of taking snapshots and comparing them
>> into separate functions.
>
>
> Yes, that's another possibility.  Some more comment below...
>
>
>
>  In addition, the 'timestamp' functionality
>>
>> seems both confusing and brittle to me.  I think it would be better to
>> store file modification times in the snapshot and use those instead of
>> an external file.  Maybe:
>
>
> You can do that, using file.info = "mtime", but the file.info snapshots are quite a bit slower than using the timestamp file (when looking at a big recursive directory of files).


Sorry, I completely failed to explain what I was thinking here.  There
are a number of issues here, but the biggest one is that you're
implicitly assuming that files that get modified will have mtimes that
come after the timestamp file was created.  This isn't always true,
with the most notable exception being if you download a package from
CRAN and untar it, the mtimes are usually well in the past (at least
with GNU tar on a linux system), so this code won't notice that the
files have changed.

It may be a good idea to store the file sizes as well, which would
help prevent false negatives in the (rare IIRC) cases where the
contents have changed but the mtime values have not.  Since you
already need to call file.info() to get the mtime, this shouldn't
increase the runtime, and the extra memory needed is fairly modest.

>>
>> # Take a snapshot of the files.
>> takeFileSnapshot(directory, file.info <http://file.info> = TRUE, md5sum
>>
>> = FALSE, full.names = FALSE, recursive = TRUE, ...)
>>
>> # Take a snapshot using the same options as used for snapshot.
>> retakeFileSnapshot(snapshot, directory = snapshot$directory) {
>>     takeFileSnapshot)(directory, file.info <http://file.info> =
>> snapshot$file.info <http://file.info>, md5sum = snapshot$md5sum, etc)
>>
>> }
>>
>> compareFileSnapshots(snapshot1, snapshot2)
>> - or -
>> getNewFiles(snapshat1, snapshot2)       # These names are probably too
>> generic
>> getDeletedFiles(snapshot1, snapshot2)
>> getUpdatedFiles(snapshot1, snapshot2)
>> -or-
>> setdiff(snapshot1, snapshot2)  # Unclear how this should treat updated files
>>
>>
>> This approach does have the difficulty that users could attempt to
>> compare snapshots that were taken with different options and that can't
>> be compared, but that should be an easy error to detect.
>
>
> I don't want to add too many new functions.  The general R style is to have functions that do a lot, rather than have a lot of different functions to achieve different parts of related tasks.  This is better for interactive use (fewer functions to remember, a simpler help system to navigate), though it probably results in less readable code.


This is somewhat more nuanced and not particular to interactive use
IMHO.  Having functions that do a lot is good, _as long as the
semantics are always consistent_.  For example, lm() does a huge
amount and has a wide variety of ways that you can specify your data,
but it basically does the same thing no matter how you use it.  On the
other hand, if you have a function that does different things
depending on how you call it (e.g. reshape()) then it's easy to
remember the function name, but much harder to remember how to call it
correctly, harder to understand the documentation and less readable.

>
> I can see an argument for two functions (a get and a compare), but I don't think there are many cases where doing two gets and comparing the snapshots would be worth the extra runtime.  (It's extra because file.info is only a little faster than list.files, and it would be unavoidable to call both twice in that version.  Using the timestamp file avoids one of those calls, and replaces the other with file_test, which takes a similar amount of time.  So overall it's about 20-25% faster.)  It also makes the code a bit more complicated, i.e. three calls (get, get, compare) instead of two (get, compare).


I think a 'snapshotDirectory' and 'compareDirectoryToSnapshot'
combination might work well.

Thanks,

Karl



More information about the R-devel mailing list