[Rd] Comments requested on "changedFiles" function

Karl Millar kmillar at google.com
Sat Sep 7 05:07:36 CEST 2013


On Fri, Sep 6, 2013 at 7:03 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 13-09-06 9:21 PM, Karl Millar wrote:
>>
>> Hi Duncan,
>>
>> I like the interface of this version a lot better, but there's still a
>> bunch of implementation details that need fixing:
>>
>> * As previously mentioned, there are important cases where the mtime
>> values change in ways that this code doesn't detect.
>> * If the timestamp file (which is usually in the temp directory) gets
>> deleted (which can happen after a moderate amount of time of
>> inactivity on some systems), then the file_test('-nt', ...) will
>> always return false, even if the file has changed.
>
>
> If that happened without user intervention, I think it would break other
> things in R -- the temp directory is supposed to last for the whole session.
> But I should be checking anyway.

Yes, it does break other things in R -- my experience has been that
the help system seems to be the one that is impacted the most by this.
 FWIW, I've never seen the entire R temp directory deleted, just
individual files and subdirectories in it, but even that probably
depends on how the machine is configured.  I suspect only a few users
ever notice this, but my R use is probably somewhat anomalous and I
think it only happens to R sessions that I haven't used for a few
days.

>> * If files get added or deleted between the two calls to list.files in
>> fileSnapshot, it will fail with an error.
>
>
> Yours won't work if path contains more than one directory.  This is probably
> a reasonable restriction, but it's inconsistent with list.files, so I'd like
> to avoid it if I can find a way.

I'm currently unsure what the behaviour when comparing snapshots with
multiple directories should be.

Presumably we should have the property that (horribly abusing notation
for succinctness):
      compareSnapshots(c(a1, a2),  c(a1, a2))
is the same as concatenating (in some form)
      compareSnapshots(a1, a1) and compareSnapshots(a2, a2)
and there's a bunch of ways we could concatenate -- we could return a
list of results, or a single result where each of the 'added, deleted,
modified' fields are a list, or where we concatenate the 'added,
deleted, modified' fields together into three simple vectors.
Concatenating the vectors together like this is appealing, but unless
you're using the full names, it doesn't include the information of
which directory the changes are in, and using the full names doesn't
work in the case where you're comparing different sets of directories,
e.g. compareSnapshots(c(a1, a2), c(b1, b2)), where there is no
sensible choice for a full name.  The list options don't have this
problem, but are harder to work with, particularly for the common case
where there's only a single directory.  You'd also have to be somewhat
careful with filenames that occur in both directories.

Maybe I'm just being dense, but I don't see a way to do this thats
clear, easy to use and wouldn't confuse users at the moment.

Karl

> Duncan Murdoch
>
>
>> * If the path is on a remote file system, tempdir is local, and
>> there's significant clock skew, then you can get incorrect results.
>>
>> Unfortunately, these aren't just theoretical scenarios -- I've had the
>> misfortune to run up against all of them in the past.
>>
>> I've attached code that's loosely based on your implementation that
>> solves these problems AFAICT.  Alternatively, Hadley's code handles
>> all of these correctly, with the exception that compare_state doesn't
>> handle the case where safe_digest returns NA very well.
>>
>> Regards,
>>
>> Karl
>>
>> On Fri, Sep 6, 2013 at 5:40 PM, Duncan Murdoch <murdoch.duncan at gmail.com>
>> wrote:
>>>
>>> On 13-09-06 7:40 PM, Scott Kostyshak wrote:
>>>>
>>>>
>>>> On Fri, Sep 6, 2013 at 3:46 PM, Duncan Murdoch
>>>> <murdoch.duncan at gmail.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On 06/09/2013 2:20 PM, Duncan Murdoch wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have now put the code into a temporary package for testing; if
>>>>>> anyone
>>>>>> is interested, for a few days it will be downloadable from
>>>>>>
>>>>>> fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Sorry, error in the URL.  It should be
>>>>>
>>>>> http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz
>>>>
>>>>
>>>>
>>>> Works well. A couple of things I noticed:
>>>>
>>>> (1)
>>>> md5sum is being called on directories, which causes warnings. (If this
>>>> is not viewed as undesirable, please ignore the rest of this comment.)
>>>> Should this be the responsibility of the user (by passing arguments to
>>>> list.files)? In the example, changing
>>>> fileSnapshot(dir, file.info=TRUE, md5sum=TRUE)
>>>> to
>>>> fileSnapshot(dir, file.info=TRUE, md5sum=TRUE, include.dirs=FALSE,
>>>> recursive=TRUE")
>>>>
>>>> gets rid of the warnings. But perhaps the user just wants to exclude
>>>> directories for the md5sum calculations. This can't be controlled from
>>>> fileSnapshot.
>>>
>>>
>>>
>>> I don't see the warnings, I just get NA values.  I'll try to see why
>>> there's
>>> a difference.  (One possibility is my platform (Windows); another is that
>>> I'm generally testing in R-patched and R-devel rather than the 3.0.1
>>> release
>>> version.)  I would rather suppress the warnings than make the user avoid
>>> them.
>>>
>>>
>>>>
>>>> Or, should the "if (md5sum)" chunk subset "fullnames" using file_test
>>>> or file.info to exclude directories (and then fill in the directories
>>>> with NA)?
>>>>
>>>> (2)
>>>> If I run example(changedFiles) several times, sometimes I get:
>>>>
>>>> chngdF> changedFiles(snapshot)
>>>> File changes:
>>>>         mtime md5sum
>>>> file2  TRUE   TRUE
>>>>
>>>> and other times I get:
>>>>
>>>> chngdF> changedFiles(snapshot)
>>>> File changes:
>>>>         md5sum
>>>> file2   TRUE
>>>>
>>>> I wonder why.
>>>
>>>
>>>
>>> Sometimes the example runs so quickly that the new version has exactly
>>> the
>>> same modification time as the original.  That's the risk of the mtime
>>> check.
>>> If you put a delay between, you'll get consistent results.
>>>
>>> Duncan Murdoch
>>>
>>>
>>>>
>>>> Scott
>>>>
>>>>> sessionInfo()
>>>>
>>>>
>>>> R Under development (unstable) (2013-08-31 r63780)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] testpkg_1.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] tools_3.1.0
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Scott Kostyshak
>>>> Economics PhD Candidate
>>>> Princeton University
>>>>
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>



More information about the R-devel mailing list