[Rd] list_files() memory corruption?
Seth Falcon
seth at userprimary.net
Mon Mar 22 17:26:50 CET 2010
On 3/20/10 2:03 PM, Seth Falcon wrote:
> On 3/20/10 1:36 PM, Alistair Gee wrote:
>> I fixed my build problems. I also noticed that my patch wasn't
>> correct, so I have attached a new version.
>>
>> This fix still grows the vector by doubling it until it is big enough,
>> but the length is reset to the correct size at the end once it is
>> known.
>>
>> This fix differs from the existing fix in Subversion in the following scenario:
>>
>> 1. Create file Z in a directory with 1 other file, named Y.
>> 2. Call dir() to retrieve the list of files.
>> 3. dir() counts 2 files.
>> 4. While dir() is executing, some other process creates file X in the directory.
>> 5. dir() retrieves the list of files, stopping after 2 files. But by
>> chance, it retrieves files X and Y (but not Z).
>> 6. dir() returns files X and Y, which could be misinterpreted to mean
>> that file Z does not exist.
>>
>> In contrast, with the attached fix, dir() would return all 3 files.
>
> I think the scenario you describe could happen with either version.
> Once you've read the files in a directory, all bets are off. Anything
> could happen between the time you readdir() and return results back to
> the user. I agree, though, that avoiding two calls to readdir narrows
> the window.
>
>> Also, the existing fix in Subversion doesn't seem to handle the case
>> where readdir() returns fewer files than were originally counted, as
>> it doesn't decrease the length of the vector.
>
> Yes, that's a limitation of the current fix.
>
> Have you run 'make check-devel' with your patch applied? Have you run
> any simple tests using dir() or list.files() with recursive=TRUE on
> a reasonably large directory and compared the times and memory use
> reported by gc()?
>
> It is often the case that writing the patch is the easy/quick part and
> making sure that one hasn't introduced new infelicities or unintended
> behavior is the hard part.
>
> I will try to take another look at your latest patch.
I've applied a modified version of your patch. In the testing I did,
avoiding the counting step made large directory listings with
recursive=TRUE almost 2x faster, at the cost of a bit more memory.
The code also now includes a check for user interrupt, so that you can
C-c out of a dir()/list.files() call more quickly.
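
For anyone following along, the shape of the change is roughly the
single-pass loop below: grow the result by doubling, truncate it to the
actual count at the end, and check for interrupts while reading. This is
only an illustrative sketch, not the committed code; the helper name and
starting size are invented, and the real code also deals with patterns,
recursion, and error handling.

/* Illustrative sketch only -- not the committed code.  Grow the result
   by doubling, truncate to the real count at the end, and allow the
   user to interrupt a long listing. */
#include <dirent.h>
#include <string.h>
#include <Rinternals.h>
#include <R_ext/Utils.h>        /* R_CheckUserInterrupt */

static SEXP list_dir_sketch(const char *path)   /* hypothetical helper */
{
    R_len_t count = 0, alloc = 64;              /* arbitrary starting size */
    SEXP ans;
    PROTECT_INDEX ipx;
    struct dirent *de;
    DIR *dir = opendir(path);

    if (dir == NULL)
        Rf_error("cannot open directory '%s'", path);

    PROTECT_WITH_INDEX(ans = Rf_allocVector(STRSXP, alloc), &ipx);
    while ((de = readdir(dir)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (count == alloc) {                   /* grow by doubling */
            alloc *= 2;
            REPROTECT(ans = Rf_lengthgets(ans, alloc), ipx);
        }
        SET_STRING_ELT(ans, count++, Rf_mkChar(de->d_name));
        /* May not return if the user hits C-c; real code must arrange
           to close 'dir' in that case. */
        R_CheckUserInterrupt();
    }
    closedir(dir);
    /* Reset the length to the number of entries actually read. */
    REPROTECT(ans = Rf_lengthgets(ans, count), ipx);
    UNPROTECT(1);
    return ans;
}

Reading the directory only once is where the speedup comes from; the
doubling keeps the number of copies logarithmic, at the cost of
over-allocating by up to 2x before the final truncation, which is the
extra memory mentioned above.
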
Thanks for putting together the patch.
+ seth
--
Seth Falcon | @sfalcon | http://userprimary.net/