[Rd] [External] readChar() could read the whole file by default?

Duncan Murdoch murdoch@dunc@n @end|ng |rom gm@||@com
Mon Jan 29 19:23:41 CET 2024


On 29/01/2024 1:09 p.m., Toby Hocking wrote:
> My opinion is that the proposed feature would be greatly appreciated by users.
> I had always wondered if I was the only one doing paste(readLines(f),
> collapse="\n") all the time.
> It would be great to have the proposed, more straightforward way to
> read the whole file as a string: readChar("my_file.txt", -1) or even
> better readChar("my_file.txt")
> Thanks for your detailed analysis Michael.

These two things aren't the same.  The result of

   paste(readLines(f), collapse = "\n")

differs from that of

   readChar(f, file.size(f))

in cases where the file has Windows-style newlines and you're reading it
on Unix, because the first one converts the CR LF newlines into \n,
while the second would give \r\n.  (I think they would match for reading
Unix-style files on Windows.)
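
A minimal sketch of that difference, assuming a throwaway temp file whose
contents are made up for illustration.  The file is written byte for byte
with writeBin() so that the CR LF endings survive on any platform:

```r
# Write a small file with Windows-style (CR LF) line endings, byte for byte.
f <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\r\nb\r\n"), f)

# readLines() accepts LF, CR LF or CR as end of line and strips it.
via_lines <- paste(readLines(f), collapse = "\n")

# readChar() on a file name reads the raw characters, CRs included.
via_char <- readChar(f, file.size(f))

print(via_lines)                       # "a\nb"
print(via_char)                        # "a\r\nb\r\n"
print(identical(via_lines, via_char))  # FALSE

unlink(f)
```

In this sketch the two results also differ in the final newline, which
the readLines() version drops.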

Does this ever matter?  I don't know, but I think usually people would 
want the behaviour of paste(readLines(f), collapse = "\n").
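
One way to get closer to that behaviour while still using readChar() might
be to normalise the newlines afterwards.  This is only a sketch:
read_whole_file() is a hypothetical helper, not something in base R, and
the temp file is made up for illustration.

```r
# Hypothetical helper (not part of base R): read the whole file with
# readChar(), then turn CR LF and lone CR into "\n".
read_whole_file <- function(path) {
  txt <- readChar(path, file.size(path))
  gsub("\r\n?", "\n", txt)
}

# Try it on a file with Windows-style line endings.
f <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\r\nb\r\n"), f)

res <- read_whole_file(f)
print(res)  # "a\nb\n"

unlink(f)
```

Unlike paste(readLines(f), collapse = "\n"), this keeps the trailing
newline, so the two still aren't identical.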

Duncan Murdoch


> 
> On Fri, Jan 26, 2024 at 2:05 PM luke-tierney--- via R-devel
> <r-devel using r-project.org> wrote:
>>
>> On Fri, 26 Jan 2024, Michael Chirico wrote:
>>
>>> I am curious why readLines() has a default (n=-1L) to read the full
>>> file while readChar() has no default for nchars= (i.e., readChar(file)
>>> is an error). Is there a technical reason for this?
>>>
>>> I often[1] see code like paste(readLines(f), collapse="\n") which
>>> would be better served by readChar(), especially given issues with the
>>> global string cache I've come across[2]. But lacking the default, the
>>> replacement might come across less clean.
>>
>> The string cache seems like a very dark pink herring to me. The fact
>> that the lines are allocated on the heap might create an issue; the
>> cache isn't likely to add much to that. In any case I would need to
>> see a realistic example to convince me this is worth addressing on
>> performance grounds.
>>
>> I don't see any reason in principle not to have readChar and readBin
>> read the entire file if n = -1 (others might) but someone would need
>> to write a patch to implement that.
>>
>> Best,
>>
>> luke
>>
>>> For my own purposes the incantation readChar(file, file.size(file)) is
>>> ubiquitous. Taking CRAN code[3] as a sample[4], 41% of readChar()
>>> calls use either readChar(f, file.info(f)$size) or readChar(f,
>>> file.size(f))[5].
>>>
>>> Thanks for the consideration and feedback,
>>> Mike C
>>>
>>> [1] e.g. a quick search shows O(100) usages in CRAN packages:
>>> https://github.com/search?q=org%3Acran+%2Fpaste%5B%28%5D%5Cs*readLines%5B%28%5D.*%5B%29%5D%2C%5Cs*collapse%5Cs*%3D%5Cs*%5B%27%22%5D%5B%5C%5C%5D%2F+lang%3AR&type=code,
>>> and O(1000) usages generally on GitHub:
>>> https://github.com/search?q=lang%3AR+%2Fpaste%5B%28%5D%5Cs*readLines%5B%28%5D.*%5B%29%5D%2C%5Cs*collapse%5Cs*%3D%5Cs*%5B%27%22%5D%5B%5C%5C%5D%2F+lang%3AR&type=code
>>> [2] AIUI the readLines() approach "pollutes" the global string cache
>>> with potentially 1000s/10000s of strings for each line, only to get
>>> them gc()'d after combining everything with paste(collapse="\n")
>>> [3] The mirror on GitHub, which includes archived packages as well as
>>> current (well, eventually-consistent) versions.
>>> [4] Note that usage in packages is likely not representative of usage
>>> in scripts, e.g. I often saw readChar(f, 1), or eol-finders like
>>> readChar(f, 500) + grep("[\n\r]"), which makes more sense to me as
>>> something to find in package internals than in analysis scripts. FWIW
>>> I searched an internal codebase (scripts and packages) and found 70%
>>> of usages reading the full file.
>>> [5] repro: https://gist.github.com/MichaelChirico/247ea9500460dca239f031e74bdcf76b
>>> requires GitHub PAT in env GITHUB_PAT for API permissions.
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>> --
>> Luke Tierney
>> Ralph E. Wareham Professor of Mathematical Sciences
>> University of Iowa                  Phone:             319-335-3386
>> Department of Statistics and        Fax:               319-335-3017
>>      Actuarial Science
>> 241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu