[Rd] Inconsistency, may be bug in read.delim ?

Wed Mar 21 16:57:18 CET 2018

On 03/19/2018 02:23 PM, Detlef Steuer wrote:
> Dear friends,
>
> I stumbled into behaviour of read.delim which I would consider a bug
> or at least an inconsistency that should be improved upon.
>
> Recently we had to work with data that used "", two double quotes, as
> symbol to start and end character input.
>
> Essentially the data looked like this
>
> data.csv
> ========
> V1, V2, V3
> ""data"", 3, """"
>
> The last sequence of """" indicating a missing value.

After processing the quotes, this is internally parsed as

data 3 "

which I think is correct. In particular, """" represents a single
double-quote character (a quoted field containing one escaped quote),
and this conforms to RFC 4180. In contrast, "" represents an empty string.

Based on my reading of RFC 4180, ""data"" is not a valid field, but not
every CSV file follows that RFC, and R handles this pattern the way your
data expects. So you should be fine here.
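The quote handling described above can be checked directly with scan(),
which read.delim uses underneath. This is only a minimal sketch: the input
line and the name `fields` are made up for illustration and are not taken
from the original data.

```r
# Split one CSV-style line into character fields, using " as the quote
# character. The line holds three fields: """" then "" then x.
fields <- scan(text = '"""","",x', what = "", sep = ",", quote = "\"",
               quiet = TRUE)

print(fields)         # the parsed fields
print(nchar(fields))  # field lengths in characters
```

If the parsing is as described, the first field should be a single "
(length 1) and the second an empty string (length 0).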

> One obvious solution to read in this data is using some gsub(),
> but that's not the point I want to make.
>
> Consider this case we found during tests:
>
> test.csv
> ========
> V1, V2, V3, V4
> """", """", 3, ""
>
> and read it with
>> read.delim("test.csv", sep=",", header=TRUE, na.strings="\"")

After processing the quotes, this is internally parsed as

" " 3 <empty_string>

which again I think is correct (and conforms to RFC 4180).

> you get the following
>
>    V1 V2 V3 V4
> 1 NA  "  3 NA
>
> (and a warning)

I do not get the warning on my system. The reason the second " is not
translated to NA by na.strings is the white space after the comma in the
CSV file: the field then consists of a space followed by ", which does not
match na.strings. This works more consistently:

 > read.delim("test.csv", sep=",", header=TRUE, na.strings="\"",
              strip.white=TRUE)
   V1 V2 V3 V4
1 NA NA  3 NA

If one needed to differentiate between " and <empty_string>, then it 
might be necessary to run without the na.strings argument.
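For anyone who wants to reproduce the comparison, here is a minimal sketch.
It assumes the file contents exactly as quoted above and recreates them in
a temporary file; the names `loose`, `strict` and `plain` are only
illustrative.

```r
# Recreate test.csv from the report in a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c('V1, V2, V3, V4',
             '"""", """", 3, ""'), path)

# As in the original report: fields keep the blank that follows each comma.
loose <- read.delim(path, sep = ",", header = TRUE, na.strings = "\"")

# strip.white = TRUE drops the surrounding blanks, so both """" fields
# reduce to a bare " and match na.strings.
strict <- read.delim(path, sep = ",", header = TRUE, na.strings = "\"",
                     strip.white = TRUE)

# Without na.strings, a " stays a " and is not turned into NA.
plain <- read.delim(path, sep = ",", header = TRUE, strip.white = TRUE,
                    stringsAsFactors = FALSE)

print(loose)
print(strict)
print(plain)

unlink(path)  # remove the temporary file
```

The three calls differ only in na.strings and strip.white, so any
difference in the printed data frames comes from those two arguments.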

Best
Tomas

> I would have assumed to get some error message or at
> least the same result for both appearances of """" in the
> input file.
> (the setting na.strings="\"" turned out to be working for
>   a colleague and his specific data, while I think it shouldn't)
>
> My main concern is the different interpretation for the two """"
> sequences.
>
> Real bug? Minor inconsistency? I don't know.
>
> All the best
> Detlef