[Rd] read.table / type.convert with NA values
Peter Ehlers
ehlers at ucalgary.ca
Wed Jun 30 03:45:35 CEST 2010
Is there a compelling reason to have strip.white default
to FALSE? It seems to me that it would be more common to
want the TRUE case.
Having said that, I must confess that I've never had the
problem Erik describes.
-Peter Ehlers
On 2010-06-29 17:14, Matt Shotwell wrote:
> The document RFC 4180 (which appears to be the CSV standard used by R,
> see ?read.table) considers spaces to be part of the fielded value. Some
> have taken this to mean that all white space characters should be
> considered part of the fielded value, though the RFC is not explicit
> here. Hence, this behavior is in compliance with the "standard" for CSV
> files. It seems that R treats '\t' (and perhaps all?) separated value
> files the same way by default.
>
> The RFC is very short and easy to read if you're interested.
> http://tools.ietf.org/html/rfc4180
>
> -Matt
>
> On Tue, 2010-06-29 at 16:41 -0400, Erik Iverson wrote:
>> Hello,
>>
>> While assisting a fellow R-helper off list, I narrowed down an issue he
>> was having to the following behavior of type.convert, called through
>> read.table. This is using R 2.10.1; if newer versions don't exhibit
>> this behavior, apologies.
>>
>> # generates numeric vector
>> > type.convert(c("123.42", "NA"))
>> [1] 123.42 NA
>>
>> # generates a numeric vector, notice the space before 123.42
>> > type.convert(c(" 123.42 ", "NA"))
>> [1] 123.42 NA
>>
>> # generates factor, notice the space before NA
>> # note that the 2nd element is actually " NA", not a true NA value
>> > type.convert(c("123.42", " NA"))
>> [1] 123.42 NA
>> Levels: 123.42 NA
>>
>>
>> How can this affect read.table/read.csv use 'in the wild'?
>>
>> This gentleman had a data file that
>>
>> 1) was delimited by something other than white space, CSV in his case
>> 2) contained missing values, designated by NA in his case
>> 3) contained white space between delimiters and data values, e.g.,
>>
>> NA, NA, 4.5, NA
>>
>> as opposed to
>>
>> NA,NA,4.5,NA
>>
>>
>> With these 3 conditions met, read.table gives type.convert a character
>> vector like my third example above, and ultimately he got a data.frame
>> consisting of only factors when we were expecting numeric columns. This
>> was easily fixed either by modifying the read.csv function call to
>> specify colClasses directly, or in his case, strip.white = TRUE did the
>> job just fine.
>>
>> I believe the confusion stems from the fact that with no NA values in
>> our data file, this would work as we would expect. The introduction of
>> what we thought were NA values changed the behavior. In reality, these
>> were not being treated as NA values by read.table/type.convert. The
>> question is, should they be in this case?
>>
>> This behavior of read.table/type.convert may very well be what is
>> expected/needed. If so, this note could still be of use to someone in
>> the future if they stumble upon similar behavior. The fact I wasn't
>> able to uncover anyone who asked about it on list before probably means
>> the situation is rare.
>>
>> Best Regards,
>> Erik Iverson
>>
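
For anyone who wants to reproduce the situation Erik describes, here is a minimal sketch. The inline data and the variable names (csv_text, raw, stripped, typed) are made up for illustration and do not come from the original file. It also assumes an R version whose read.csv accepts the text= argument.

```r
# Hypothetical data: comma-delimited, with white space after each delimiter.
csv_text <- "x,y
1.5, NA
NA, 4.5
2.0, 3.1"

# Default (strip.white = FALSE): the field " NA" is not the NA string "NA".
raw <- read.csv(text = csv_text)

# Fix 1: strip leading and trailing white space from each field.
stripped <- read.csv(text = csv_text, strip.white = TRUE)

# Fix 2: declare the column types up front so no type guessing is needed.
typed <- read.csv(text = csv_text, colClasses = "numeric")

# Compare the resulting column classes.
print(sapply(raw, class))
print(sapply(stripped, class))
print(sapply(typed, class))
```

Under the R 2.10.1 session quoted above, the y column of raw would be expected to come back as a factor. Since R 4.0.0 changed the stringsAsFactors default, a character column seems more likely there. The other two calls should give numeric columns with a true NA in the first row of y.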