[Rd] read.csv
Petr Savicky
savicky at cs.cas.cz
Tue Jun 16 20:09:01 CEST 2009
On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote:
> On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:
> > If read.csv's colClasses= argument is NOT used then read.csv accepts
> > double quoted numerics:
> >
> > > read.csv(stdin())
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> > A B
> > 1 1 1
> > 2 2 2
> >
> > However, if colClasses is used then it seems that it does not:
> >
> >> read.csv(stdin(), colClasses = "numeric")
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> > na.strings, :
> > scan() expected 'a real', got '"1"'
> >
> > Is this really intended? I would have expected that a csv file
> > in which each field is surrounded with double quotes is acceptable
> > in both cases. This may be documented as-is, yet it seems undesirable
> > from both a consistency viewpoint and from the viewpoint that it
> > should be possible to double-quote fields in a csv file.
>
> Well, the default for colClasses is NA, for which ?read.csv says:
> [...]
> Possible values are 'NA' (when 'type.convert' is used),
> [...]
> and then ?type.convert says:
> This is principally a helper function for 'read.table'. Given a
> character vector, it attempts to convert it to logical, integer,
> numeric or complex, and failing that converts it to factor unless
> 'as.is = TRUE'. The first type that can accept all the non-missing
> values is chosen.
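[The selection order described in ?type.convert above can be checked directly. A small sketch; as.is is spelled out explicitly because newer R versions warn when it is left to its default:]

```r
# type.convert() tries logical, integer, numeric, complex in order and
# keeps the first type that accepts every non-missing value.
class(type.convert(c("1", "2"), as.is = TRUE))        # "integer"
class(type.convert(c("1.5", "2"), as.is = TRUE))      # "numeric"
class(type.convert(c("TRUE", "FALSE"), as.is = TRUE)) # "logical"
class(type.convert(c("1", "x"), as.is = TRUE))        # "character": no other type fits
```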
>
> It would seem that type 'logical' won't accept integer (naively one
> might expect 1 --> TRUE, but see experiment below), so the first
> acceptable type for "1" is integer, and that is what happens.
> So it is indeed documented (in the R[ecursive] sense of "documented" :))
>
> However, presumably when colClasses is used then type.convert() is
> not called; in that case R sees itself being asked to assign a
> character entity to a destination which it has been told shall be
> numeric. The default for as.is is
>     as.is = !stringsAsFactors
> but for this ?read.csv says that stringsAsFactors "is overridden
> bu [sic] 'as.is' and 'colClasses', both of which allow finer
> control", so that wouldn't come to the rescue either.
>
> Experiment:
> X <-logical(10)
> class(X)
> # [1] "logical"
> X[1]<-1
> X
> # [1] 1 0 0 0 0 0 0 0 0 0
> class(X)
> # [1] "numeric"
> so R has converted X from class 'logical' to class 'numeric'
> on being asked to assign a number to a logical; but in this
> case its hands were not tied by colClasses.
>
> Or am I missing something?!!
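[For what it's worth, the usual way around the colClasses failure shown above is to read the quoted column as character — at which point the quotes are stripped during input — and convert explicitly afterwards. A sketch, not the only possible fix; textConnection() stands in for stdin() so the example is self-contained:]

```r
csv <- 'A,B\n"1",1\n"2",2\n'

# With colClasses = "character", scan() strips the quotes during input,
# so the explicit conversion afterwards sees plain "1" and "2".
d <- read.csv(textConnection(csv), colClasses = "character")
d$A <- as.numeric(d$A)
d$B <- as.numeric(d$B)
str(d)  # both columns numeric, values 1 and 2
```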
In my opinion, you explain how the difference in behavior between
  read.csv(stdin(), colClasses = "numeric")
and
  read.csv(stdin())
arises, but not why it is so.
The algorithm "use the smallest type which accepts all non-missing values"
may equally well be applied to the input values either literally or after
removing the quotes. Is there a reason why
  read.csv(stdin())
removes the quotes from the input values before converting, while
  read.csv(stdin(), colClasses = "numeric")
does not?
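[The asymmetry can be reproduced in one self-contained snippet; a sketch, with textConnection() standing in for stdin():]

```r
csv <- 'A,B\n"1",1\n"2",2\n'

# Without colClasses the quotes are stripped and type.convert() runs:
ok <- read.csv(textConnection(csv))
sapply(ok, class)  # both columns come back "integer"

# With colClasses = "numeric", scan() is handed '"1"' verbatim and fails:
err <- tryCatch(read.csv(textConnection(csv), colClasses = "numeric"),
                error = conditionMessage)
err  # an error message like: scan() expected 'a real', got '"1"'
     # (exact wording may vary by R version)
```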
Using double-quote characters is part of the definition of the CSV format;
see, for example,
http://en.wikipedia.org/wiki/Comma_separated_values
where one may find:
  Fields may always be enclosed within double-quote characters, whether necessary or not.
Petr.