[R] R 1.2.1 - read.table - factors problem or is it a data.frame problem
gordon.harrington at uni.edu
Sun Feb 4 23:33:35 CET 2001
Brian Ripley notes:
> On Fri, 2 Feb 2001, Martin Maechler wrote:
>
> > >>>>> "PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
> >
> > PD> "Heberto Ghezzo" <Heberto at meakins.lan.mcgill.ca> writes:
> > >> I have some problems with read.table and floats turning up as
> > >> factors. In my case it was not a blank in the file but an unary
> > >> minus!! so 3.24,-57.23,... the 3.24 is numeric but -57.23 is a
> > >> factor. Yes I turned it into a numeric with
> > >> as.numeric(as.character(.. but I think it will be better to modify
> > >> somehow the read.table/read.csv code.
> > >> Thanks anyway.
> >
> > PD> That certainly sounds like a bug, but I can't reproduce it:
> >
> > PD> $ cat > xx
> > PD> -1,2,3
> > PD> 1,-2,3
> > PD> $ R
> > PD> ...
> > PD> > summary(read.csv('xx',head=F))
> > PD>        V1             V2           V3
> > PD>  Min.   :-1.0   Min.   :-2   Min.   :3
> > PD>  1st Qu.:-0.5   1st Qu.:-1   1st Qu.:3
> > PD>  Median : 0.0   Median : 0   Median :3
> > PD>  Mean   : 0.0   Mean   : 0   Mean   :3
> > PD>  3rd Qu.: 0.5   3rd Qu.: 1   3rd Qu.:3
> > PD>  Max.   : 1.0   Max.   : 2   Max.   :3
> >
> > PD> Could you give us some further details on the setup that is
> > PD> causing that effect?
> >
> > Heberto uses a Windoze mailer, hence probably ..
> >
> > It could be that the problem comes from the fact that some win users
> > use non-ASCII minus characters (i.e. not "minus", but they find them on
> > their keyboards when typing in the data ..):
> >
> > In iso_8859-1 aka "latin-1" (of which most European MSWin localizations
> > are said to be a superset) there are three kinds of "-" :
> >
> > Oct   Dec   Hex   Char   Description
> > --------------------------------------------------------------------
> > 055    45   2D    -      Minus [The standard ASCII one]
> > 255   173   AD           SOFT HYPHEN
> > 257   175   AF    ¯      MACRON
>
> Actually, not as far as I can find out (and I have been working on
> encodings for the next releases of R). The first really is hyphen in both
> latin-1 and WinAnsi (the main Windows char set: the other, WinOEM, is not a
> superset of latin-1). Minus is not in the WinAnsi char set, but it does
> have hyphen at 45 and 173 (it has two spaces too).
>
> Unfortunately Adobe's ISOLatin1 encoding for postscript is not the same as
> latin-1. That does have minus at 45 and (real) hyphen at 173.
>
> As Windows NT/2000 machines support Unicode, on those the set of
> possible inputs is much wider and I don't think R will cope with
> Unicode-encoded files. In Unicode minus is at 138 (and hyphen at 45).
>
> It's a possible explanation, but then I don't think
> as.numeric(as.character( would work. My guess was that there was some
> other non-printing character in that field, but that has the same
> counter-argument.
>
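The counter-argument can be checked directly. What follows is only a minimal
sketch: it assumes the suspect character is the soft hyphen (code 173), and the
values are made up for illustration, not taken from Heberto's file.

```r
# One field with the ordinary ASCII hyphen-minus (code 45) and one with
# a soft hyphen (code 173) in its place. Both are illustrative values.
ascii_field <- "-57.23"
soft_field  <- paste0(intToUtf8(0xAD), "57.23")

# A column that was read as a factor but holds plain ASCII text.
f <- factor(c("3.24", ascii_field))

# as.numeric(f) gives the internal level codes, not the values;
# going through as.character() recovers the numbers.
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))

# With the soft hyphen the string is not a valid number, so the
# conversion yields NA (normally with a warning), not -57.23.
bad <- suppressWarnings(as.numeric(soft_field))
```

If the conversion recovers the expected negative numbers, as Heberto reports,
a look-alike minus is an unlikely cause.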
I had sought help a few days earlier for a problem with some similarities. In
my case I had failed to recognize the existence of some NAs. I had a data set
which originated in 1966. Some IBM statistical packages of the era encoded NAs
as binary negative zeros. These were propagated in passes through the first
edition of SAS. I can't remember how they were then encoded in EBCDIC by
different FORTRAN compilers, nor ultimately in ASCII conversions. However, they
relied on program filters and were otherwise invisible.
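One way to make such invisible entries visible is to list the fields that fail
numeric conversion along with their bytes. This is a sketch only:
suspect_fields is a hypothetical helper name, and the sample column is
invented for illustration.

```r
# Hypothetical helper: list the entries of a column that do not parse
# as numbers, together with their bytes in hex, so that invisible or
# look-alike characters show up.
suspect_fields <- function(x) {
  s <- as.character(x)                        # also works for factors
  ok <- !is.na(suppressWarnings(as.numeric(s)))
  bad <- unique(s[!ok & !is.na(s)])
  bytes <- vapply(bad, function(v) paste(charToRaw(v), collapse = " "),
                  character(1), USE.NAMES = FALSE)
  data.frame(field = bad, bytes = bytes, stringsAsFactors = FALSE)
}

# Made-up column: two good values, one with a soft hyphen (code 173)
# in place of the minus, and one with a stray trailing character.
col <- factor(c("3.24", "-57.23",
                paste0(intToUtf8(0xAD), "57.23"), "1.5x"))
res <- suspect_fields(col)
```

Printing res shows each offending field next to its hex bytes, which is usually
enough to tell a stray character from a special missing-value code.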
Gordon M. Harrington Mail: 3720 Village Place, #6308
Professor Emeritus Waterloo, IA 50702-5848
University of Northern Iowa Phone: 319-291-8535
gordon.harrington at uni.edu Fax: 319-291-8491
dryfly at aya.yale.edu 319-291-8324
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._