[R] CSV value not being read as it appears

Heinz Tuechler tuechler at gmx.at
Fri Jan 14 17:08:45 CET 2011


At 14.01.2011 07:09 -0800, Peter Ehlers wrote:
>On 2011-01-14 02:09, bgreen at dyson.brisnet.org.au wrote:
>>Brian,
>>
>>Thanks. My response to David follows. I should add that this problem has
>>never occurred previously as far as I know (I have now checked the
>>previous report I was sent):
>
>This problem occurs to me frequently. Like Philipp and David,
>I too always check imported categorical variables. The worst
>cases are trailing spaces (in quoted text).


These are still the best "worst cases". My favourite "worst cases"
are entries like "5-10" or similar that are transformed into dates,
e.g. 05Oct2011. My problem, however, is that I don't know any other
universally known format to exchange data with a medical colleague
or with a social scientist.

Heinz
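
A minimal sketch of one way to spot this kind of damage after import, assuming the affected column arrives in R as text. The sample values and the regular expression are illustrative assumptions, not taken from the thread; the pattern only flags strings shaped like 05Oct2011 or 5-Oct.

```r
# Hypothetical values for a column that should hold ranges such as "5-10".
# Two entries have already been turned into dates by the spreadsheet.
x <- c("5-10", "05Oct2011", "1-3", "5-Oct", "10-20")

# Flag entries that contain an English month abbreviation next to a digit.
month <- "(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
pattern <- paste0("[0-9]-?", month, "|", month, "-?[0-9]")
suspect <- grepl(pattern, x, ignore.case = TRUE)

# Show the suspicious entries and their positions for manual correction.
data.frame(row = which(suspect), value = x[suspect])
```

A check like this can only point to the rows that need fixing at the source; it does not recover what was originally typed.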


>It is hardly R's fault that Excel users routinely commit
>crimes against data.
>
>Peter Ehlers
>
>>Hello David,
>>
>>Thanks for your e-mail. The data was a report derived from a statewide
>>database, saved in EXCEL format, so the usual vagaries of human data
>>entry weren't the issue: the data was an automated report, which is run
>>every three months. I would not have even noticed this problem if I
>>hadn't been double checking the numbers of people by district. Visual
>>inspection didn't reveal this problem - no white space was obvious and
>>the spelling was identical. Tabulation via R wouldn't have detected
>>this - I was obtaining the EXCEL totals via filter which I then
>>compared with R output. I'm hoping I can skip this step, in future, with
>>Jim's suggestion.
>>
>>regards
>>
>>Bob
>>
>>>On Fri, 14 Jan 2011, David Scott wrote:
>>>
>>>>As a further note, this is a reminder that whenever you get data via
>>>>a spreadsheet the first thing to do is examine it and clean up any
>>>>problems. A basic requirement is to tabulate any categorical
>>>>variable. Spreadsheets allow any sort of data to be entered, with no
>>>>controls. My experience is that those who enter data into
>>>>spreadsheets enter all sorts of variations of what a human would
>>>>wish to treat as the same ("Open", "Open ", "open", etc.), even when
>>>>told not to.
>>>
>>>Another common problem is that they enter characters such as
>>>non-breaking space or zero-width characters: we added support for
>>>known encodings of NBSP to strip.white about five years ago.
>>>
>>>>
>>>>David Scott
>>>>
>>>>On 14/01/2011 4:03 p.m., Jim Holtman wrote:
>>>>>try strip.white=TRUE to strip out white space
>>>>>
>>>>>On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
>>>>>
>>>>>>
>>>>>>I have a frustrating issue which I am hoping someone may have a
>>>>>>suggestion about.
>>>>>>
>>>>>>I am running XP and R 2.12.0 and saved an EXCEL file that I was sent
>>>>>>as a csv file.
>>>>>>
>>>>>>The initial code I ran follows.
>>>>>>
>>>>>>dec <- read.csv("g://FMH/FO30122010.csv", header = TRUE)
>>>>>>dec.open <- subset(dec, Status == "Open")
>>>>>>table(dec.open$AMHS)
>>>>>>
>>>>>>I was checking the output and noticed a difference between my manual
>>>>>>count and R output. Two subjects' rows were not being detected by the
>>>>>>subset command:
>>>>>>
>>>>>>For the AMHS where there was a discrepancy I then ran:
>>>>>>wm <- subset(dec, AMHS == "WM")
>>>>>>
>>>>>>The problem appears to be that there is a space before the "Open"
>>>>>>value for two individuals, as per the example below.
>>>>>>
>>>>>>10/02/2010  Open
>>>>>>22/08/2007   Open
>>>>>>
>>>>>>Checking in EXCEL there does not appear to be a space and the format
>>>>>>is the same (e.g. 'general'). I resolved the problem by copying over
>>>>>>the values for the two individuals where I identified a problem.
>>>>>>
>>>>>>Given this problem was not detected by visual scanning I would
>>>>>>appreciate advice on how this problem can be detected in future
>>>>>>without my having to manually check raw data against R output.
>>>>>>
>>>>>>Any assistance is appreciated,
>>>>>>
>>>>>>Bob
>>>>>>
>>>>>>______________________________________________
>>>>>>R-help at r-project.org mailing list
>>>>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>PLEASE do read the posting guide
>>>>>>http://www.R-project.org/posting-guide.html
>>>>>>and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>>
>>>>--
>>>>_________________________________________________________________
>>>>David Scott     Department of Statistics
>>>>                 The University of Auckland, PB 92019
>>>>                 Auckland 1142,    NEW ZEALAND
>>>>Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
>>>>Email:  d.scott at auckland.ac.nz,  Fax: +64 9 373 7018
>>>>
>>>>Director of Consulting, Department of Statistics
>>>>
>>>
>>>--
>>>Brian D. Ripley,                  ripley at stats.ox.ac.uk
>>>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>>>University of Oxford,             Tel:  +44 1865 272861 (self)
>>>1 South Parks Road,                     +44 1865 272866 (PA)
>>>Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>
>
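
The suggestions quoted in this thread (strip.white=TRUE, and tabulating every categorical variable) can be sketched as follows. This is a minimal illustration under stated assumptions, not the original poster's data: the inline CSV text and its values are made up, and only the column names Status and AMHS come from the post.

```r
# Made-up CSV text standing in for the exported file; two Status values
# carry a leading space, as described in the original post.
csv_text <- "Date,AMHS,Status
10/02/2010,WM,Open
22/08/2007,WM, Open
03/05/2009,WM, Open
11/11/2010,NS,Closed"

# Default import keeps the leading space, so " Open" and "Open" differ.
# stringsAsFactors = FALSE keeps Status as character in every R version.
dec_raw <- read.csv(text = csv_text, header = TRUE,
                    stringsAsFactors = FALSE)

# Tabulate with the values wrapped in brackets so stray blanks are visible.
table(paste0("[", dec_raw$Status, "]"))

# Mark values with leading or trailing whitespace. [[:space:]] covers
# ordinary blanks and tabs; the explicit \u00a0 adds the non-breaking space.
padded <- grepl("^[[:space:]\u00a0]|[[:space:]\u00a0]$", dec_raw$Status)
unique(dec_raw$Status[padded])

# Re-import with strip.white = TRUE, as suggested in the thread.
dec_clean <- read.csv(text = csv_text, header = TRUE, strip.white = TRUE,
                      stringsAsFactors = FALSE)
nrow(subset(dec_clean, Status == "Open"))
```

According to the read.table documentation, strip.white applies to unquoted character fields only, so trailing blanks inside quoted text would need a separate step such as trimws() on the column after import.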


