[R] CSV value not being read as it appears

Fri Jan 14 10:31:38 CET 2011

On Fri, 14 Jan 2011, David Scott wrote:

> As a further note, this is a reminder that whenever you get data via 
> a spreadsheet the first thing to do is examine it and clean up any 
> problems. A basic requirement is to tabulate any categorical 
> variable. Spreadsheets allow any sort of data to be entered, with no 
> controls. My experience is that those who enter data into 
> spreadsheets enter all sorts of variations of what a human would 
> wish to treat as the same ("Open", "Open ", "open", etc.), even when 
> told not to.

Another common problem is that they enter characters such as a 
non-breaking space (NBSP) or zero-width characters: we added support for 
known encodings of NBSP to strip.white about five years ago.
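
For anyone who wants to automate the checks discussed in this thread, here 
is a minimal sketch in base R. The data are made up for illustration (only 
the column names Status and AMHS come from the original post), and the 
helper name find_suspect_values is hypothetical. It tabulates the raw 
values, flags any value that has leading or trailing white space or 
contains a non-ASCII character such as NBSP (U+00A0), and then shows the 
effect of strip.white = TRUE on ordinary spaces. It assumes a version of R 
recent enough to provide trimws() and the text= argument of read.csv().

```r
# Made-up data standing in for the CSV from the original post.
dec <- data.frame(
  Status = c("Open", " Open", "Open\u00a0", "Closed", "open"),
  AMHS   = c("WM", "WM", "WM", "XX", "WM"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the distinct values of a vector that carry
# leading/trailing ASCII white space or any non-ASCII character.
find_suspect_values <- function(x) {
  x <- unique(as.character(x))
  trimmed <- trimws(x)  # removes spaces, tabs and newlines at both ends
  # iconv() gives NA for strings that cannot be represented in ASCII.
  non_ascii <- is.na(iconv(x, from = "UTF-8", to = "ASCII"))
  x[x != trimmed | non_ascii]
}

# Step 1: tabulate; every variant shows up as its own category.
print(table(dec$Status, useNA = "ifany"))

# Step 2: list the values that need cleaning, quoted so spaces are visible.
suspect <- find_suspect_values(dec$Status)
print(encodeString(suspect, quote = '"'))

# Step 3: compare reading the same text with and without strip.white.
csv_text <- "Date,Status\n10/02/2010,Open\n22/08/2007, Open\n"
raw     <- read.csv(text = csv_text, stringsAsFactors = FALSE)
cleaned <- read.csv(text = csv_text, strip.white = TRUE,
                    stringsAsFactors = FALSE)
print(table(raw$Status))
print(table(cleaned$Status))
```

Note that the helper does not flag case variants such as "open"; those 
only show up in the tabulation in step 1, which is why tabulating first 
remains the basic check.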

>
> David Scott
>
> On 14/01/2011 4:03 p.m., Jim Holtman wrote:
>> try strip.white=TRUE to strip out white space
>> 
>> On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
>> 
>>> 
>>> I have a frustrating issue which I am hoping someone may have a suggestion
>>> about.
>>> 
>>> I am running XP and R 2.12.0. I was sent an Excel file, which I saved as
>>> a csv file.
>>> 
>>> The initial code I ran follows.
>>> 
>>> dec <- read.csv("g://FMH/FO30122010.csv", header = TRUE)
>>> dec.open <- subset(dec, Status == "Open")
>>> table(dec.open$AMHS)
>>> 
>>> I was checking the output and noticed a difference between my manual count
>>> and the R output. Two subjects' rows were not being detected by the subset
>>> command.
>>> 
>>> For the AMHS where there was a discrepancy I then ran:
>>> wm<- subset (dec, AMHS == "WM")
>>> 
>>> The problem appears to be that there is a space before the "Open" value
>>> for two individuals, as per the example below.
>>> 
>>> 10/02/2010  Open
>>> 22/08/2007   Open
>>> 
>>> Checking in Excel, there does not appear to be a space and the cell format
>>> is the same (e.g. 'General'). I resolved the problem by copying over the
>>> values for the two individuals where I identified a problem.
>>> 
>>> Given this problem was not detected by visual scanning I would appreciate
>>> advice on how this problem can be detected in future without my having to
>>> manually check raw data against R output.
>>> 
>>> Any assistance is appreciated,
>>> 
>>> Bob
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
>
>
> -- 
> _________________________________________________________________
> David Scott	Department of Statistics
> 		The University of Auckland, PB 92019
> 		Auckland 1142,    NEW ZEALAND
> Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
> Email:	d.scott at auckland.ac.nz,  Fax: +64 9 373 7018
>
> Director of Consulting, Department of Statistics
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595