[R] read_xlsx(readxl) apparently mangling some data input

Tue Feb 4 14:56:21 CET 2020

This list can be priceless (and has taught me so much over, hm, over a decade certainly now!)

Thanks both: makes perfect sense (of course) and shows my naivety in the way I was thinking about this.  
I'm intrigued that it's LibreOffice actually using lower precision that avoids the issue that was 
puzzling me, again, makes perfect sense.

Further, Peter's 
   ix <- !(is.na(xn)) & xn%%1 != 0
is delicious and exactly the sort of thing I don't see unaided.  I know I would have done something 
horribly more clumsy.  It's also the sort of little revelation about the potential power of %% that
I think I _will_ remember and no doubt find myself using again in the future.

Thanks both, huge help to me and, as I suspected, a wasteful github issue report prevented!

Chris

----- Original Message -----
> From: "PIKAL Petr" <petr.pikal using precheza.cz>
> To: "Chris Evans" <chrishold using psyctc.org>, "R-help Mailing List" <r-help using r-project.org>
> Sent: Tuesday, 4 February, 2020 13:39:31
> Subject: RE: read_xlsx(readxl) apparently mangling some data input

> Hi
> 
> Floating point representation
> 
> I prepared excel file with arbitrary first row and second row
> 
> 45.65 and 45.65/5
> 
> The division result should be 9.13 (exactly), but based on floation point
> representation in binary computers (FAQ 7.31) it results in 9.129999999...
> However Excel shows exact value (9.13) although internally it stores this
> 9.129999.. Probably they do not want to disturb its audience.
> 
> Therefore read_xlsx reads it correctly
> 
>> temp <- read_xlsx(file.choose())
>> temp
> # A tibble: 2 x 2
>     a1 a2
>  <dbl> <chr>
> 1  12   8,8,10
> 2  45.6 9.129999999999999
>> as.data.frame(temp)
>     a1                a2
> 1 12.00            8,8,10
> 2 45.65 9.129999999999999
> 
> Cheers
> Petr
> 
>> -----Original Message-----
>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Chris Evans
>> Sent: Tuesday, February 4, 2020 1:07 PM
>> To: R-help Mailing List <r-help using r-project.org>
>> Subject: [R] read_xlsx(readxl) apparently mangling some data input
>> 
>> This is a very odd error I'm hitting using read_xlsx from the readxl
> package
>> (version 1.3.1) with R version 3.6.2 (2019-12-12) , platform
> x86_64-pc-linux-
>> gnu (and updated Ubuntu 18.04). I have some largeish Excel spreadsheets
>> that contain clinical data. I can't share the entire raw data but I think
> I can
>> share the specific problem columns as Excel files, but not via the list as
> I'm
>> sure it rightly rejects such attachments.
>> 
>> The particular column contains entries like
>> 1
>> 1, 14
>> 
>> 1.14
>> 
>> That's to say it's a column that can have empty cells, or entries which
> should
>> be integers (a limited range of them) but cells may have multiple integers
>> and the data entry means that people use various separators, commas, full
>> stops and occasionally semi-colons or colons and all with or without
> various
>> amounts of space.
>> 
>> I thought this would be easy to handle but this illustrates the issue I'm
>> hitting:
>> 
>> > unique(read_xlsx("Book1.xlsx", col_types = "text"))
>> # A tibble: 18 x 1
>>    NOWARN
>>    <chr>
>>  1 NA
>>  2 14
>>  3 8,12,14
>>  4 13
>>  5 58
>>  6 9
>>  7 9.1300000000000008
>>  8 11
>>  9 11.14
>> 10 10
>> 11 10.14
>> 12 9.14
>> 13 13.14
>> 14 9 ,13
>> 15 9.11
>> 16 1
>> 17 1.1399999999999999
>> 18 1, 14
>> 
>> That's reading from a single column, 981 row (including column header)
>> Excel xlsx file in an up to date Windoze 10 Professional running in a VM
> on
>> the Ubuntu machine.
>> 
>> I created that file (which I can share) by copying the data from the full
> file to
>> a new Excel spreadsheet (M$ Orifice "Professional Plus 2019" "Version
> 1912"
>> "Build 12325.20344 Click-to-run" to an empty new Excel file and using the
>> default save_as.  The clinical data files were created in, and updated in,
>> versions of Excel that I can't access but the file was certainly created
> first
>> between two years and three months before now so probably with different
>> versions of Excel and probably in a Spanish or Catalan M$ locale.
>> 
>> The weird thing is that looking at the Excel cells that created those
>> "9.1300000000000008" and "1.1399999999999999" entries they show "9.13"
>> and "1.14" (respectively!).  They continue to show those values plus many
>> trailing zeroes if I use Excel formatting to ask for 20 decimal places (I
> get less
>> of course, but no arbitrary terminal rounding digit).
>> 
>> It appears to me that read_xlxs() is only applying the "col_types =
> "text""
>> argument _after_ reading the column freely, reading each cell guessing the
>> type by its contents and so ending up with numeric values for "9.13" and
>> "1.14" which are then picking up rounding errors and being forced to
>> character after that.  I say that the reading would appear to be free
> across all
>> cells in the column as there are entries of "8, 12, 14" coming before
> these
>> problem entries:
>> 
>> > tmp <- read_xlsx("Book1.xlsx", col_types = "text")
>> > grep("1.1399999999999999", tmp$NOWARN, fixed = TRUE)
>> [1] 932 948 954
>> > grep("9.1300000000000008", tmp$NOWARN, fixed = TRUE)
>>  [1]  73 189 190 271 272 390 511 645 686 710 744 830 899
>> > tmp$NOWARN[20]
>> [1] "8,12,14"
>> 
>> This seems completely bizarre to me.  I find it very hard to believe that
>> read_xlsx() would guess content class (type) freely by for each individual
>> entry and only apply the col_types argument after doing that as that would
>> seem likely to be incredibly inefficient for really big spreadsheets. It
> seems
>> equally hard to believe that it would then create rounding errors (for
> some
>> guessed numerics like 9.13 and 1.14 but not for others like 11.4).
> However,
>> my guess would appear to fit the results and I am only guessing because
> I'm
>> sure my programming comprehension isn't good enough to read into the
>> sources to actually work out how the function works.
>> 
>> To make things more interesting, and to suggest that at least some of the
>> problem is with Excel is that when I use LibreOffice (in Ubuntu) created a
>> Excel file in the same way, i.e. open the clinical Excel file but in
> LibreOffice,
>> copy and paste the same column into a new LibreOffice calc spreadsheet
>> and save as xlsx, tmp.xlsx, I get this:
>> 
>> > unique(read_xlsx("tmp.xlsx", col_types = "text"))
>> # A tibble: 18 x 1
>>    NOWARN
>>    <chr>
>>  1 NA
>>  2 14
>>  3 8,12,14
>>  4 13
>>  5 58
>>  6 9
>>  7 9.13
>>  8 11
>>  9 11.14
>> 10 10
>> 11 10.14
>> 12 9.14
>> 13 13.14
>> 14 9 ,13
>> 15 9.11
>> 16 1
>> 17 1.14
>> 18 1, 14
>> 
>> Exactly what I think I should be seeing. I was working in Rstudio but get
>> exactly the same in a new R terminal session with only readxl loaded so I
>> don't think this is any weird environment or other clash.
>> 
>> Obviously I can, though not terribly easily for a fully generic fix, catch
> these
>> weird rounding errors and correct them, I am sure can also report this as
> a
>> suspected bug to the maintainer through the github issues system but I
>> wanted to check here whether anyone could see something I'm missing as
>> I'm really a (clinically retired) therapist and doctor, now full time
> researcher
>> and I'm not a professional statistician or programmer.
>> 
>> TIA,
>> 
>> Chris
>> 
>> 
>> 
>> --
>> Chris Evans <chris using psyctc.org> Visiting Professor, University of Sheffield
>> <chris.evans using sheffield.ac.uk> I do some consultation work for the
>> University of Roehampton <chris.evans using roehampton.ac.uk> and other
>> places but <chris using psyctc.org> remains my main Email address.  I have a
>> work web site at:
>>    https://www.psyctc.org/psyctc/
>> and a site I manage for CORE and CORE system trust at:
>>    http://www.coresystemtrust.org.uk/
>> I have "semigrated" to France, see:
>>    https://www.psyctc.org/pelerinage2016/semigrating-to-france/
>> That page will also take you to my blog which started with earlier joys in
>> France and Spain!
>> 
>> If you want to book to talk, I am trying to keep that to Thursdays and my
>> diary is at:
>>    https://www.psyctc.org/pelerinage2016/ceworkdiary/
>> Beware: French time, generally an hour ahead of UK.
>> 
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
> > and provide commented, minimal, self-contained, reproducible code.

-- 
Chris Evans <chris using psyctc.org> Visiting Professor, University of Sheffield <chris.evans using sheffield.ac.uk>
I do some consultation work for the University of Roehampton <chris.evans using roehampton.ac.uk> and other places
but <chris using psyctc.org> remains my main Email address.  I have a work web site at:
   https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
   http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see: 
   https://www.psyctc.org/pelerinage2016/semigrating-to-france/ 
That page will also take you to my blog which started with earlier joys in France and Spain!

If you want to book to talk, I am trying to keep that to Thursdays and my diary is at:
   https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.