[R] Can file size affect how na.strings operates in a read.table call?

Sebastien Bihorel Sebastien.Bihorel using cognigencorp.com
Thu Nov 14 18:38:13 CET 2019


Thanks Bill and Jeff

strip.white did not change the outcomes.

However, your input led me to compare the raw content of the files (i.e., outside of an IDE), and I found a difference in how the apparent -99 values were stored. In the big file, some -99 values are stored as floats rather than integers and thus include a decimal point and trailing zeros, so they do not match the na.strings entry "-99" exactly.

The creation of the smaller files removed the decimal point and trailing zeros, which explains why read.table gave the "right" response on these smaller files.
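
For anyone who wants to run the same kind of check from within R, here is a minimal sketch. It assumes a toy string standing in for the real file (which cannot be shared); the column names ID and DV and the helper spellings() are made up for illustration. Reading every field as text and listing the distinct spellings that are numerically equal to -99 shows which ones an exact na.strings match would miss.

```r
# Toy stand-in for the confidential file: the same value written three ways.
s <- "ID,DV\n1,-99\n2,-99.000\n3,-99.0\n4,12.5\n5,.\n"

# Read every field as text so no numeric conversion hides the original spelling.
raw <- read.csv(text = s, header = TRUE, colClasses = "character")

# Hypothetical helper: distinct strings in a column whose numeric value
# equals the sentinel (-99 is assumed here, as in the thread).
spellings <- function(x, sentinel = -99) {
  num <- suppressWarnings(as.numeric(x))
  unique(x[!is.na(num) & num == sentinel])
}

variants <- lapply(raw, spellings)
variants$DV  # every spelling of -99 found in column DV
```

Any spelling listed here that is not also in na.strings will be read as a number rather than as NA.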

So, it looks like this is the problem and that some additional post-processing may be warranted.
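
One possible shape for that post-processing, as a sketch only: the toy data, the column names, and the helper sentinel_to_na() are assumptions for illustration, not the actual wrapper. The idea is to let read.csv do its exact string matching first, then recode any remaining numeric -99 to NA.

```r
# Toy data: -99 written as an integer and as floats, plus a "." missing code.
s <- "ID,DV\n1,-99\n2,-99.000\n3,-99.0\n4,12.5\n5,.\n"

dta <- read.csv(text = s, header = TRUE, na.strings = c("-99", "."))
# At this point only the fields spelled exactly "-99" or "." are NA;
# "-99.000" and "-99.0" were converted to the number -99.

# Hypothetical helper: recode a numeric sentinel to NA in all numeric columns.
sentinel_to_na <- function(df, sentinel = -99) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], function(x) {
    x[!is.na(x) & x == sentinel] <- NA
    x
  })
  df
}

clean <- sentinel_to_na(dta)
```

Note that this recodes every numeric -99, including any that are genuine observations, so a generic wrapper would probably want to apply it only when the caller asks for it.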

Thanks for the hints.

________________________________
From: William Dunlap <wdunlap using tibco.com>
Sent: Thursday, November 14, 2019 11:51
To: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
Cc: Sebastien Bihorel <Sebastien.Bihorel using cognigencorp.com>; r-help using r-project.org <r-help using r-project.org>
Subject: Re: [R] Can file size affect how na.strings operates in a read.table call?

read.table (and friends) also have the strip.white argument:

> s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=TRUE)
  A  B  C
1 0  0  0
2 1 NA NA
3 2 NA NA
4 3 NA NA
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=FALSE)
  A   B   C
1 0   0   0
2 1  NA  NA
3 2 -99  NA
4 3 -99 -99

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Nov 14, 2019 at 8:35 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us> wrote:
Consider the following sample:

#####
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

dta_notok <- read.csv( text = s
                      , header=TRUE
                      , na.strings = c( "-99", "" )
                      )

dta_ok <- read.csv( text = s
                   , header=TRUE
                   , na.strings = c( "-99", " -99"
                                   , "-99 ", ""
                                   )
                   )

library(data.table)

fdt_ok <- fread( text = s, na.strings=c( "-99", "" ) )
fdta_ok <- as.data.frame( fdt_ok )
#####

Leading and trailing spaces cause problems. The data.table::fread function
has a strip.white argument that defaults to TRUE, but the resulting object
is a data.table which has different semantics than a data.frame.

On Thu, 14 Nov 2019, Sebastien Bihorel wrote:

> The data file is a csv file. Some text variables contain spaces.
>
> "Check for extraneous spaces"
> Are there specific locations that would be more critical than others?
>
>
> ____________________________________________________________________________
> From: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> Sent: Thursday, November 14, 2019 10:52
> To: Sebastien Bihorel <Sebastien.Bihorel using cognigencorp.com>; Sebastien
> Bihorel via R-help <r-help using r-project.org>; r-help using r-project.org
> <r-help using r-project.org>
> Subject: Re: [R] Can file size affect how na.strings operates in a
> read.table call?
> Check for extraneous spaces. You may need more variations of the na.strings.
>
> On November 14, 2019 7:40:42 AM PST, Sebastien Bihorel via R-help
> <r-help using r-project.org> wrote:
> >Hi,
> >
> >I have this generic function to read ASCII data files. It is
> >essentially a wrapper around the read.table function. My function is
> >used in a large variety of situations and has no a priori knowledge
> >about the data file it is asked to read. Nothing is known about file
> >size, variable types, variable names, or data table dimensions.
> >
> >One argument of my function is na.strings which is passed down to
> >read.table.
> >
> >Recently, a user tried to read a data file of ~ 80 MB (~ 93000 rows by
> >~ 160 columns) using na.strings = c('-99', '.') with the intention of
> >interpreting '.' and '-99'
> >strings as the internal missing data NA. Dots were converted to NA
> >appropriately. However, not all -99 values in the data were interpreted
> >as NA. In some variables, -99 were converted to NA, while in others -99
> >was read as a number. More surprisingly, when the data file was cut in
> >smaller chunks (ie, by dropping either rows or columns) saved in
> >multiple files, the function calls applied on the new data files
> >resulted in the correct conversion of the -99 values into NAs.
> >
> >In all cases, the data frames produced by read.table contained the
> >expected number of records.
> >
> >While, on face value, it appears that file size affects how the
> >na.strings argument operates, I am wondering if there is something else at
> >play here.
> >
> >Unfortunately, I cannot share the data file for confidentiality reasons
> >but was wondering if you could suggest some checks I could perform to
> >get to the bottom of this issue.
> >
> >Thank you in advance for your help and sorry for the lack of
> >reproducible example.
> >
> >
> >______________________________________________
> >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >https://stat.ethz.ch/mailman/listinfo/r-help
> >PLEASE do read the posting guide
> >http://www.R-project.org/posting-guide.html
> >and provide commented, minimal, self-contained, reproducible code.
>
> --
> Sent from my phone. Please excuse my brevity.
>
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil using dcn.davis.ca.us>  Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------