[R] Can file size affect how na.strings operates in a read.table call?

Thu Nov 14 16:40:42 CET 2019

Hi,

I have this generic function to read ASCII data files. It is essentially a wrapper around the read.table function. My function is used in a large variety of situations and has no a priori knowledge about the data file it is asked to read. Nothing is known about file size, variable types, variable names, or data table dimensions.

One argument of my function is na.strings which is passed down to read.table.
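
For context, the pass-through looks roughly like the sketch below. The name read_ascii, the tiny demo file, and header = TRUE are hypothetical stand-ins; the real function has more arguments and does more work.

```r
# Hypothetical stand-in for the wrapper described above.
read_ascii <- function(file, na.strings = "NA", ...) {
  # na.strings is forwarded unchanged to read.table
  read.table(file, na.strings = na.strings, ...)
}

# Small self-contained demonstration on a temporary file (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id x y",
             "1 -99 .",
             "2 3.5 7"), tmp)

dat <- read_ascii(tmp, header = TRUE, na.strings = c("-99", "."))
str(dat)
```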

Recently, a user tried to read a data file of ~80 MB (~93,000 rows by ~160 columns) using na.strings = c('-99', '.'), with the intention of interpreting the '.' and '-99' strings as the internal missing value NA. Dots were converted to NA appropriately. However, not all -99 values in the data were interpreted as NA: in some variables, -99 was converted to NA, while in others it was read as a number. More surprisingly, when the data file was cut into smaller chunks (i.e., by dropping either rows or columns) and saved in multiple files, calling the function on the new data files resulted in the correct conversion of the -99 values into NAs.

In all cases, the data frames produced by read.table contained the expected number of records.

While, on face value, it appears that file size affects how the na.strings argument operates, I am wondering whether something else is at play here.

Unfortunately, I cannot share the data file for confidentiality reasons, but I was wondering if you could suggest some checks I could perform to get to the bottom of this issue.
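
One check I can imagine, purely as a guess, is to read every column as character and look for fields that are numerically equal to -99 but are not written exactly as "-99" (for example "-99.0"), since na.strings is compared against the text of each field. The sketch below uses a made-up file, and the variable names are hypothetical; I do not know yet whether this is what happens in the real data.

```r
# Made-up example file: column b contains "-99.0" as well as "-99"
tmp <- tempfile(fileext = ".txt")
writeLines(c("a b",
             "-99 -99.0",
             "1 2.5",
             "-99 -99"), tmp)

# Read everything as text so the original spelling of each field is kept
raw <- read.table(tmp, header = TRUE, colClasses = "character")

# For each column, list spellings that equal -99 numerically
# but differ from the literal string "-99"
suspect <- lapply(raw, function(col) {
  num <- suppressWarnings(as.numeric(col))
  unique(col[!is.na(num) & num == -99 & col != "-99"])
})
suspect <- suspect[lengths(suspect) > 0]
print(suspect)

# Same file read the usual way, for comparison
dat <- read.table(tmp, header = TRUE, na.strings = c("-99", "."))
print(dat)
```

If suspect comes back non-empty on the real file, the columns it names would be the ones to inspect first.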

Thank you in advance for your help, and sorry for the lack of a reproducible example.