[R] Help troubleshooting silent failure reading huge file with read.delim

David Winsemius dwinsemius at comcast.net
Wed Oct 6 11:30:13 CEST 2010


Apologies for the blank post. Too little caffeine at 5:30 AM.

On Oct 6, 2010, at 3:15 AM, Earl F. Glynn wrote:

>
> I am trying to read a tab-delimited 1.25 GB file of 4,115,119 records,
> each with 52 fields.
>
> I am using R 2.11.0 on a 64-bit Windows 7 machine with 8 GB memory.
>
> I have tried the two following statements with the same results:
>
> d <- read.delim(filename, as.is=TRUE)
>
> d <- read.delim(filename, as.is=TRUE, nrows=4200000)
>
> I have tried starting R with this parameter but that changed nothing:
> --max-mem-size=6GB
>
> Everything appeared to have worked fine until I studied frequency
> counts of the fields and realized data were missing.
>
>> dim(d)
> [1] 3388444      52
>
> R read 3,388,444 records and missed 726,754 records.  There were no
> error messages or exceptions.  I plotted a chart using the data and
> later discovered not all the data were represented in the chart.
>
> R didn't just read the first 3,388,444 records and quit.
>
> Here's what I believe happened (based on frequency counts of the first
> field in the data.frame from R, and independently from another source):
> * R read the first 1,866,296 records and then skipped 419,340 records.
> * Next, R read 1,325,552 records and skipped 307,414 records.
> * R read the last 196,596 records without any problems.
>
> Questions:
>
> Is there some memory-related parameter that I should adjust that might
> explain the observed details above?

Can't think of any.
>
> Shouldn't read.delim catch this failure instead of being silent about
> dropping data?

More likely you have mismatched quotes in your file, so some fields
are accumulating large amounts of text that should have been separate
records. You should tabulate the lengths of your text fields with
nchar-based functions and look for outliers.

>
> Thanks for any help with this.
>
> Earl F Glynn
> Overland Park, KS

-- 

David Winsemius, MD
West Hartford, CT


