[R] Errors in data frames from read.table

(Ted Harding) ted.harding at nessie.mcc.ac.uk
Mon Jul 16 17:00:26 CEST 2007


On 16-Jul-07 14:13:09, Pat Carroll wrote:
> Hello, all.
> 
> I am working on a project with a large (~350Mb, about 5800 rows)
> insurance claims dataset. It was supplied in a tilde(~)-delimited
> format. I imported it into a data frame in R by setting memory.limit to
> maximum (4Gb) for my computer and using read.table. 

I had a similar problem put to me some time back, and eventually
solved it by going in with a scalpel. It turned out that there
was a problem with "End-of-Line" being muddled with the field
delimiter when the file was created. And the file came out of Excel
in the first place ... (did yours?). Quite why Excel made this
particular mess of it remains a mystery.
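
A quick way to look for rows of that kind, offered only as a sketch:
count.fields() reports how many fields each line has for a given
delimiter, so lines that disagree with the header stand out. The small
temporary file here is made up for illustration (it is not your data),
and switching off quote and comment handling is my assumption about
what a raw "~"-delimited file needs.

```r
# Build a small tilde-delimited file. The record for id 2 has a stray
# line break in the middle, standing in for a muddled End-of-Line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id~name~amount",
             "1~Smith~100",
             "2~Jon",
             "es~250",
             "3~Brown~75"), tmp)

# Fields per line. quote = "" and comment.char = "" make ' " and #
# ordinary characters; blank.lines.skip = FALSE keeps the result
# aligned with the line numbers of the file.
n <- count.fields(tmp, sep = "~", quote = "", comment.char = "",
                  blank.lines.skip = FALSE)

table(n)                  # distribution of field counts
bad <- which(n != n[1])   # lines that disagree with the header line
bad
readLines(tmp)[bad]       # inspect the offending lines themselves
```

On a real file you would replace tmp with the path to the file and
then examine the lines listed in bad in a text editor.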

I note that your file size is "350Mb" and "about 5800 rows".
Doing some arithmetic on that:

350 * 1024 * 1024 = 367,001,600 bytes

367001600/5800 = 63276.14 bytes per row.

This (given your "about"s) looks to me dangerously close to
65536 = 64K, and this may be a limit on what Excel can handle?
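
The same arithmetic as a few lines of R, in case you want to redo it
with the exact figures. This is only a sketch: it takes 1 Mb to mean
1024 * 1024 bytes as above, and the variable names are mine.

```r
size_mb <- 350     # quoted file size in Mb
n_rows  <- 5800    # quoted ("about") number of rows

bytes_total   <- size_mb * 1024 * 1024
bytes_per_row <- bytes_total / n_rows

bytes_total               # 367001600
round(bytes_per_row, 2)   # 63276.14
bytes_per_row / 65536     # average row size as a fraction of 64K
```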

Just a thought ...
Ted.

> The resulting data frame had 10 bad rows. The errors appear due to
> read.table missing delimiter characters, with multiple data being
> imported into the same cell, then the remainder of the row and the next
> run together and garbled due to the reading frame shift (example: a
> single cell might contain: <datum>~ ~ <datum> ~<datum>, after which all
> the cells of the row and the next are wrong). 
> 
> To replicate, I tried the same import procedure on a smaller
> demographics data set from the same supplier (only about 1Mb), and got
> the same kinds of errors (5 bad rows in about 3500). I also imported as
> much of the file as Excel would hold and cross-checked; Excel did not
> produce the same errors but can't handle the entire file. I have used
> read.table on a number of other formats (mainly csv and tab-delimited)
> without such problems; so far it appears there's something different
> about these files that produces the errors but I can't see what it
> would be.
> 
> Does anyone have any thoughts about what is going wrong? And is there a
> way, short of manual correction, for fixing it?
> 
> Thanks for all help,
> ~Pat.
> 
> 
> Pat Carroll. 
> what matters most is how well you walk through the fire. 
> bukowski.
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Jul-07                                       Time: 15:59:48
------------------------------ XFMail ------------------------------


