[R] Errors in data frames from read.table

Don MacQueen macq at llnl.gov
Mon Jul 16 17:09:42 CEST 2007


Whenever I've had this kind of problem it has been one of two things:
    the input data file is "corrupt", by which I mean not all lines
    have the same number of fields
    I have misspecified one of the arguments to read.table()
    (usually comment.char or quote)

Use count.fields() on an offending file to find out whether all records 
have the same number of fields. If they don't, look carefully at the 
ones that differ to see how they depart from the assumption that all 
rows have the same number of delimiters. Check for "non-standard" 
characters such as control-character sequences.
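
As a rough sketch of that check (the sample file and its contents are made up for illustration, and I am assuming the first line is a well-formed header), something along these lines should list the records whose field count differs:

```r
# Hypothetical sample standing in for one of the tilde-delimited files.
path <- tempfile(fileext = ".txt")
writeLines(c("id~name~amount",
             "1~Smith~100",
             "2~O'Brien~250",
             "3~Jones~75~extra",
             "4~Lee"),
           path)

# quote = "" and comment.char = "" turn off quote and comment handling,
# so every tilde is counted as a delimiter.
n <- count.fields(path, sep = "~", quote = "", comment.char = "")

table(n)                      # distribution of field counts
expected <- n[1]              # assumption: the header line is well-formed
bad <- which(n != expected)   # records that disagree with the header
bad
```

Note that count.fields() skips blank lines by default, so the positions in `bad` count non-blank records rather than raw line numbers if the file contains empty lines.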

Don't know what you mean by "read.table missing delimiter 
characters". If the delimiters are there, read.table will see them. 
But if they're inside quotes (the 'quote' argument of read.table) or 
after a comment character (the 'comment.char' argument), for example, 
I wouldn't expect them to be interpreted as delimiters.
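
To illustrate (the sample records here are invented, not taken from your files), a field with an apostrophe or a '#' in it is the sort of thing that can trip the default settings. Turning both features off, as sketched below, leaves the tilde as the only special character:

```r
# Hypothetical sample: one field contains an apostrophe, another a '#'.
txt <- c("id~name~note",
         "1~O'Brien~paid",
         "2~Smith~unit #4")

# With quote = "" and comment.char = "" nothing but '~' is special,
# so every tilde is treated as a delimiter.
d <- read.table(text = txt, sep = "~", header = TRUE,
                quote = "", comment.char = "",
                stringsAsFactors = FALSE)
str(d)

# For comparison: field counts under the default quote and comment.char
# settings of count.fields().
tc <- textConnection(txt)
print(count.fields(tc, sep = "~"))
close(tc)
```

If the counts from the last call differ from three per line, the quote or comment character is the likely culprit rather than the delimiters themselves.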

If you were to edit one of the data files outside of R, changing the 
delimiters from tilde to something else, maybe TAB, and find that it 
reads correctly, then there might be an issue with read.table(). 
Unlikely, though.
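
The delimiter swap can also be done from within R. This is only a sketch on a made-up sample file, and it assumes the data contain no TAB characters already:

```r
# Hypothetical sample standing in for one of the tilde-delimited files.
src <- tempfile(fileext = ".txt")
writeLines(c("id~name~amount",
             "1~Smith~100",
             "2~Jones~250"),
           src)

# Swap every tilde for a TAB and write the result to a new file.
tab_file <- tempfile(fileext = ".tsv")
writeLines(gsub("~", "\t", readLines(src), fixed = TRUE), tab_file)

# Read both versions with otherwise identical settings and compare.
a <- read.table(src, sep = "~", header = TRUE,
                quote = "", comment.char = "")
b <- read.table(tab_file, sep = "\t", header = TRUE,
                quote = "", comment.char = "")
identical(a, b)
```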

If you can find the offending rows, put them into a separate file 
and open it in Excel or in a text editor that shows everything; 
maybe the problem will become obvious.
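
If it helps, here is one possible way to pull suspect rows out from within R and look at their raw bytes. The sample file, the control character planted in it, and the output file name are all hypothetical:

```r
# Hypothetical sample; the second record hides a control character (\x01).
path <- tempfile(fileext = ".txt")
writeLines(c("id~name~amount",
             "1~Smith~100",
             "2~Jo\x01nes~250"),
           path)

lines <- readLines(path)

# Flag lines containing any control character (this includes TAB, which
# would also be unexpected in a tilde-delimited file).
suspect <- which(grepl("[[:cntrl:]]", lines))
suspect

# Put the suspect rows into a separate file for inspection elsewhere.
out <- file.path(tempdir(), "suspect_rows.txt")
writeLines(lines[suspect], out)

# Show the raw bytes of the first suspect row; odd values stand out.
charToRaw(lines[suspect[1]])
```

The same subsetting works with line numbers obtained some other way, for example from count.fields().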

-Don

At 7:13 AM -0700 7/16/07, Pat Carroll wrote:
>Hello, all.
>
>I am working on a project with a large (~350Mb, about 5800 rows) 
>insurance claims dataset. It was supplied in a tilde(~)-delimited 
>format. I imported it into a data frame in R by setting memory.limit 
>to maximum (4Gb) for my computer and using read.table.
>
>The resulting data frame had 10 bad rows. The errors appear due to 
>read.table missing delimiter characters, with multiple data being 
>imported into the same cell, then the remainder of the row and the 
>next run together and garbled due to the reading frame shift 
>(example: a single cell might contain: <datum>~ ~ <datum> ~<datum>, 
>after which all the cells of the row and the next are wrong).
>
>To replicate, I tried the same import procedure on a smaller 
>demographics data set from the same supplier- only about 1Mb, and 
>got the same kinds of errors (5 bad rows in about 3500). I also 
>imported as much of the file as Excel would hold and cross-checked; 
>Excel did not produce the same errors but can't handle the entire 
>file. I have used read.table on a number of other formats (mainly 
>csv and tab-delimited) without such problems; so far it appears 
>there's something different about these files that produces the 
>errors but I can't see what it would be.
>
>Does anyone have any thoughts about what is going wrong? And is 
>there a way, short of manual correction, for fixing it?
>
>Thanks for all help,
>~Pat.
>
>
>Pat Carroll.
>what matters most is how well you walk through the fire.
>bukowski.
>


-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062


