[R] Diagnostic and helper functions for defective & hard-to-import files

Wed Jan 29 08:12:38 CET 2014

On Jan 28, 2014 at 8:56pm, David Winsemius wrote:

On Jan 28, 2014, at 8:43 PM, andrewH wrote:

> Hi Folks!
> I have been writing a small set of utilities for dealing with files that
> are hard to open correctly for one reason or another, especially because
> they are too big for memory, non-rectangular, or contain odd characters or
> unexpected codings, or all of these things together. Today it suddenly hit
> me that this has probably been done, done better, and upgraded to package
> form a dozen times already. There were pointers to a couple functions
> useful in this regard in the Core Import/Export document.  But my effort
> to come up with search terms that were productive of such packages was
> unsuccessful.

I don't know of a package to do that. You know the quote from that Russian
author whose name I am forgetting (in "Anna Karenina" perhaps) about happy
families being all the same but unhappy families being impossible to
classify. I think it applies to datasets as well. There are too many
different dataset pathologies to allow a neat packaging approach.

My approach has been to study the options in read.table very carefully and
if that is insufficient look at either readLines or scan as options. It is
very useful to be able to use `count.fields` with different settings of its
`quote` and `comment.char` parameters. Wrapping it in table() can deliver a
very compact, useful result.
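
As a rough illustration of the count.fields/table() idea, here is a minimal
sketch. The file contents and the variable names (tmp, n_quoted, n_raw, bad)
are made up for the example, and it assumes a comma-separated file with no
blank lines, so that positions in the count vector match line numbers.

```r
# Hypothetical example file: a 3-column header, one row with a quoted
# comma, one row that is too short and one that is too long.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,value",
  "1,alpha,10",
  "2,\"beta, with comma\",20",
  "3,gamma",
  "4,delta,40,extra"
), tmp)

# Fields per line when double quotes protect embedded separators.
n_quoted <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")
print(table(n_quoted))

# Fields per line with quoting switched off; a different tally here
# suggests that quoting matters for this file.
n_raw <- count.fields(tmp, sep = ",", quote = "", comment.char = "")
print(table(n_raw))

# Lines whose field count differs from the header's, pulled up for a look.
# (Assumes no blank lines, which count.fields skips by default.)
bad <- which(n_quoted != n_quoted[1])
print(readLines(tmp)[bad])
```

Comparing the two tables is the point: if they differ, the quote setting is
changing how lines are split, and the rows listed in bad are the ones worth
inspecting by eye.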

And don't forget to search the Archives if you have a regular but
non-rectangular arrangement.

David Winsemius
Alameda, CA, USA 

Thanks, David! 

You have quickly summarized a set of techniques that it took me a long time
to learn (much of it spent disentangling the truth from various
misconceptions about the data-reading process). I don't think I have very
much to add to your list, but as always, the effectiveness depends on
correct implementation, and I have made a _lot_ of mistakes in trying to
implement these in the past. Moreover, all these things become much more
complicated if the file is too big to just read into a data frame. I am
working with Census records right now, and my primary data file is a 14 GB
csv that had me tearing my hair out trying to read it and pull out the
variables I have needed at any given moment.
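
For the too-big-for-memory case, one possible approach is to read the file
in chunks through an open connection and keep only the wanted columns from
each chunk. The sketch below is not the procedure I actually used; the
function name read_columns_chunked, its arguments, and the toy file are all
hypothetical. It assumes a plain comma-separated file with a header line and
no quoted fields containing line breaks, and it reads every column as
character so that type guessing cannot differ between chunks.

```r
# Hypothetical helper: read a large CSV chunk by chunk and keep only the
# columns named in `keep`. Returns NULL if the file has no data rows.
read_columns_chunked <- function(path, keep, chunk_rows = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header once; it is re-attached to every chunk below.
  header <- readLines(con, n = 1L)
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break  # end of file reached
    chunk <- read.csv(text = c(header, lines),
                      colClasses = "character",
                      stringsAsFactors = FALSE)
    # Only the requested columns are retained in memory.
    pieces[[length(pieces) + 1L]] <- chunk[, keep, drop = FALSE]
  }
  do.call(rbind, pieces)
}

# Toy demonstration on a small made-up file, using a tiny chunk size.
demo_file <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,value",
  "1,a,10", "2,b,20", "3,c,30", "4,d,40", "5,e,50"
), demo_file)
res <- read_columns_chunked(demo_file, keep = c("id", "value"),
                            chunk_rows = 2L)
print(res)
```

On a real multi-gigabyte file the chunk size would be much larger, and the
kept columns would need converting from character afterwards.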

I finally did get it read and the right subset extracted, but it was a
pretty empirical process - I would just keep trying things that didn't work
until I found something that did, often not quite understanding why my
previous efforts had failed.  I know that if I have to do this again six
months from now I will have no idea how I did it. So I wanted to reduce the
things that worked to functions and set up a sort of decision tree that I
could work through to find and correct at least the more common problems.
But I was hoping -- am still hoping, actually -- to find that someone else
has already done this so I could get back to my real work. It seems like the
sort of thing that could easily be buried in the 100+ pages of documentation
of one of the big utility packages like Hmisc, MASS or car. 

I have often wished there was a data manipulation and import/export task
view, with a purview to cover things like what I am talking about here, the
contents of Phil Spector's book, and packages like Hadley Wickham's plyr. 

Warmest regards, andrewH
