[R] How can I find nonstandard or control characters in a large file?

Enrico Schumann es at enricoschumann.net
Tue Dec 10 08:11:21 CET 2013


On Mon, 09 Dec 2013, andrewH <ahoerner at rprogress.org> writes:

> I have a humongous csv file containing census data, far too big to read into
> RAM. I have been trying to extract individual columns from this file using
> the colbycol package. This works for certain subsets of the columns, but not
> for others. I have not yet been able to precisely identify the problem
> columns, as there are 731 columns and running colbycol on the file on my old
> slow machine takes about 6 hours. 
>
> However, my suspicion is that there are some funky characters, either
> control characters or characters with some non-standard encoding, somewhere
> in this 14 gig file. Moreover, I am concerned that these characters may
> cause me trouble down the road even if I use a different approach to getting
> columns out of the file.
>
> Is there an R utility that will search through my file without trying to read it
> all into memory at one time and find non-standard characters or misplaced
> (non-end-of-line) control characters? Or some R code to the same end?  Even
> if the real problem ultimately proves to be different, it would be helpful
> to eliminate this possibility. And this is also something I would routinely
> run on files from external sources if I had it. 
>
> I am working in a Windows XP environment, in case that makes a difference.
>
> Any help anyone could offer would be greatly appreciated.
>
> Sincerely, andrewH

You could process your file in chunks:

  f <- file("myfile.csv", open = "r")
  repeat {
      lines <- readLines(f, n = 10000)
      if (length(lines) == 0L)  ## end of file reached
          break
      ## do something with lines
  }
  close(f)

To find 'non-standard characters' you will first need to define what
'non-standard characters' are.  But perhaps ?tools::showNonASCII, which
uses ?iconv, can help you.  (Please note the warnings and caveats on the
functions' help pages.)
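
As a rough sketch of how the chunked reading and the character check
could be combined: the function below is hypothetical (its name, its
arguments and the returned columns are my own choices, not part of any
package), and it assumes that 'non-standard' means any byte outside
7-bit ASCII and that 'misplaced control characters' means control bytes
other than tab, line feed and carriage return.  It uses a byte-level
regular expression rather than iconv, so it says nothing about which
encoding the odd bytes belong to.

```r
## Hypothetical helper: scan a text file chunk by chunk and report the
## line numbers that contain non-ASCII bytes or unexpected control bytes.
scan_for_odd_chars <- function(path, chunk_size = 10000L) {
    con <- file(path, open = "r")
    on.exit(close(con))
    found <- list()
    offset <- 0  ## lines read so far (double, to avoid integer overflow)
    repeat {
        lines <- readLines(con, n = chunk_size)
        if (length(lines) == 0L)
            break
        ## useBytes = TRUE matches raw bytes, so input that is invalid
        ## in the current locale does not stop the search
        non_ascii <- grepl("[\\x80-\\xFF]", lines,
                           perl = TRUE, useBytes = TRUE)
        ## control bytes except tab (09), LF (0A) and CR (0D)
        control <- grepl("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", lines,
                         perl = TRUE, useBytes = TRUE)
        bad <- which(non_ascii | control)
        if (length(bad) > 0L)
            found[[length(found) + 1L]] <-
                data.frame(line = offset + bad,
                           non_ascii = non_ascii[bad],
                           control = control[bad])
        offset <- offset + length(lines)
    }
    if (length(found) == 0L)
        return(data.frame(line = numeric(0),
                          non_ascii = logical(0),
                          control = logical(0)))
    do.call(rbind, found)
}
```

Calling scan_for_odd_chars("myfile.csv") would return one row per
suspicious line, which could then be re-read and inspected, for example
with tools::showNonASCII.  Only one chunk is held in memory at a time,
so the file size should not matter beyond the running time.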


-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net


