[R] How can I find nonstandard or control characters in a large file?

Earl F Glynn efglynn at gmail.com
Tue Dec 10 16:27:02 CET 2013


andrewH wrote:

> However, my suspicion is that there are some funky characters, either
> control characters or characters with some non-standard encoding, somewhere
> in this 14 gig file. Moreover, I am concerned that these characters may
> cause me trouble down the road even if I use a different approach to getting
> columns out of the file.

This is not an R solution, but here's a Windows utility I wrote to 
produce a table of frequency counts for all byte values x00 to xFF in 
a file.

http://www.efg2.com/Lab/OtherProjects/CharCount.ZIP

Normally, you'll want to scrutinize anything below x20 or above x7E, 
since ASCII printable characters are in the range x20 to x7E. You can 
see how many tab (x09) characters are in the file, and whether the line 
endings are from Linux (x0A alone) or Windows (paired x0D and x0A).
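
For anyone without Windows, the same kind of table can be sketched in a 
few lines of Python. This is only an illustration of the idea, not the 
CharCount source; the function names count_bytes and suspicious are made 
up for this example, and it assumes tab, LF and CR are expected in the 
file.

```python
from collections import Counter

def count_bytes(path, chunk_size=1 << 20):
    """Return a Counter mapping each byte value (0-255) to its frequency."""
    counts = Counter()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a multi-gigabyte file never sits in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            counts.update(chunk)  # iterating bytes yields ints 0-255
    return counts

def suspicious(counts):
    """Keep bytes outside printable ASCII (x20-x7E), ignoring tab, LF, CR."""
    expected = {0x09, 0x0A, 0x0D}  # assumption: these are legitimate here
    return {b: n for b, n in counts.items()
            if (b < 0x20 or b > 0x7E) and b not in expected}
```

Comparing counts[0x0D] with counts[0x0A] gives a quick hint about line 
endings: roughly equal counts suggest Windows, while zero x0D suggests 
Linux.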


The ZIP includes the Delphi source code as well as a Windows executable. 
Just run the EXE after unzipping; no installation is needed. I made a 
change several months ago to allow drag-and-drop, so you can just drop 
the file on the application to have the characters counted.

Once you find problem characters in the file, you can read the file as 
character data and use sub/gsub or other tools to remove or alter 
them.
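
With a 14 gig file, reading everything into memory may not be practical, 
so one alternative is to filter the file in chunks before it reaches R. 
The sketch below is a rough Python example, not part of CharCount; the 
name clean_file is hypothetical, and it assumes the file is meant to be 
plain ASCII, since it also drops every byte above x7E and would therefore 
destroy legitimate non-ASCII text such as UTF-8.

```python
def clean_file(src, dst, chunk_size=1 << 20):
    """Copy src to dst, dropping bytes outside printable ASCII.

    Tab (x09), LF (x0A) and CR (x0D) are kept; that choice is an assumption.
    """
    keep = {0x09, 0x0A, 0x0D}
    # Every byte value that is a control character or above x7E, minus `keep`.
    drop = bytes(b for b in range(256)
                 if (b < 0x20 or b > 0x7E) and b not in keep)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            # translate(None, drop) deletes the listed bytes, changes nothing else.
            fout.write(chunk.translate(None, drop))
```

Because bytes are deleted one at a time, chunk boundaries cannot split a 
match. Replacing rather than deleting would need a 256-byte translation 
table in place of None.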

efg
Earl F Glynn
UMKC School of Medicine
Center for Health Insights
