[R] How can I find nonstandard or control characters in a large file?
Earl F Glynn
efglynn at gmail.com
Tue Dec 10 16:27:02 CET 2013
andrewH wrote:
> However, my suspicion is that there are some funky characters, either
> control characters or characters with some non-standard encoding, somewhere
> in this 14 gig file. Moreover, I am concerned that these characters may
> cause me trouble down the road even if I use a different approach to getting
> columns out of the file.
This is not an R solution, but here's a Windows utility I wrote to
produce a table of frequency counts for all byte values x00 to xFF in
a file.
http://www.efg2.com/Lab/OtherProjects/CharCount.ZIP
Normally, you'll want to scrutinize anything below x20 or above x7E,
since ASCII printable characters are in the range x20 to x7E. You can
see how many tab (x09) characters are in the file, and whether the line
endings are from Linux (x0A alone) or Windows (paired x0D and x0A).
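For anyone who cannot run a Windows executable, the same kind of table can be approximated with a short script. The following is only a minimal sketch in Python, not the CharCount source; the function names count_bytes, suspicious, and report are made up for illustration. It reads the file in binary chunks so a very large file never has to fit in memory, though pure Python will be slow on many gigabytes.

```python
from collections import Counter

def count_bytes(path, chunk_size=1 << 20):
    """Return a Counter mapping each byte value (0-255) to its frequency."""
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # read 1 MiB at a time
            if not chunk:
                break
            counts.update(chunk)         # iterating bytes yields ints 0-255
    return counts

def suspicious(counts):
    """Byte values outside printable ASCII (x20-x7E), ignoring tab, LF, CR."""
    expected = {0x09, 0x0A, 0x0D}
    return {b: n for b, n in sorted(counts.items())
            if (b < 0x20 or b > 0x7E) and b not in expected}

def report(path):
    """Print tab and line-ending counts, then every suspicious byte value."""
    counts = count_bytes(path)
    print(f"tab (x09): {counts[0x09]}")
    print(f"LF (x0A): {counts[0x0A]}   CR (x0D): {counts[0x0D]}")
    for b, n in suspicious(counts).items():
        print(f"x{b:02X}: {n}")
```

If the CR and LF counts match, the file most likely uses Windows line endings throughout; a CR count of zero suggests Linux line endings.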
The ZIP includes the Delphi source code as well as a Windows executable.
I made a change several months ago to allow drag-and-drop, so you can
just drop the file on the application to have the characters counted.
Just run the EXE after unzipping. No installation is needed.
Once you find problem characters in the file, you can read the file as
character data and use sub/gsub or other tools to remove or alter
the problem characters.
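As one example of the "other tools" route, here is a rough Python sketch that copies a file while dropping unwanted bytes, which avoids loading a multi-gigabyte file into memory. The function name strip_bytes and the default of removing only NUL (x00) are assumptions for illustration; pass whichever byte values your own frequency table flagged.

```python
def strip_bytes(src, dst, bad=b"\x00", chunk_size=1 << 20):
    """Copy src to dst, dropping every byte value listed in `bad`."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)   # stream in 1 MiB pieces
            if not chunk:
                break
            # translate(None, delete=...) removes the listed bytes
            # and leaves all other bytes unchanged
            fout.write(chunk.translate(None, delete=bad))
```

Deleting single byte values is safe across chunk boundaries. Replacing multi-byte sequences (for example a mis-encoded character) would need extra care, because a sequence could be split between two chunks.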
efg
Earl F Glynn
UMKC School of Medicine
Center for Health Insights