[R] Reading newlines with read.table?
Allan Engelhardt
allane at cybaea.com
Fri Jun 4 18:07:36 CEST 2010
I ended up pre-processing the files outside of R using a script along
the lines of:
#!/bin/bash
# Re-encode to UTF-8, drop the BOM, strip in-field linefeeds,
# remove blank lines and the "(N row(s) affected)" noise, then compress.
for f in *_table_extract_*.txt; do
    echo -n "Processing $f..."
    o="${f}.xz"
    iconv -f "UTF-16LE" -t "UTF-8" "$f" | \
        tail -c +4 | \
        perl -l012 -015 -pe 's/\n//g' | \
        perl -ne 'print if (!m{\A \( \d+ \s row\(s\) \s affected \) \s*
                            \z}ixms && !m{\A \s* \z}xms)' | \
        xz -7 > "$o"
    echo "done."
done
Ugly, but it worked for me. You can change the first perl substitution
to treat line-terminating \n differently from in-field \n characters,
but I just dropped them all. The tail command drops the byte-order mark
(which we do not need for UTF-8), and the second perl command drops
blank lines and the "(N row(s) affected)" noise from the SQL export tool.
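Reading one of the cleaned files back into R is then straightforward. A
minimal sketch, assuming your data has a header row and that quotes and
'#' characters should be treated as data (the file name is made up):

con <- xzfile("foo_table_extract_1.txt.xz", encoding = "UTF-8")
dat <- read.table(con, sep = "@", header = TRUE, quote = "",
                  comment.char = "", stringsAsFactors = FALSE)

read.table() takes the connection directly and decompresses on the fly;
turning off quote and comment.char is usually the right thing for raw
database extracts.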
Thanks to Prof. Brian Ripley who, essentially, pointed out that with
embedded linefeed characters my file was a binary file and not really a
text file. Her Majesty's government respectfully begs to disagree [1],
but that's the R definition, so we'll use it on this list.
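If you prefer to stay inside R, the same clean-up can be sketched at the
byte level: read the raw bytes, convert with iconv(), split on the real
CRLF record terminators, and drop the in-field linefeeds. An untested
sketch, where "foo.txt" is a placeholder and the whole file is assumed
to fit in memory:

size  <- file.info("foo.txt")$size
bytes <- readBin("foo.txt", what = "raw", n = size)
if (identical(bytes[1:2], as.raw(c(0xff, 0xfe))))
    bytes <- bytes[-(1:2)]                        # drop the UTF-16LE BOM
txt  <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
recs <- strsplit(txt, "\r\n", fixed = TRUE)[[1]]  # real record terminators
recs <- gsub("\n", "", recs, fixed = TRUE)        # drop in-field linefeeds
recs <- recs[!grepl("^\\([0-9]+ row\\(s\\) affected\\)[[:space:]]*$", recs) &
             !grepl("^[[:space:]]*$", recs)]      # SQL noise and blanks
dat  <- read.table(text = recs, sep = "@", quote = "",
                   stringsAsFactors = FALSE)

For multi-gigabyte extracts the shell pipeline above is still the better
tool, since it streams instead of holding everything in memory.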
Allan
[1] Original data sets described at
http://www.hm-treasury.gov.uk/psr_coins_data.htm and downloaded from
http://data.gov.uk/dataset/coins (hint: you'll need p7zip to unpack them
on a Linux box).
On 04/06/10 14:49, Allan Engelhardt wrote:
> I have a text file that is UTF-16LE encoded with CRLF line endings and
> '@' as field separators that I want to read in R on a Linux system.
> Which would be fine as
>
> read.table("foo.txt", fileEncoding = "UTF-16LE", sep = "@", ...)
>
> *except* that the data may contain the LF character which R treats as
> end-of-line and then barfs that there are too few elements on that line.
>
> Any suggestions for how to process this one efficiently in R? [...]