[R] Read Unicode text (*.txt)

William Dunlap wdunlap at tibco.com
Tue Jul 2 05:39:38 CEST 2019


If I recall correctly, Excel's 'Unicode' used to mean "UTF-16", which R's
scan() did not recognize without a hint.  The relevant argument is
fileEncoding, not encoding.  UTF-16 files of mostly ASCII text are full of
null bytes, whereas UTF-8 text files normally contain none, so reading
UTF-16 as if it were UTF-8 produces the embedded-null warning.

I don't have Excel installed, but the following example is from R-3.5.2 on
a Linux box.

> f8 <- file(tf8 <- tempfile(), open="w", encoding="UTF-8")
> cat("\u0416;zh\n", file=f8); close(f8)
> readBin(tf8, what="raw", n=file.size(tf8))
[1] d0 96 3b 7a 68 0a
>
> f16 <- file(tf16 <- tempfile(), open="w", encoding="UTF-16")
> cat("\u0416;zh\n", file=f16); close(f16)
> readBin(tf16, what="raw", n=file.size(tf16))
 [1] ff fe 16 04 3b 00 7a 00 68 00 0a 00
>
> read.csv(tf8, sep=";", header=FALSE)
  V1 V2
1  Ж zh
> read.csv(tf16, sep=";", header=FALSE)
Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec,  :
  invalid multibyte string at '<ff><fe>'
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on
'/tmp/RtmpzfG6eG/file40e53389f40e'
> read.csv(tf16, sep=";", header=FALSE, fileEncoding="UTF-16")
  V1 V2
1  Ж zh
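
If the encoding is not known in advance, the byte-order mark seen in the
readBin() output above can be used as a hint.  The following is only a
sketch: guess_file_encoding() is a hypothetical helper, not part of R, and it
assumes that a file without a BOM is UTF-8, which may well be wrong for other
files (it also ignores UTF-32).  Whether "UTF-16" honours the BOM on input
depends on the iconv implementation; it did in the Linux session above.

```r
# Hypothetical helper: guess a fileEncoding value from the leading bytes.
guess_file_encoding <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  starts_with <- function(bytes) {
    length(first_bytes) >= length(bytes) &&
      all(first_bytes[seq_along(bytes)] == as.raw(bytes))
  }
  if (starts_with(c(0xef, 0xbb, 0xbf))) {
    "UTF-8-BOM"  # UTF-8 with a byte-order mark
  } else if (starts_with(c(0xff, 0xfe)) || starts_with(c(0xfe, 0xff))) {
    "UTF-16"     # little- or big-endian BOM; iconv is left to pick the order
  } else {
    "UTF-8"      # assumption: no BOM means UTF-8
  }
}

# Recreate the UTF-16 file from the example, byte for byte.
tf16 <- tempfile()
writeBin(as.raw(c(0xff, 0xfe, 0x16, 0x04, 0x3b, 0x00,
                  0x7a, 0x00, 0x68, 0x00, 0x0a, 0x00)), tf16)
guess_file_encoding(tf16)  # "UTF-16"
```

The result can then be passed on, as in
read.csv(tf16, sep=";", header=FALSE, fileEncoding=guess_file_encoding(tf16)).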
Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Mon, Jul 1, 2019 at 8:12 PM Abby Spurdle <spurdle.a using gmail.com> wrote:

> > Don't be so US-centric, Abby... how do you know that javad's version of
> > Excel doesn't default to using semicolons?
>
> I don't.
>
> However, Comma-Separated Values (CSV) are, comma separated, by definition.
> So, if the files use semicolons, then...
>
> Also, the use of the wrong sep="my.delim" argument is the most likely cause
> of single column output.
>
> However, you're right, I don't really know, I'm just guessing...