[R] File coding problem: how to read a windows-1252 encoded file

peter dalgaard pdalgd at gmail.com
Tue May 13 16:10:47 CEST 2014


Hi Bob, Long time no see.

The following works for me. In general, I think it is tricky to rely on encodings to be passed on to the appropriate agent, so try to be as specific as possible about it.

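# Declare the encoding on the connection itself, where the bytes are read.
con <- url("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt",
           encoding = "Latin1")
SpCodes <- read.fwf(con,
                    widths = c(7, 6, 51, 51), skip = 6, n = 5, header = FALSE,
                    stringsAsFactors = FALSE)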

AFAICT, the root cause is that encoding= is passed by read.fwf() to read.table(), once the columns are split out, but not to the file connection used to get the data for splitting.

It also worked to get the whole enchilada using readLines(), convert it with iconv(), and then run read.fwf() on a textConnection to the converted lines.
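
For the record, that second route can be sketched roughly as below. This is a minimal, self-contained sketch rather than the exact code I ran: the two sample records, the temporary file and the widths c(5, 40) are made up for illustration, and "latin1" is assumed as the source encoding (for a windows-1252 file the name to pass to iconv() depends on the platform).

```r
# Hypothetical sample: two fixed-width records (5-char code, then a name),
# written to a temporary file as Latin-1 bytes to mimic the downloaded file.
sample_utf8 <- c("01770Dendrocygne \u00e0 ventre noir",
                 "01780Dendrocygne fauve")
tmp <- tempfile(fileext = ".txt")
writeLines(iconv(sample_utf8, from = "UTF-8", to = "latin1"), tmp,
           useBytes = TRUE)

# 1. Read the raw lines; no encoding is declared at this point.
lines_raw <- readLines(tmp, warn = FALSE)

# 2. Convert explicitly from the file's (assumed) encoding to UTF-8.
lines_utf8 <- iconv(lines_raw, from = "latin1", to = "UTF-8")

# 3. Split the converted lines into fixed-width columns.
tc <- textConnection(lines_utf8)
sp <- read.fwf(tc, widths = c(5, 40), header = FALSE,
               stringsAsFactors = FALSE)
close(tc)     # read.fwf() does not close a connection that was already open
unlink(tmp)
```

The point of the detour is that the conversion happens before read.fwf() starts cutting the lines with substring(), so the splitting only ever sees valid strings.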

And, BTW, even though encoding names vary between platforms, "ISO-8859" is almost surely wrong, because it names a whole family of encodings ("ISO-8859-1", "ISO-8859-2", ...) rather than a single one.
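
One way to check which names your platform accepts is to ask iconvlist(). The sketch below is only a suggestion: the search pattern is my own choice, and on some platforms iconvlist() may return nothing useful, in which case the result is simply empty.

```r
# List the encoding names known to this build's iconv, where available.
enc <- iconvlist()

# Candidate names for Latin-1 and Windows-1252 on this platform.
candidates <- grep("8859-1$|1252$", enc, value = TRUE, ignore.case = TRUE)
print(candidates)

# Check whether the bare family name is accepted as an encoding name here.
bare_ok <- "ISO-8859" %in% toupper(enc)
print(bare_ok)
```

Any name that shows up in candidates should be safe to pass as encoding= on a connection or as from= in iconv() on that machine.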

- Peter


On 13 May 2014, at 15:35 , Bob O'Hara <rni.boh at gmail.com> wrote:

> I'm trying to read a text file (actually the ftp file in command below),
> and I'm getting an error:
> 
>> SpCodes=read.fwf("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt",
> +                  widths=c(7,6,51,51), skip=6, n=5, header=F,
> +                  stringsAsFactors=F)
> Error in substring(x, first, last) :
>  invalid multibyte string at '<e0> vent'
> 
> The problem is caused by "Dendrocygne à ventre noir", which has a French
> character that seems to be causing the problems. There are more throughout
> the file (and I want to read the whole file: I'm picking out bits above to
> make it easier), so I can't manually delete this. The file is apparently in
> the ISO-8859 format (or it might be windows-1252), but using that in either
> encoding= or fileEncoding= doesn't work:
> 
> SpCodes=read.fwf("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt",
>                 widths=c(7,6,51,51), skip=6, n=5, header=F,
>                 stringsAsFactors=F, fileEncoding="ISO-8859")
> 
> Can anyone suggest a solution? In case it helps, here's my session info:
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8
> LC_PAPER=en_GB.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> loaded via a namespace (and not attached):
> [1] tools_3.1.0
>> 
> 
> 
> -- 
> Bob O'Hara
> 
> Biodiversity and Climate Research Centre
> Senckenberganlage 25
> D-60325 Frankfurt am Main,
> Germany
> 
> Tel: +49 69 798 40226
> Mobile: +49 1515 888 5440
> WWW:   http://www.bik-f.de/root/index.php?page_id=219
> Blog: http://occamstypewriter.org/boboh/
> Journal of Negative Results - EEB: www.jnr-eeb.org
> 

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


