[R-sig-Geo] readOGR workaround for Japanese UTF-8 geojson

Mon Jun 29 09:32:27 CEST 2020

On Mon, 29 Jun 2020, Alan Engel wrote:

> I am working on a project https://github.com/AlanInTsukuba/jpucd that 
> involves extracting shapefiles and property data from Japanese geojson 
> files. When reading with readOGR(ibarakipath1 , encoding="UTF-8", 
> use_iconv=TRUE), I find that the subsets cannot be written with 
> writeOGR without losing the text fields that contain Japanese text. I found 
> the following workaround but wonder if there is a better way to do this.
>

Firstly, the ESRI shapefile driver should only be used for reading legacy 
files with known text encodings. Shapefiles store attribute data in DBF 
files, which should no longer be used in new work. DBF restricts field 
name length, stores numerical data imprecisely, and has big problems 
storing any text that is not ASCII (see 
https://cran.r-project.org/web/packages/rgdal/vignettes/OGR_shape_encoding.pdf).

All new projects should use more modern formats, preferably GeoPackage 
(GPKG, http://www.geopackage.org/spec/), which resolves all of the 
problems mentioned.
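
As a rough illustration of the point about GeoPackage, the sketch below 
writes a toy layer to a temporary GPKG file with sf and reads it back. The 
field names, the numeric values and the layer name "toy_districts" are 
made up for the example and do not come from the jpucd data; the only 
assumption is that sf is installed with a GDAL build that includes the 
GPKG driver.

```r
library(sf)

# Hypothetical toy data: a field name longer than the 10 characters DBF
# allows, Japanese text, and non-integer numeric values.
pts <- st_sf(
  municipality_name_kanji = c("つくば市", "守谷町"),
  population_density_per_km2 = c(585.25, 1410.75),
  geometry = st_sfc(st_point(c(140.08, 36.08)),
                    st_point(c(139.98, 35.95)),
                    crs = 4326)
)

# Write to a temporary GeoPackage and read the layer back.
gpkg <- tempfile(fileext = ".gpkg")
st_write(pts, gpkg, layer = "toy_districts", quiet = TRUE)
back <- st_read(gpkg, layer = "toy_districts", quiet = TRUE)

names(back)                    # field names as stored in the file
back$municipality_name_kanji   # text as read back from the file
```

Comparing names(back) and the text column with the input is a quick way 
to check that nothing was truncated or re-encoded on the way through.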

If your project is using R on Windows, you also need to be aware of 
https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/index.html: 
R on Windows is moving towards UTF-8 in order to reduce internal and 
cross-platform encoding problems.
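
To see what this means for a given session, a small base R sketch like 
the following can help. It checks whether the session is running in UTF-8 
and shows what the Encoding(x) <- "UTF-8" step in the quoted workaround 
does to a string. The variable names are only illustrative.

```r
# TRUE when the session's native encoding is UTF-8.
is_utf8_session <- isTRUE(l10n_info()[["UTF-8"]])

# "つくば市" written with Unicode escapes so the script itself stays ASCII.
x <- enc2utf8("\u3064\u304f\u3070\u5e02")
Encoding(x)

# Drop the encoding mark: the bytes are unchanged, but R now treats them
# as native-encoding text, which garbles them in a non-UTF-8 session.
y <- x
Encoding(y) <- "unknown"
identical(charToRaw(x), charToRaw(y))

# Re-declare the bytes as UTF-8, as the Encoding(x) <- "UTF-8" workaround does.
Encoding(y) <- "UTF-8"
Encoding(y)
```

Encoding<- only changes the declared encoding and never converts bytes, 
so it is only appropriate when the bytes really are UTF-8 already.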

Finally, you should be starting new work using the sf workflow, not 
sp/rgdal, especially for spatial vector data, for which sf provides full 
support. sp/rgdal are being maintained only to support their reverse 
dependencies.
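
A minimal sketch of what the quoted steps might look like with sf, 
assuming a CITY_NAME column as in your code. To keep it self-contained it 
first writes a small stand-in GeoJSON file to a temporary location; the 
four toy points and the object names are hypothetical and are not the 
jpucd data.

```r
library(sf)

# Hypothetical stand-in for the Ibaraki GeoJSON file: four toy points.
districts <- st_sf(
  CITY_NAME = c("つくば市", "守谷町", "伊奈町", "水戸市"),
  geometry = st_sfc(st_point(c(140.08, 36.08)),
                    st_point(c(139.98, 35.95)),
                    st_point(c(140.03, 35.98)),
                    st_point(c(140.47, 36.37)),
                    crs = 4326)
)
src <- tempfile(fileext = ".geojson")
st_write(districts, src, quiet = TRUE)

# Read the GeoJSON file; no encoding arguments are passed here.
ibaraki <- st_read(src, quiet = TRUE)

# Subset with %in% rather than a chain of == comparisons.
tx_cities <- c("つくば市", "守谷町", "伊奈町", "谷和原村")
tx2000 <- ibaraki[ibaraki$CITY_NAME %in% tx_cities, ]

# Write the subset to a GeoPackage and read it back to check the text fields.
out <- tempfile(fileext = ".gpkg")
st_write(tx2000, out, layer = "TsukubaExpressCensusDistricts2000",
         quiet = TRUE)
back <- st_read(out, quiet = TRUE)
back$CITY_NAME
```

With the real file, src would simply be the path to the GeoJSON in the 
package's extdata directory.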

Roger

>
> Environment: RGui, Windows10
>
> # load ibaraki shapefiles, extract TX subset, write to geojson
> library(rgdal)  # provides readOGR() and writeOGR()
> library(jpucd)
>
> shppath <- system.file("extdata",package="jpucd")
> ibarakipath1 <-
>   paste(shppath,"JPGen2005CTgenlCY2000P08Ibaraki.geojson",sep="/")
>
> #^ JPGen2005CTgenlCY2000P08Ibaraki.geojson is a UTF-8 encoded geojson file
> #^     having Japanese names in property fields. To be able to
> #^     read these fields, they need to be converted (to Shift-JIS?).
> #^     The following command does this.
> #^ This can also be done by use_iconv=FALSE and setting
> #^     the encoding of the Japanese columns using Encoding(x) <- "UTF-8".
>
> ## use_iconv=TRUE loads so that the Japanese fields are readable,
> ## but writeOGR doesn't write them.
> ibaraki <- readOGR(ibarakipath1, encoding="UTF-8", use_iconv=FALSE)
> head(ibaraki@data)
>
> #^ Apply Encoding(x) <- "UTF-8" to every character column
> chr_cols <- names(ibaraki@data)[sapply(ibaraki@data, is.character)]
> for (name in chr_cols) {
>   Encoding(ibaraki@data[[name]]) <- "UTF-8"
> }
>
> #^ Get TX subset
> tx_cities <- c("つくば市", "守谷町", "伊奈町", "谷和原村")
> tx2000 <- ibaraki[ibaraki@data$CITY_NAME %in% tx_cities, ]
> head(tx2000@data)
>
> #^ Write it.
> dsn <- "TsukubaExpressCensusDistricts2000.geojson"
> writeOGR(tx2000, dsn, layer="TsukubaExpressCensusDistricts2000",
>          driver="GeoJSON", dataset_options=NULL,
>          layer_options=NULL, verbose=FALSE, check_exists=NULL,
>          overwrite_layer=FALSE, delete_dsn=FALSE, morphToESRI=NULL,
>          encoding="UTF-8")
>
> Thank you.
>
> Alan
>
> https://alanintsukuba.github.io/
>

-- 
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en