[Rd] foreign::read.dbf fails to parse dbf properly
Roger Bivand Roger.Bivand at nhh.no
Sat Jul 30 13:27:27 CEST 2022
On Sat, 30 Jul 2022, r-devel-request at r-project.org wrote:
> Dear R developers,
> tl;dr I've been trying to read foxpro dbf files with
> foreign::read.dbf(), they weren't being read properly, I patched the
> foreign package to make it work, now what?
> Long version:
> I recently encountered unexpected behavior attempting to read dbf files
> using foreign::read.dbf() from here:
As you may have seen in the code or the help page, the first port from the
then-current version of shapelib (https://github.com/OSGeo/shapelib) was
made by Nicholas Lewin-Koh 20 years ago. The code is largely unchanged
since then.
What is your OS and R version?
Reading F1_15.DBF (Fedora 36, locally built R 4.2.1), I see:
 RESPONDENT      REPORT_YEA     SPPLMNT_NU   ROW_NUMBER      ROW_SEQ
 \xc2   :  350   \xe4\a:25774   NA's:25774   U      :  874   U      :  874
 \001   :  248                               C      :  864   C      :  864
 \002   :  208                               T      :  846   T      :  846
 x      :  206                               \004   :  845   \004   :  845
 \x85   :  197                               \002   :  840   \002   :  840
 (Other):24363                               (Other):20813   (Other):20813
 NA's   :  202                               NA's   :  692   NA's   :  692
which suggests that encoding may be another problem, since R 4.2 on Windows
uses UTF-8 as its native encoding.
The help page does say:
"The DBF format is documented but not much adhered to. There is
no guarantee this will read all DBF files."
'read.dbf' is based on C code from <http://shapelib.maptools.org/>
which implements the 'XBASE' specification. It can convert fields
of type '"L"' (logical), '"N"' and '"F"' (numeric and float) and
'"D"' (dates): all other field types are read as-is as character
vectors. A numeric field is read as an R integer vector if it is
encoded to have no decimals, otherwise as a numeric vector.
However, if the numbers are too large to fit into an integer
vector, it is changed to numeric. Note that it is possible to read
integers that cannot be represented exactly even as doubles: this
sometimes occurs if IDs are incorrectly coded as numeric.
So pre-converting seems easier than retro-fitting, given the time since
the function was first published. LibreOffice seems to see 40, and
writes 40 in a more accessible way, which can be read by read.dbf().
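
Before pre-converting, it may help to check which fields in a file fall
outside the types the help page lists as converted. What follows is only a
minimal Python sketch, assuming the usual dBASE-style layout (a 32-byte file
header with the header length at offset 8, followed by 32-byte field
descriptors that end at a 0x0D byte, with the type code at offset 11 of each
descriptor). The helper names are hypothetical, they are not part of foreign
or shapelib, and FoxPro-specific header extras are ignored.

```python
import struct

# Field types the read.dbf help page says are converted.
HANDLED_TYPES = {"L", "N", "F", "D"}


def list_fields(header: bytes):
    """Return (name, type, length, decimals) for each field descriptor.

    Assumption: descriptors are 32 bytes each, start at offset 32,
    and the list is terminated by a 0x0D byte.
    """
    fields = []
    offset = 32
    while offset + 32 <= len(header) and header[offset] != 0x0D:
        desc = header[offset:offset + 32]
        name = desc[:11].split(b"\x00", 1)[0].decode("ascii", "replace")
        ftype = chr(desc[11])
        length, decimals = desc[16], desc[17]
        fields.append((name, ftype, length, decimals))
        offset += 32
    return fields


def unconverted_fields(header: bytes):
    """Names and types of fields that would be read as-is as character."""
    return [(n, t) for n, t, _, _ in list_fields(header)
            if t not in HANDLED_TYPES]


def read_header(path: str) -> bytes:
    """Read the header bytes of a DBF file (length is a uint16 at offset 8)."""
    with open(path, "rb") as fh:
        start = fh.read(32)
        (header_len,) = struct.unpack("<H", start[8:10])
        return start + fh.read(max(header_len - 32, 0))
```

Calling unconverted_fields(read_header("F1_15.DBF")) would then list every
field, including any of type "I", that is not of type L, N, F or D.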
Using ogr2ogr, a program from GDAL (locally built 3.5.1, with its bundled
shapelib, on Fedora 36 in a UTF-8 locale),
https://gdal.org/programs/ogr2ogr.html:

  ogr2ogr -f CSV F1_15.csv F1_15.DBF

I get:

  Warning 1: One or several characters couldn't be converted correctly from
  CP1252 to UTF-8. This warning will not be emitted anymore

and "(" not 40. So it doesn't seem that updating the shapelib files in
foreign would help.
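
The "(" versus 40 discrepancy is what one would expect if the four raw bytes
of an integer holding 40 were treated as text. Here is a minimal Python
sketch of that, assuming the FoxPro "I" type is stored as a 4-byte
little-endian signed integer, and assuming a reader that treats the field as
NUL-terminated character data. Both function names are hypothetical and are
not part of foreign or shapelib.

```python
import struct


def decode_foxpro_int(raw: bytes) -> int:
    """Decode a 4-byte field as a little-endian signed integer.

    Assumption: this is how a FoxPro type "I" field is stored.
    """
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return struct.unpack("<i", raw)[0]


def decode_as_text(raw: bytes) -> str:
    """Mimic a reader that treats the field as NUL-terminated text."""
    return raw.split(b"\x00", 1)[0].decode("latin-1")


raw_field = bytes([0x28, 0x00, 0x00, 0x00])  # the integer 40 as stored
print(decode_foxpro_int(raw_field))  # 40
print(decode_as_text(raw_field))     # (
```

The same four bytes give 40 under the integer interpretation and "(" under
the character interpretation, since 0x28 is the ASCII code for "(".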
In addition, there is an error under options("warn"=2L) in:
Error in read.dbf(i) :
(converted from warning) value |0| found in logical field
and possibly others which do not seem to relate to the field definition
problem you identified.
> unzipped, in UPLOADERS/FORM1/working/F1_15.DBF - and as a note, this is
> a foxpro database. I would expect the first row of the first column to
> be 40, instead I am getting "(" (realizing that "(" has a decimal ascii
> value of 40). The xbase docs indicate that this is a field of type "I"
> which is a 4-byte integer unique to foxpro, and it doesn't look like
> this case is contemplated by read.dbf()
> I made some modifications to Rdbfread.c and dbfopen.c in the foreign
> package (version 0.8-82) to add specific handling for field type "I".
> I'm not currently set up to contribute directly, I don't have SVN access.
> 1. Is this patch of general interest? I'm weighing in the development
> - DO NOT fix exotic bugs that haven't bugged anyone
> - DO make small enhancements if they are badly needed
> and I feel like this is maybe a bit of an exotic lack-of-feature
> (wouldn't call it a bug), and I have no idea if this is badly needed
> (by anybody, other than myself)
> 2. if of general interest, how can I get set up with SVN credentials
> for R-packages?
Department of Economics, Norwegian School of Economics,
Postboks 3490 Ytre Sandviken, 5045 Bergen, Norway.
e-mail: Roger.Bivand at nhh.no