[Rd] proposal: 'dev.capabilities()' can also query Unicode capabilities of current graphics device

Sat Sep 23 22:43:47 CEST 2023

On Wed, 20 Sep 2023 12:39:50 +0200
Martin Maechler <maechler using stat.math.ethz.ch> wrote:

> The problem is that some pdf *viewers*,
> notably `evince` on Fedora Linux, for several years now,
> do *not* show *some* of the UTF-8 glyphs because they do not use
> the correct fonts 

One more problem that makes it nontrivial to use Unicode with pdf() is
the graphics device not knowing some of the font metrics:

x <- '\u410\u411\u412'
pdf()
plot(1:10, main = x)
# Warning messages:
# 1: In title(...) : font width unknown for character 0xb0
# 2: In title(...) : font width unknown for character 0xe4
# 3: In title(...) : font width unknown for character 0xfc
# 4: In title(...) : font width unknown for character 0x7f
dev.off()

In the resulting PDF file, the three letters are visible, at least in
Evince 3.38.2, but they are all positioned in the same space.

I understand that this is strictly speaking not pdf()'s fault
(grDevices contains the font metrics for all standard Adobe fonts and a
few more), but I'm not sure what to do as a user. Should I call
pdfFonts(...), declaring a font with all symbols I need? Where does one
even get Type-1 Cyrillic Helvetica (or any other font) with separate
font metrics files for use with pdf()?

Actually, the wrong number of sometimes random character codes reminds
me of stack garbage. In src/library/grDevices/src/devPS.c, function
static double PostScriptStringWidth, there's this bit of code:

	if(!strIsASCII((char *) str) &&
	   /*
	    * Every fifth font is a symbol font:
	    * see postscriptFonts()
	    */
	   (face % 5) != 0) {
	    R_CheckStack2(strlen((char *)str)+1);
	    char buff[strlen((char *)str)+1];
	    /* Output string cannot be longer */
	    mbcsToSbcs((char *)str, buff, encoding, enc);
	    str1 = (unsigned char *)buff;
	}

Later the characters in str1 are iterated over in order to calculate
the total width of the string. I didn't notice this myself until I saw
in the debugger that after a few iterations of the loop, the contents
of str1 are completely different from the result of mbcsToSbcs((char
*)str, buff, encoding, enc), and went to investigate. Only after the
debugger told me that there's no variable called "buff" I realised that
the VLA pointed to by str1 no longer exists.

--- src/library/grDevices/src/devPS.c	(revision 85214)
+++ src/library/grDevices/src/devPS.c	(working copy)
@@ -721,6 +721,8 @@
     unsigned char p1, p2;
 
     int status;
+    /* May be about to allocate */
+    void *alloc = vmaxget();
     if(!metrics && (face % 5) != 0) {
 	/* This is the CID font case, and should only happen for
 	   non-symbol fonts.  So we assume monospaced with multipliers.
@@ -755,9 +757,8 @@
 	    * Every fifth font is a symbol font:
 	    * see postscriptFonts()
 	    */
-	   (face % 5) != 0) {
-	    R_CheckStack2(strlen((char *)str)+1);
-	    char buff[strlen((char *)str)+1];
+	   (face % 5) != 0 && metrics) {
+	    char *buff = R_alloc(strlen((char *)str)+1, 1);
 	    /* Output string cannot be longer */
 	    mbcsToSbcs((char *)str, buff, encoding, enc);
 	    str1 = (unsigned char *)buff;
@@ -792,6 +793,7 @@
 		}
 	}
     }
+    vmaxset(alloc);
     return 0.001 * sum;
 }
 

 
After this patch, I'm consistently getting the right character codes in
the warnings, but I still don't know how to set up the font metrics.

-- 
Best regards,
Ivan