[Bioc-devel] Biostrings: unicode characters

Mon Feb 24 14:28:50 CET 2020

Dear colleagues,

Apart from my Bioconductor packages, I am also the maintainer of the 
CRAN package 'apcluster'. This package's vignette includes an example in 
which biological sequences are clustered. To this end, it uses the 
'Biostrings' package. It seems that the latest version of the 
'Biostrings' package now uses a Unicode ellipsis character instead of 
three dots for shortening overly long sequences. This has caused my 
'apcluster' vignette to fail in the Latin1 locale that is used by one of 
the CRAN build servers. Please find a message by Kurt Hornik enclosed 
below that explains the issue in more detail. (Thanks again, Kurt, for 
spotting this and for going so deeply into the issue!) Moreover, even in 
a UTF-8 locale, the ellipsis character causes a warning when I process 
the output in my LaTeX-/Rnw-based 'knitr' vignette.

My question: Is there any special measure that I can take to counteract 
this issue (e.g. loading \usepackage[xxx]{inputenc} in the vignette)? 
Is there a way for users to revert to the old-style dots in cases like 
mine?
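
One possible stopgap on the vignette side, sketched here under assumptions: 
post-process the printed lines and put three dots back in place of the 
ellipsis. The helper name ascii_ellipsis is hypothetical and not part of 
'Biostrings' or 'apcluster'. It only changes what reaches LaTeX and would 
not silence the encoding warning raised while printing.

```r
# Hypothetical helper (not part of Biostrings): replace the Unicode
# ellipsis, or its "<U+2026>" escape form, with three ASCII dots.
ascii_ellipsis <- function(lines) {
    lines <- gsub("\u2026", "...", lines, fixed = TRUE)
    gsub("<U+2026>", "...", lines, fixed = TRUE)
}

# Stand-in lines; in a vignette they might come from
# capture.output(ch22Promoters), which is untested here.
demo <- c("AGACTTAAGG\u2026GCGCAGC", "CCCGGCTAAT<U+2026>GGCGAGGTG")
cat(ascii_ellipsis(demo), sep = "\n")
```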

Any help is greatly appreciated, thanks so much in advance!

Best regards,
Ulrich

--

[...]

But the issue seems to be the following.  Your vignette has

<<LoadCh22Promoters>>=
library(Biostrings)
filepath <- system.file("examples", "ch22Promoters.fasta",
                         package="apcluster")
ch22Promoters <- readDNAStringSet(filepath)
ch22Promoters
@

when running this in a Latin1 locale, the last line gives

R> ch22Promoters
DNAStringSet object of length 150:
       width seq                                             names
Warning in paste0(as.character(subseq(x, start = 1, width = w1)), compact_ellipsis,  :
   strings not representable in native encoding will be translated to UTF-8
Calls: <Anonymous> ... .XStringSet.show_frame_line -> toSeqSnippet -> paste0
   [1]  1000 AGACTTAAGGGACCTGGTCACCA<U+2026>GCGCCCGTGTGCGCATGCGCAGC NM_001169111
   [2]  1000 CCCGGCTAATTTTTTTGTATTTT<U+2026>CGCCGCGGAGTCCGGGCGAGGTG NM_012324
   [3]  1000 CACATGTGCCCTCTGGGCCTGGT<U+2026>CAGTGCAAACGCAGCGCCAGACA NM_144704
   [4]  1000 AAGCATGGTGGGATTGGCACAGG<U+2026>GCTGGGAATGGTCCCGCGGCTCC NM_002473
   [5]  1000 TTTAGAGAACTGGGTCTTGCTAT<U+2026>ATGGACAGAGCCCAGCGGGAGCG NM_001184970
   ...   ... ...
[146]  1000 TCCGCCTCCTGGGTTCAAGCAAT<U+2026>ACGGGTCGGGGAGGGGCAGTAAG NM_032608
[147]  1000 ACTAAACTTAGTATATTATACTT<U+2026>GACCTCGCGGGTGGGCGGGGCCT NM_003560
[148]  1000 AGGATCACATCAGCTAACAGACT<U+2026>GGTCATTCAACCCTAGATCCACC NM_001166242
[149]  1000 AGGCAGGAGAATCGATTGAACCC<U+2026>TGGGTGGATCATGAGGTCAGGAG NM_001165877
[150]  1000 GCTCAGGATGAAATAGGCCCCCA<U+2026>GTCATGAGCTGCTGGGAAGTTGT NM_145343

where the problem seems to be in BioC's Biostrings doing

   XString-class.R:compact_ellipsis <- rawToChar(as.raw(c(0xe2, 0x80, 0xa6)))

which yields a UTF-8 byte sequence that is not marked as UTF-8, which
spells trouble in a Latin1 locale.
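
To illustrate the marking issue, here is a minimal sketch using base R
only. The variable names are made up for the example. It sets Encoding()
explicitly because, as far as I understand, an unmarked string is
otherwise taken to be in the native encoding.

```r
# Build the ellipsis from raw bytes, as in the quoted Biostrings line.
raw_ellipsis <- rawToChar(as.raw(c(0xe2, 0x80, 0xa6)))
Encoding(raw_ellipsis)      # "unknown", i.e. treated as native encoding

# Declare the bytes to be UTF-8 without changing them.
marked_ellipsis <- raw_ellipsis
Encoding(marked_ellipsis) <- "UTF-8"
Encoding(marked_ellipsis)   # "UTF-8"

# The \u escape yields a string that is already marked as UTF-8.
escaped_ellipsis <- "\u2026"
identical(marked_ellipsis, escaped_ellipsis)
```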

Can you please discuss this with the Biostrings maintainers?  Something
like

   enc2utf8(rawToChar(as.raw(c(0xe2, 0x80, 0xa6))))

or

   "\u2026"

will get at least the encoding right, but I am never sure whether the
latter is a good idea in package code, and in general it would be best
to use the compact ellipsis only in a UTF-8 locale ...
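
A rough sketch of that last idea, assuming the check is made via
l10n_info() from base R. The function name pick_ellipsis and its argument
are hypothetical; this is not how Biostrings currently does it.

```r
# Hypothetical helper: use the compact ellipsis only in a UTF-8 locale
# and fall back to three ASCII dots everywhere else.
pick_ellipsis <- function(utf8_locale = isTRUE(l10n_info()[["UTF-8"]])) {
    if (utf8_locale) "\u2026" else "..."
}

# Build a shortened sequence snippet with whichever ellipsis applies.
snippet <- paste0("AGACTTAAGG", pick_ellipsis(), "GCGCAGC")
snippet
```

Evaluating the locale at call time rather than at package build time
would let the same installed package behave sensibly in both kinds of
locale.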

Best
-k