[R] Encoding() and strsplit()

Fri Nov 7 08:23:48 CET 2008

Dear All,

Encoding() goes beyond my understanding; see the example below.
From reading the help for Encoding(), I would expect strsplit() to
preserve the encoding of each resulting element, but for plain
(unaccented) letters it gets lost.
It also seems that an encoding cannot be declared for plain letters:
they remain "unknown" in any case. In paste(), "latin1" seems to
dominate "unknown".
What kind of characteristic of an object is the encoding? It does not
show up as an attribute, and str() does not give me any hint either.
Where can I find an explanation of encodings?

Thanks

Heinz

###   Encoding() and strsplit
u <- 'abcäöü'
Encoding(u)
# [1] "latin1"
Encoding(u) <- 'latin1' # to be sure about encoding
us <- strsplit(u, '')[[1]] # split into single-character strings
Encoding(us)
# [1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
Encoding(us) <- rep('latin1', length(us))
Encoding(us)
# [1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
pus <- paste(us[1], us[5], sep='')
Encoding(pus)
# [1] "latin1"
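
The behaviour above can be reproduced independently of the script's own encoding. The following is a minimal sketch, not a definitive explanation: it assumes the Latin-1 bytes are written as \x escapes, and is_ascii() is a hypothetical helper introduced only for this illustration. It relies on the statement in ?Encoding that ASCII strings are never marked with a declared encoding, because their bytes are the same in all supported encodings.

```r
# Latin-1 bytes are written as \x escapes so the example does not depend
# on the encoding of the script file.
u <- "abc\xe4\xf6\xfc"
Encoding(u) <- "latin1"          # declare how the bytes are to be read

# The declared encoding is a mark on each string element,
# not an R-level attribute of the vector.
is.null(attributes(u))           # TRUE
Encoding(u)                      # "latin1"

# Pure ASCII strings are never marked; the assignment has no effect.
a <- "abc"
Encoding(a) <- "latin1"
Encoding(a)                      # "unknown"

# Hypothetical helper (not part of R): TRUE where an element is pure ASCII.
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

# After splitting, the ASCII pieces report "unknown". How the non-ASCII
# pieces are marked may depend on the R version and the session locale.
us <- strsplit(u, "")[[1]]
ascii_marks <- Encoding(us[is_ascii(us)])

# To get one known encoding throughout, convert rather than re-declare.
u8 <- enc2utf8(u)
Encoding(u8)                     # "UTF-8"
```

Seen this way, "unknown" on an ASCII element does not mean that information was lost: the element is valid under any of the supported encodings, so there is nothing to mark.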

Version:
  platform = i386-pc-mingw32
  arch = i386
  os = mingw32
  system = i386, mingw32
  status = Patched
  major = 2
  minor = 8.0
  year = 2008
  month = 11
  day = 04
  svn rev = 46830
  language = R
  version.string = R version 2.8.0 Patched (2008-11-04 r46830)

Windows XP (build 2600) Service Pack 2

Locale:
LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252

Search Path:
  .GlobalEnv, package:stats, package:graphics, package:grDevices,
  package:utils, package:datasets, package:methods, Autoloads, package:base