[R] Reading PDF files with German umlauts using tabulizer
Wolfgang Grond
grond @end|ng |rom number|@nd@de
Tue Sep 6 11:39:52 CEST 2022
Dear all,
I have some trouble reading PDF files in German.
I want to extract text and tables with the tabulizer package, and
everything goes well as long as I read English texts.
When I try the same code
text <- extract_text(file = "Pub_001.pdf")
with documents in German,
the umlauts are not recognized.
They are either replaced by a combination of characters:
Instead of
"Entmischung und Kristallisation in Gläsern des Systems"
--
I get
"Entmischung und Kristallisation in GHisern des Systems"
--
or they are replaced by plain ASCII characters, like this:
Instead of
"In Gläsern des Systems"
-
I get
"In Glasern des Systems"
-
Opening the file with Adobe Reader tells me that the encoding is "Ansi".
Is there a way to read this file correctly?
Thanks in advance for any idea.
Regards