[R] Reading PDF files with German umlauts using tabulizer
Kimmo Elo
k|mmo@e|o @end|ng |rom utu@||
Wed Sep 7 10:03:06 CEST 2022
Hi!
The package "tabulizer" seems to have been removed from the package
repositories, so it is a bit hard to test.
I found the documentation, and the signature of "extract_text" is:
    extract_text(file, pages = NULL, area = NULL, password = NULL,
                 encoding = NULL, copy = FALSE)
So have you tried to set the "encoding" parameter?
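
A minimal sketch of what setting that argument might look like, under stated assumptions. As far as the tabulizer documentation indicates, "encoding" is a character string handed to base R's `Encoding<-`, which accepts "latin1", "UTF-8", "bytes" or "unknown". The value "UTF-8" in the commented call is a guess, not a confirmed fix, and the call is untested because the package is not available. The base R part only illustrates what declaring an encoding does to a string holding the Latin-1 byte for "ä".

```r
# Base R: declaring an encoding changes how the raw bytes are interpreted.
x <- "Gl\xe4sern"          # 0xE4 is "ä" in Latin-1; encoding not yet declared
Encoding(x) <- "latin1"    # mark the bytes as Latin-1
y <- enc2utf8(x)           # convert to UTF-8 for consistent handling
print(y)

# With tabulizer installed, the analogous call would be (untested assumption;
# file name taken from the original question):
# text <- extract_text(file = "Pub_001.pdf", encoding = "UTF-8")
```

If the umlauts are already wrong in the PDF's own text layer, declaring an encoding will not repair them; this only helps when the bytes are correct but mislabelled.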
HTH,
Kimmo
On Tue, 2022-09-06 at 11:39 +0200, Wolfgang Grond wrote:
> Dear all,
>
> I have some trouble with reading PDF files in German language.
>
> I want to extract text and tables with the tabulizer package, and
> everything goes well as long as I read English texts.
>
> When I try the same code
>
> text <- extract_text(file = "Pub_001.pdf")
>
> with documents in German, the umlauts are not recognized.
>
> They are either replaced by a combination of characters:
>
> Instead of
>
> "Entmischung und Kristallisation in Gläsern des Systems"
> --
> I get
>
> "Entmischung und Kristallisation in GHisern des Systems"
> --
>
> or they are replaced by plain ASCII characters, like this:
>
> Instead of
>
> "In Gläsern des Systems"
> -
> I get
>
> "In Glasern des Systems"
> -
>
> Opening the file with Adobe Reader tells me that the encoding is "Ansi".
>
> Is there a way to read this file correctly?
>
> Thanks in advance for any idea.
>
> Regards
>