[R] Figuring out encodings of PDFs in R
Duncan Murdoch
murdoch.duncan at gmail.com
Wed Jun 27 02:07:27 CEST 2012
On 12-06-26 3:28 PM, Jonas Michaelis wrote:
> Dear list,
>
> I am currently scraping some text data from several PDFs using the
> readPDF() function in the tm package. This all works very well and in most
> cases the encoding seems to be "latin1" - in some, however, it is not. Is
> there a good way in R to check character encodings? I found the functions
> is.utf8() and is.local() in the tau package but that obviously only gets me
> so far.
>
There are heuristics for guessing encodings, but I don't think they are
built into R. I think the way to do what you want is to read the PDF
spec to find out how the strings are encoded in the source file, and
believe that.
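
As a rough illustration of the kind of heuristic mentioned above, here is a
minimal sketch in base R. The helper name guess_encoding() is hypothetical
(it is not part of tm or tau), and it assumes R >= 3.3.0 for validUTF8().
It can only separate pure ASCII, valid UTF-8, and "something else"; it
cannot tell latin1 apart from other 8-bit encodings, so it is a guess
rather than a replacement for checking how the PDF itself encodes its
strings.

```r
# Hypothetical helper: classify each string by inspecting its bytes.
# This is a heuristic guess, not a definitive encoding detector.
guess_encoding <- function(x) {
  # A string is pure ASCII if every byte is below 0x80.
  is_ascii <- vapply(
    x,
    function(s) all(charToRaw(s) < as.raw(0x80)),
    logical(1),
    USE.NAMES = FALSE
  )
  # validUTF8() checks whether the bytes form well-formed UTF-8.
  is_utf8 <- validUTF8(x)
  ifelse(is_ascii, "ASCII",
         ifelse(is_utf8, "UTF-8", "unknown 8-bit"))
}

# Build test strings from raw bytes so the example does not depend on
# the encoding of the source file or the current locale.
plain  <- "plain text"
utf8   <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))  # "café" as UTF-8
latin1 <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))        # "café" as latin1

samples <- c(plain, utf8, latin1)
guess_encoding(samples)

# If a string is believed to be latin1, iconv() can convert it to UTF-8.
converted <- iconv(latin1, from = "latin1", to = "UTF-8")
validUTF8(converted)
```

Text that contains non-ASCII characters and still passes validUTF8() is
very likely UTF-8, since latin1 text rarely forms valid UTF-8 sequences
by accident. The reverse does not hold: a string that fails the check
might be latin1, but it might equally be some other single-byte encoding.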
Duncan Murdoch