[R] readPDF() -- unsure how to install xpdf to make this work?

clair.crossupton at googlemail.com clair.crossupton at googlemail.com
Sat Nov 15 19:14:02 CET 2008


Hello, I was wondering whether you had found a solution. I am having
the same difficulty converting PDFs into plain text documents in
R. I originally thought I could use the readLines() function, but as
you can see below, that did not work.

R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
R> download.file(url = my.url, destfile=my.destfile, mode='wb')
R> txt <- readLines(my.destfile)
R> txt
[1] "%PDF-1.4"
[2] "%ÐÔÅØ"
[3] "1 0 obj <<"
[4] "/Length 587 "
[5] "/Filter /FlateDecode"
[6] ">>"
[7] "stream"
[8] "xÚmTM¢@\020½ó+z\017&ÎÁ±?\024tBL\020$ñ°ãd4›½*´.‰\002\001<øï·_•èÌf
\017’W¯_wÕ«îrðãc;Šòê`GæUŠOÛV×&³£øç¾ö\006ƒ¤Ê®\027[vïÖæ6ïWÛ7ñÑTÙÖvb
\030¯“uYt/N¼.³ó5·½êÿ¢¥=\025åS‚<b¸³¿G›"
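
The output above is what one would expect from readLines() on a PDF: the file is not plain text, and the "/Filter /FlateDecode" entry shows that the page content is stored as a compressed stream. One possible workaround, sketched here on the assumption that the pdftotext executable from xpdf is installed and on the PATH, is to call it from R and then read the text file it writes. The helper names pdf_to_text() and txt_name_for() are made up for this illustration.

```r
# Sketch only: assumes the pdftotext executable from xpdf can be found
# on the PATH. pdf_to_text() and txt_name_for() are hypothetical helper
# names, not part of 'tm' or base R.

# Derive the output name by swapping the file extension for ".txt".
txt_name_for <- function(pdf_file) {
  paste0(tools::file_path_sans_ext(pdf_file), ".txt")
}

pdf_to_text <- function(pdf_file, txt_file = txt_name_for(pdf_file)) {
  if (!file.exists(pdf_file)) stop("no such file: ", pdf_file)

  # Sys.which() returns "" when the program is not on the PATH.
  exe <- unname(Sys.which("pdftotext"))
  if (!nzchar(exe)) stop("pdftotext was not found on the PATH")

  # Equivalent to typing: pdftotext "in.pdf" "out.txt" in a console.
  status <- system2(exe, args = c(shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0) stop("pdftotext exited with status ", status)

  # The converted file is ordinary text, so readLines() is now appropriate.
  readLines(txt_file, warn = FALSE)
}
```

With the file downloaded as above, something like txt <- pdf_to_text(my.destfile) should then return the text line by line, provided pdftotext itself can read the file.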


Warm Regards,
Clair

On 13 Nov, 15:10, Tony Breyal <tony.bre... at googlemail.com> wrote:
> Dear R-Help,
>
> I need to convert a set of '.pdf' files into an equivalent set of
> '.txt' files, so that I can do some text mining on the
> content.
>
> In the latest R-News letter
> (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package
> 'tm' for text mining is mentioned. In that lovely package, there is a
> function called 'readPDF()'. In order to use this, ?readPDF says
>
>     "Note that this PDF reader needs both the tools pdftotext and
> pdfinfo installed and accessable on your system."
>
> These tools are available from http://www.foolabs.com/xpdf/download.html
>
> I am able to download this and use it easily from a DOS window to
> convert a PDF file into a text file.
>
> Question: how do I make these tools available to R, so that I can use
> the readPDF() function?
>
> Thank you in advance for any help, and I hope the above made sense.
> Tony Breyal
>
> ###OS = Windows Vista Ultimate
> > sessionInfo()
>
> R version 2.8.0 (2008-10-20)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] tm_0.3-1           XML_1.98-1         Snowball_0.0-3     RWeka_0.3-14
> [5] rJava_0.6-0        Matrix_0.999375-16 lattice_0.17-15    filehash_2.0
>
> loaded via a namespace (and not attached):
> [1] proxy_0.4-1
>
> ______________________________________________
> R-h... at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
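
On the quoted question of making pdftotext and pdfinfo available to R: presumably "accessible on your system" means that R can find the executables on the PATH. A minimal sketch of one way to arrange that for the current R session only is given here. The function name add_to_path() and the folder "C:/xpdf" are made up for the example; the real folder is wherever the xpdf download was unpacked.

```r
# Sketch only: prepend a folder to the PATH seen by the current R session.
# add_to_path() is a hypothetical helper name, not part of base R or 'tm'.
add_to_path <- function(dir) {
  sep   <- .Platform$path.sep            # ";" on Windows, ":" elsewhere
  old   <- Sys.getenv("PATH")
  parts <- strsplit(old, sep, fixed = TRUE)[[1]]
  if (!(dir %in% parts)) {               # avoid adding the folder twice
    Sys.setenv(PATH = paste(dir, old, sep = sep))
  }
  invisible(Sys.getenv("PATH"))
}

# "C:/xpdf" is a hypothetical location; use the folder that actually
# contains pdftotext.exe and pdfinfo.exe.
# add_to_path("C:/xpdf")
# Sys.which(c("pdftotext", "pdfinfo"))   # "" means the tool was not found
```

The change lasts only until R is closed. Adding the same folder to the PATH in the Windows environment variable settings should make it permanent.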


