[R] readPDF() -- unsure how to install xpdf to make this work?
clair.crossupton at googlemail.com
Sat Nov 15 19:14:02 CET 2008
Hello, I was just wondering if you had found a solution? I am having
the same difficulty converting PDFs into plain text documents in
R. I originally thought I could use the readLines() function, but as
you can see below, that did not work.
R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
R> download.file(url = my.url, destfile=my.destfile, mode='wb')
R> txt <- readLines(my.destfile)
R> txt
[1] "%PDF-1.4"
[2] "%ÐÔÅØ"
[3] "1 0 obj <<"
[4] "/Length 587 "
[5] "/Filter /FlateDecode"
[6] ">>"
[7] "stream"
[8] "xÚmTM¢@\020½ó+z\017&ÎÁ±?\024tBL\020$ñ°ãd4½*´.\002\001<øï·_èÌf
\017W¯_wÕ«îrðãc;òê`GæUOÛV×&³£øç¾ö\006¤Ê®\027[vïÖæ6ïWÛ7ñÑTÙÖvb
\030¯uYt/N¼.³ó5·½êÿ¢¥=\025åS<b¸³¿G"
Warm Regards,
Clair
On 13 Nov, 15:10, Tony Breyal <tony.bre... at googlemail.com> wrote:
> Dear R-Help,
>
> I need to convert a set of '.pdf' files into an equivalent set of
> '.txt' files. This is so that I can do some text mining on the
> content.
>
> In the latest R News newsletter (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf),
> the package 'tm' for text mining is mentioned. In
> that lovely package, there is a function called 'readPDF()'. In order
> to use this, ?readPDF says
>
> "Note that this PDF reader needs both the tools pdftotext and
> pdfinfo installed and accessable on your system."
>
> These tools are available from http://www.foolabs.com/xpdf/download.html
>
> I am able to download this and use it easily from a dos window to
> convert a pdf file into a txt file.
>
> Question: how do I make these tools available to R, so that I can use
> the readPDF() function?
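
A possible answer, offered as a sketch rather than a tested recipe: external tools are normally found through the PATH environment variable, so adding the xpdf folder to PATH for the current R session may be enough for R to locate pdftotext and pdfinfo. The folder C:/xpdf and the helper add_to_path are assumptions for illustration; use whatever folder the xpdf download was unpacked into.

```r
# Hypothetical install location; replace with wherever xpdf was unpacked.
xpdf_dir <- "C:/xpdf"

# Return a PATH string with `dir` appended, unless it is already listed.
add_to_path <- function(dir, path = Sys.getenv("PATH")) {
  parts <- strsplit(path, .Platform$path.sep, fixed = TRUE)[[1]]
  if (dir %in% parts) {
    path
  } else {
    paste(c(parts, dir), collapse = .Platform$path.sep)
  }
}

# Change PATH for this R session only; the system setting is untouched.
Sys.setenv(PATH = add_to_path(xpdf_dir))

# Each entry is a non-empty path once R can locate the tool.
Sys.which(c("pdftotext", "pdfinfo"))
```

Setting PATH permanently in the Windows environment variables dialog should have the same effect for every new R session; the session-only version is just easier to try out first.
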
>
> Thank you in advance for any help, and I hope the above made sense.
> Tony Breyal
>
> ###OS = Windows Vista Ultimate
> > sessionInfo()
>
> R version 2.8.0 (2008-10-20)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] grid stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] tm_0.3-1 XML_1.98-1 Snowball_0.0-3 RWeka_0.3-14 rJava_0.6-0 Matrix_0.999375-16 lattice_0.17-15 filehash_2.0
>
> loaded via a namespace (and not attached):
> [1] proxy_0.4-1
>
> ______________________________________________
> R-h... at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.