[R] textual analysis - transforming several pdf to txt - naming the files

Rui Barradas
Wed Jul 5 11:57:33 CEST 2023


At 10:14 on 05/07/2023, Cecília Carmo wrote:
> I am taking my first steps in textual analysis with R.
> I have pdf files consisting of company reports for several years (1 file corresponds to 1 company and 1 year).
> My idea is to start by transforming all my pdf files into txt files for further treatment and analysis (this will allow me to group the files by company or by year, depending on the future analysis to be performed).
> I do not have in-depth knowledge of programming in R. I just adapt code that I find to my needs. Here is my first question, about a script I'm adapting:
> 
> My pdf files are in one directory named "pdfs". The names of my files are, for example, SONAE2020FS.pdf, EDP2021GS.pdf
> I want to convert them to txt and give them the same names as the pdf files: SONAE2020FS.txt, EDP2021GS.txt
> I'm running the following script, but the names of the txt files that I obtain are: pdftext1, pdftext2, pdftext3...
> What do I need to change?
> Thank you very much,
> 
> Cecília Carmo
> Universidade de Aveiro - Portugal
> 
> 
> dirpath <- ("/Users/ceciliacarmo/documents/RTextualAnalysis/data/pdfs")
> 
> library(pdftools)
> library(dplyr)
> 
> convertpdf2txt <- function(dirpath){
>    files <- list.files(dirpath, full.names = T)
>    x <- sapply(files, function(x){
>      x <- pdftools::pdf_text(x) %>%
>        paste0(collapse = " ") %>%
>        stringr::str_squish()
>      return(x)
>    })
> }
> 
> # apply function
> txts <- convertpdf2txt(here::here("data", "pdf/"))
> 
> # add names to txt files
> names(txts) <- paste0(here::here("data","pdftext"), 1:length(txts), sep = "")
> 
Hello,

Try the following.
The corrected function convertpdf2txt names its result from the files 
variable.
It uses tools::file_path_sans_ext to drop the extension from each file 
name and then pastes the new extension, "txt", onto it. There is no 
need to call here::here again at the end, since the function already 
returns a named list.
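
To make the renaming step concrete, here it is on its own, as a small sketch. The file names are the ones from the question; the directory "data/pdfs" is only an assumption for illustration.

```r
# Hypothetical paths, as list.files(..., full.names = TRUE) might return them
files <- c("data/pdfs/SONAE2020FS.pdf", "data/pdfs/EDP2021GS.pdf")

# Drop the ".pdf" extension, then append ".txt"
new_names <- paste(tools::file_path_sans_ext(files), "txt", sep = ".")
new_names
# [1] "data/pdfs/SONAE2020FS.txt" "data/pdfs/EDP2021GS.txt"
```

The directory part of each path is kept, so the new names can be used directly as output paths.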



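library(dplyr)  # provides the %>% pipe used below

convertpdf2txt <- function(dirpath){
   # keep only the pdf files; full.names = TRUE returns their full paths
   # (my test used the narrower pattern "Consoli.*\\.pdf$")
   files <- list.files(dirpath, pattern = "\\.pdf$", full.names = TRUE)
   # use forward slashes in the paths (matters on Windows)
   files <- chartr("\\", "/", files)

   # read each pdf and collapse its pages into one clean string
   x <- lapply(files, function(x){
     pdftools::pdf_text(x) %>%
       paste0(collapse = " ") %>%
       stringr::str_squish()
   })
   # same path and file name, with extension "txt" instead of "pdf"
   new_names <- tools::file_path_sans_ext(files)
   new_names <- paste(new_names, "txt", sep = ".")
   setNames(x, new_names)
}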

# apply function
# note that my test files are in "~/Temp"
txts <- convertpdf2txt(here::here("~", "Temp"))
names(txts)
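
Note that the function returns the texts in a named list; it does not write anything to disk. If actual txt files are wanted, something like the following might do. This is only a sketch: write_txts is a name made up here, and it assumes that the names of the list are the full output paths, as they are in the list returned by convertpdf2txt. The demo uses made-up text and a temporary directory instead of real pdf files.

```r
# write_txts is a hypothetical helper, not part of any package.
# It assumes 'txts' is a named list of character strings whose
# names are the paths of the txt files to create.
write_txts <- function(txts) {
  for (path in names(txts)) {
    writeLines(txts[[path]], con = path)
  }
  invisible(names(txts))
}

# self-contained demo with made-up text in a temporary directory
demo <- list("some text", "more text")
names(demo) <- file.path(tempdir(), c("SONAE2020FS.txt", "EDP2021GS.txt"))
write_txts(demo)
file.exists(names(demo))
```

With the real data the call would simply be write_txts(txts).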



Hope this helps,

Rui Barradas


