[R] Help request: Parsing docx files for key words and appending to a spreadsheet
Andy
ph@edru@v @end|ng |rom gm@||@com
Thu Jan 4 13:59:59 CET 2024
Hi folks
Thanks for your help and suggestions - very much appreciated.
I now have some working code, using this file I uploaded for public
access:
https://docs.google.com/document/d/1QwuaWZk6tYlWQXJ3WLczxC8Cda6zVERk/edit?usp=sharing&ouid=103065135255080058813&rtpof=true&sd=true
The small code segment that now works is as follows:
###########
# Load libraries
library(textreadr)
library(tcltk)
library(tidyverse)
#library(officer)
#library(stringr) #for splitting and trimming raw data
#library(tidyr) #for converting to wide format
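# I'd like to keep this as it enables more control over the selected directories.
# Note: setwd() returns the *previous* working directory, so the chosen
# directory is stored first and then made the working directory.
filepath <- tk_choose.dir()
setwd(filepath)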
# The following correctly lists the names of all 9 files in my test directory
# (the pattern is a regular expression, so the dot is escaped and anchored)
files <- list.files(filepath, pattern = "\\.docx$")
files
length(files)
# Ideally, I'd like to skip this step by being able to automatically
# read in the name of each file, but one step at a time:
filename <- "Now they want us to charge our electric cars from litter bins.docx"
# This produces the file content as output when run, and identifies the
# fields that I want to extract.
read_docx(filename) %>%
  str_split(",") %>%
  unlist() %>%
  str_trim()
###########
What I'd like to accomplish next is to extract the data from selected
fields and append it to a spreadsheet (Calc or Excel) under specific
columns or, if it is easier, to write a CSV which I can then use later.
The fields I want to extract are illustrated with reference to the above
file, viz.:
The title: "Now they want us to charge our electric cars from litter bins"
The name of the newspaper: "Mail on Sunday (London)"
The publication date: "September 24, 2023" (in date format, preferably
separated into month and year (day is not important))
The section: "NEWS"
The page number(s): "16" (as numeric)
The length: "515" (as numeric)
The author: "Anna Mikhailova"
The subject: taken from the Subject section, but only where it meets a
threshold, e.g. GREENWASHING >= 50% (here the value is 51%, so this file
would be included). On a match, the code should then select the highest
value under the section "Industry" (here it is ELECTRIC MOBILITY (91%))
and append that text and its % value. If there is no match on
'Greenwashing', it should append 'Null' and move on to the next file in
the directory.
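
To make the field list above concrete, here is a minimal base-R sketch of the extraction. It is only an illustration under assumptions: the sample vector `doc` is a hypothetical stand-in for the output of read_docx(), the line positions (title, newspaper, date on lines 1 to 3) and the labels "Section:", "Length:", "Byline:", "Subject:" and "Industry:" are guesses that must be checked against the real files, and the helpers `get_field()` and `parse_terms()` are made-up names. The "OTHER TOPIC" and "OTHER INDUSTRY" entries are invented filler.

```r
# Hypothetical stand-in for read_docx(filename); the layout is an assumption.
doc <- c(
  "Now they want us to charge our electric cars from litter bins",
  "Mail on Sunday (London)",
  "September 24, 2023",
  "Section: NEWS; Pg. 16",
  "Length: 515 words",
  "Byline: Anna Mikhailova",
  "Subject: GREENWASHING (51%); OTHER TOPIC (60%)",
  "Industry: ELECTRIC MOBILITY (91%); OTHER INDUSTRY (78%)"
)

# Return the text after "Label:" on the first line starting with it, else NA.
get_field <- function(lines, label) {
  hit <- grep(paste0("^", label, ":"), lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(paste0("^", label, ":"), "", hit[1]))
}

# Split "TERM (51%); TERM2 (60%)" into a data frame of term and percentage.
parse_terms <- function(x) {
  parts <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  data.frame(
    term = sub("[[:space:]]*\\([0-9]+%\\)$", "", parts),
    pct  = as.numeric(sub("^.*\\(([0-9]+)%\\)$", "\\1", parts)),
    stringsAsFactors = FALSE
  )
}

# Date: split "September 24, 2023" into month number and year.
# month.name is used instead of as.Date() to avoid locale-dependent parsing.
date_parts <- regmatches(
  doc[3], regexec("^([A-Za-z]+) ([0-9]{1,2}), ([0-9]{4})$", doc[3])
)[[1]]

# Section and page share one line in this assumed layout.
section_line <- get_field(doc, "Section")

# Greenwashing rule: keep the top Industry entry only if GREENWASHING >= 50%.
subjects <- parse_terms(get_field(doc, "Subject"))
gw <- subjects$pct[which(subjects$term == "GREENWASHING")]
if (length(gw) > 0 && !is.na(gw[1]) && gw[1] >= 50) {
  industries <- parse_terms(get_field(doc, "Industry"))
  top <- industries[which.max(industries$pct), ]
  industry <- paste0(top$term, " (", top$pct, "%)")
} else {
  industry <- "Null"
}

# One row per article, with page and length stored as numeric.
record <- data.frame(
  title     = doc[1],
  newspaper = doc[2],
  month     = match(date_parts[2], month.name),
  year      = as.integer(date_parts[4]),
  section   = trimws(sub(";.*$", "", section_line)),
  page      = as.numeric(sub("^.*Pg\\.[[:space:]]*([0-9]+).*$", "\\1", section_line)),
  length    = as.numeric(sub("[^0-9]*([0-9]+).*", "\\1", get_field(doc, "Length"))),
  author    = get_field(doc, "Byline"),
  industry  = industry,
  stringsAsFactors = FALSE
)
record
```

Looking fields up by label rather than by position should cope better with articles where a field is missing, since get_field() then returns NA instead of picking up the wrong line.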
###########
My working theory is that if I can figure out how to extract these
fields and append them correctly, then the rest should just be a matter
of wrapping this up in a for loop.
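
The loop-and-append step might look like the sketch below. It is an outline under assumptions, not tested against the real files: `extract_record()` is a hypothetical placeholder for whatever per-file extraction ends up working (here it only derives the title from the file name), `append_csv()` is a made-up helper name, and the two file names and the temporary output path are invented for the demonstration.

```r
# Hypothetical placeholder for the real per-file extraction: it returns a
# one-row data frame and takes the title from the file name.
extract_record <- function(path) {
  data.frame(
    file  = basename(path),
    title = tools::file_path_sans_ext(basename(path)),
    stringsAsFactors = FALSE
  )
}

# Append one or more rows to a CSV, writing the header only when the file
# does not exist yet.
append_csv <- function(df, path) {
  exists <- file.exists(path)
  write.table(df, path, sep = ",", row.names = FALSE,
              col.names = !exists, append = exists, qmethod = "double")
}

# Invented file names; in practice something like
# list.files(filepath, pattern = "\\.docx$", full.names = TRUE)
files <- c("first article.docx", "second article.docx")

# Invented output location; replace with the real CSV path.
out <- tempfile(fileext = ".csv")

for (f in files) {
  append_csv(extract_record(f), out)
}

result <- read.csv(out)
result
```

Writing a CSV keeps the example free of extra packages; the resulting file opens directly in Calc or Excel.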
However, I am struggling to get my head around the extraction and append
part. If I can get it to work for one of these fields, I suspect that I
can repeat the basic syntax to extract and append the remaining fields.
Therefore, if someone can either suggest a syntax or point me to a
useful tutorial, that would be splendid.
Thank you in anticipation.
Best wishes
Andy
<snip>