[R] Help request: Parsing docx files for key words and appending to a spreadsheet

Thu Jan 4 13:59:59 CET 2024

Hi folks

Thanks for your help and suggestions - very much appreciated.

I now have some working code, using this file I uploaded for public 
access: 
https://docs.google.com/document/d/1QwuaWZk6tYlWQXJ3WLczxC8Cda6zVERk/edit?usp=sharing&ouid=103065135255080058813&rtpof=true&sd=true 

The small code segment that now works is as follows:

###########

# Load libraries
library(textreadr)
library(tcltk)
library(tidyverse)
#library(officer)
#library(stringr) #for splitting and trimming raw data
#library(tidyr) #for converting to wide format

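# I'd like to keep this as it enables more control over the selected
# directories. Note that setwd() returns the *previous* working
# directory, so store the chosen path first and then set it.
filepath <- tk_choose.dir()
setwd(filepath)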

# The following lists the names of all 9 files in my test directory
# (the pattern is a regular expression, so anchor it to the extension)
files <- list.files(filepath, pattern = "\\.docx$")
files
length(files)

# Ideally, I'd like to skip this step by being able to automatically
# read in the name of each file, but one step at a time:
filename <- "Now they want us to charge our electric cars from litter bins.docx"

# This produces the file content as output when run, and identifies the
# fields that I want to extract.
read_docx(filename) %>%
   str_split(",") %>%
   unlist() %>%
   str_trim()

###########
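
A minimal sketch of how the hard-coded filename might be avoided, 
assuming each title matches its file name as it does for the file above. 
The temporary folder and its empty files are hypothetical stand-ins for 
the directory chosen with tk_choose.dir(); only base R is used.

```r
# Hypothetical stand-in for the chosen directory: a temp folder with
# empty files (two .docx files and one file that should be ignored)
docdir <- file.path(tempdir(), "docx_demo")
dir.create(docdir, showWarnings = FALSE)
file.create(file.path(docdir, c("First article.docx",
                                "Second article.docx",
                                "notes.txt")))

# full.names = TRUE returns paths that can be passed straight to a reader
files <- list.files(docdir, pattern = "\\.docx$", full.names = TRUE)

# Strip the directory and the extension to recover each title
titles <- tools::file_path_sans_ext(basename(files))

for (i in seq_along(files)) {
  # In the real script this is where read_docx(files[i]) would go
  cat(i, ":", titles[i], "\n")
}
```

With full.names = TRUE there is no need to call setwd() at all, since 
every path already includes the directory.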

What I'd like to accomplish next is to extract the data from selected 
fields and append it to a spreadsheet (Calc or Excel) under specific 
columns or, if that is easier, write it to a CSV file which I can use 
later.
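
For the CSV route, one possible sketch using base R's write.table(). The 
helper append_row() and the column names are hypothetical, and the second 
row is placeholder data. The header is written only when the file does 
not yet exist, so repeated calls keep adding rows under the same columns.

```r
# Hypothetical helper: append a one-row data frame to a CSV file,
# writing the column names only when the file is first created
append_row <- function(row, path) {
  new_file <- !file.exists(path)
  write.table(row, file = path, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file,
              qmethod = "double")
}

out <- tempfile(fileext = ".csv")  # stand-in for the real output file

append_row(data.frame(title = "Now they want us to charge our electric cars from litter bins",
                      n_words = 515), out)
append_row(data.frame(title = "Placeholder title", n_words = 320), out)

# Read the file back to check that both rows landed under one header
result <- read.csv(out)
print(result)
```

The resulting file opens directly in Calc or Excel.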

The fields I want to extract are illustrated with reference to the above 
file, viz.:

The title: "Now they want us to charge our electric cars from litter bins"
The name of the newspaper: "Mail on Sunday (London)"
The publication date: "September 24, 2023" (in date format, preferably 
split into month and year; the day is not important)
The section: "NEWS"
The page number(s): "16" (as numeric)
The length: "515" (as numeric)
The author: "Anna Mikhailova"
The subject: this one is conditional. First check the "Subject" section 
for GREENWASHING with a value >= 50% (here it is 51%, so this file would 
be included). If that matches, select the entry with the highest value 
under the "Industry" section (here ELECTRIC MOBILITY (91%)) and append 
both its text and its % value. If there is no match for GREENWASHING, 
append 'Null' and move on to the next file in the directory.
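
The extraction rules above could be sketched in base R roughly as 
follows. The layout of the text is an assumption: the labels "Section:", 
"Length:", "Byline:", "Subject:" and "Industry:", the "Pg." prefix, the 
";" separators and the position of the first three lines are guesses, and 
the PLACEHOLDER entries are made up. The helpers field() and 
parse_terms() are hypothetical names.

```r
# Assumed layout: labels, separators and line order are guesses
lines <- c(
  "Now they want us to charge our electric cars from litter bins",
  "Mail on Sunday (London)",
  "September 24, 2023",
  "Section: NEWS; Pg. 16",
  "Length: 515 words",
  "Byline: Anna Mikhailova",
  "Subject: PLACEHOLDER TOPIC (90%); GREENWASHING (51%)",
  "Industry: PLACEHOLDER INDUSTRY (60%); ELECTRIC MOBILITY (91%)"
)

# Text after "Label:" on the first line that starts with that label
field <- function(lines, label) {
  hit <- grep(paste0("^", label, ":"), lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(paste0("^", label, ":"), "", hit[1]))
}

# Split "TERM (NN%); TERM (NN%)" into terms and numeric percentages
parse_terms <- function(x) {
  items <- trimws(strsplit(x, ";")[[1]])
  data.frame(term = sub(" *\\([0-9]+%\\)$", "", items),
             pct  = as.numeric(sub(".*\\(([0-9]+)%\\)$", "\\1", items)),
             stringsAsFactors = FALSE)
}

# Date: month.name is built in, so this avoids locale-dependent parsing
date_parts <- strsplit(lines[3], "[ ,]+")[[1]]
sec <- field(lines, "Section")

# Subject rule: GREENWASHING >= 50% selects the top Industry entry
subj <- parse_terms(field(lines, "Subject"))
gw <- subj$pct[subj$term == "GREENWASHING"]
if (length(gw) == 1 && gw >= 50) {
  ind <- parse_terms(field(lines, "Industry"))
  top <- ind[which.max(ind$pct), ]
  industry <- top$term
  industry_pct <- top$pct
} else {
  industry <- "Null"
  industry_pct <- NA_real_
}

record <- data.frame(
  title        = lines[1],
  newspaper    = lines[2],
  month        = match(date_parts[1], month.name),
  year         = as.integer(date_parts[3]),
  section      = sub(";.*$", "", sec),
  page         = as.numeric(sub(".*Pg\\. *", "", sec)),
  n_words      = as.numeric(sub(" words$", "", field(lines, "Length"))),
  author       = field(lines, "Byline"),
  industry     = industry,
  industry_pct = industry_pct,
  stringsAsFactors = FALSE
)
print(record)
```

The patterns would need adjusting to whatever read_docx() actually 
returns for these files.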

###########

My working theory is that if I can figure out how to extract these 
fields and append them correctly, the rest should just be a matter of 
wrapping this up in a for loop.
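
A rough sketch of that wrapping step, using only base R. The function 
extract_record() is a hypothetical stand-in that merely counts 
characters, and docs is placeholder text standing in for the content of 
each file. Each pass stores a one-row data frame, and the rows are 
combined and written once at the end.

```r
# Hypothetical per-file extractor: a real one would pull out the fields
extract_record <- function(name, text) {
  data.frame(file = name, n_chars = nchar(text),
             stringsAsFactors = FALSE)
}

# Placeholder content standing in for what each file would yield
docs <- list("a.docx" = "first placeholder text",
             "b.docx" = "second")

records <- list()
for (name in names(docs)) {
  records[[name]] <- extract_record(name, docs[[name]])
}

# Stack the one-row data frames and write them out in a single step
all_records <- do.call(rbind, records)
write.csv(all_records, tempfile(fileext = ".csv"), row.names = FALSE)
print(all_records)
```

Collecting the rows in a list and binding them once avoids growing a 
data frame inside the loop, which gets slow with many files.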

However, I am struggling to get my head around the extraction and append 
part. If I can get it to work for one of these fields, I suspect that I 
can repeat the basic syntax to extract and append the remaining fields.

Therefore, if someone can either suggest a syntax or point me to a 
useful tutorial, that would be splendid.

Thank you in anticipation.

Best wishes
Andy
