[R] Help request: Parsing docx files for key words and appending to a spreadsheet

Dr Eberhard W Lisse no@p@m @end|ng |rom ||@@e@NA
Fri Dec 29 21:25:06 CET 2023


I would also look at https://pandoc.org perhaps which can
export a number of formats...

And for spreadsheets https://github.com/jqnatividad/qsv is my
goto weapon.  Can also read and write XLSX and others.

A sample document or two would always be helpful...

el

On 29/12/2023 21:01, CALUM POLWART wrote:
> It sounded like he looked at officeR but I would agree
> 
> content <- officer::docx_summary("filename.docx")
> 
> Would get the text content into an object called content.
> 
> That object is a data.frame so you can then manipulate it.
> To be more specific, we might need an example of the DF
[...]
>> On Fri, Dec 29, 2023 at 10:14 AM Andy <phaedrusv using gmail.com>
>> wrote:
[...]
>>> I'd like to be able to accomplish the following:
>>>
>>> (1) Append the title, the month, the author, the number of
>>> words, and page number(s) to a spreadsheet
>>>
>>> (2) Read each article and extract keywords (in the docs,
>>> these are listed in 'Subject' section as a list of
>>> keywords with a percentage showing the extent to which the
>>> keyword features in the article (e.g., FAST FASHION (72%))
>>> and to append the keyword and the % coverage to the same
>>> row in the spreadsheet.  However, I want to ensure that
>>> the keyword coverage meets the threshold of >= 50%; if
>>> not, then pass onto the next article in the directory.
>>> Rinse and repeat for the entire directory.
[...]



More information about the R-help mailing list