[R] Help request: Parsing docx files for key words and appending to a spreadsheet

Wed Jan 3 13:26:16 CET 2024

If you do something like this:

	for i in $(pandoc --list-output-formats); do
		pandoc -f docx -t "$i" -o "test.$i" \
			"Now they want us to charge our electric cars from litter bins.docx"
	done

you get approximately 65 output formats, from which you can pick one that
is easy to write a little parser for. The dokuwiki one, for example, uses
long lines, which makes parsing easier.
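
A minimal sketch of such a little parser, in plain sh with awk. It assumes
that pandoc's plain-text output (with --wrap=none, so each paragraph stays
on one long line) contains lines of the form "Label: value" and that the
subjects are separated by semicolons, as in "Greenwashing (51%)". The
labels, the helper names (get_field, filter_subjects, csv_quote,
append_article) and the output file are my assumptions, not something taken
from your document, so check them against the real converted text first.

```shell
#!/bin/sh
# Sketch only. The "Label: value" layout and the labels used below are
# assumptions; compare them with what pandoc really produces for your files.

# get_field LABEL: read text on stdin and print the value of the first line
# that starts with "LABEL:" (case-insensitive). Prints nothing if absent.
get_field() {
	awk -v label="$1" '
		index(tolower($0), tolower(label) ":") == 1 {
			value = substr($0, length(label) + 2)
			sub(/^[ \t]+/, "", value)
			print value
			exit
		}'
}

# filter_subjects: read a "Name (NN%); Name (NN%)" list on stdin and keep
# only the entries with NN >= 50. Prints "nil" if none qualify.
filter_subjects() {
	awk -v min=50 '
		{
			n = split($0, parts, /; */)
			for (i = 1; i <= n; i++) {
				if (match(parts[i], /\([0-9]+%\)/)) {
					pct = substr(parts[i], RSTART + 1, RLENGTH - 3) + 0
					if (pct >= min) {
						if (out != "") out = out "; "
						out = out parts[i]
					}
				}
			}
		}
		END { print (out == "" ? "nil" : out) }'
}

# csv_quote STRING: wrap a value in double quotes, doubling embedded quotes.
csv_quote() {
	printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# append_article FILE.docx OUT.csv: convert one article and append one row
# (file name, section, length, byline, subject) to the CSV file.
append_article() {
	text=$(pandoc -f docx -t plain --wrap=none "$1") || return 1
	subject=$(printf '%s\n' "$text" | get_field Subject | filter_subjects)
	printf '%s,%s,%s,%s,%s\n' \
		"$(csv_quote "$1")" \
		"$(csv_quote "$(printf '%s\n' "$text" | get_field Section)")" \
		"$(csv_quote "$(printf '%s\n' "$text" | get_field Length)")" \
		"$(csv_quote "$(printf '%s\n' "$text" | get_field Byline)")" \
		"$(csv_quote "$subject")" \
		>> "$2"
}
```

To run it over a folder, something like

	for f in ./*.docx; do append_article "$f" articles.csv; done

should give a CSV file that Calc or Excel can open. Title, newspaper and
date are left out here because I would guess they are not labelled lines
and have to be picked by position instead.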

el

On 2023-12-30 13:57 , Andy wrote:
> Good idea, El - thanks.
>
> The link is
> https://docs.google.com/document/d/1QwuaWZk6tYlWQXJ3WLczxC8Cda6zVERk/edit?usp=sharing&ouid=103065135255080058813&rtpof=true&sd=true
>
>  This is helpful.
>
> From the article, which is typical of Lexis+ output, I want to
> extract the following fields and append to a Calc/ Excel spreadsheet.
> Given the volume of articles I have to work through, if this can be
> iterative and semi-automatic, that would be a godsend and I might be
> able to do some actual research on the articles before I reach my
> pensionable age. :-)
>
> Title, Newspaper, Date, Section and page number, Length, Byline, Subject
> (only if the threshold of coverage for a specific subject, >=50%, is
> reached, e.g. Greenwashing (51%); if not, enter 'nil' and move on to
> the next article in the folder)
>
> This is the ambition. I am clearly a long way short of that though.
>
> Many thanks. Andy