\documentclass[a4paper]{article} \usepackage[margin=2cm]{geometry} \usepackage[round]{natbib} \usepackage{url} \newcommand{\acronym}[1]{\textsc{#1}} \newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\proglang}[1]{\textsf{#1}} \let\code\texttt %% \VignetteIndexEntry{Extensions} \begin{document} <>= library("tm") library("xml2") @ \title{Extensions\\How to Handle Custom File Formats} \author{Ingo Feinerer} \maketitle \section*{Introduction} The possibility to handle custom file formats is a substantial feature in any modern text mining infrastructure. \pkg{tm} has been designed aware of this aspect from the beginning on, and has modular components which allow for extensions. A general explanation of \pkg{tm}'s extension mechanism is described by~\citet[Sec.~3.3]{Feinerer_etal_2008}, with an updated description as follows. \section*{Sources} A source abstracts input locations and provides uniform methods for access. Each source must provide implementations for following interface functions: \begin{description} \item[close()] closes the source and returns it, \item[eoi()] returns \code{TRUE} if the end of input of the source is reached, \item[getElem()] fetches the element at the current position, \item[length()] gives the number of elements, \item[open()] opens the source and returns it, \item[reader()] returns a default reader for processing elements, \item[pGetElem()] (optional) retrieves all elements in parallel at once, and \item[stepNext()] increases the position in the source to the next element. \end{description} Retrieved elements must be encapsulated in a list with the named components \code{content} holding the document and \code{uri} pointing to the origin of the document (e.g., a file path or a \acronym{URL}; \code{NULL} if not applicable or unavailable). Custom sources are required to inherit from the virtual base class \code{Source} and typically do so by extending the functionality provided by the simple reference implementation \code{SimpleSource}. E.g., a simple source which accepts an \proglang{R} vector as input could be defined as <>= VecSource <- function(x) SimpleSource(length = length(x), content = as.character(x), class = "VecSource") @ which overrides a few defaults (see \code{?SimpleSource} for defaults) and stores the vector in the \code{content} component. The functions \code{close()}, \code{eoi()}, \code{open()}, and \code{stepNext()} have reasonable default methods already for the \code{SimpleSource} class: the identity function for \code{open()} and \code{close()}, incrementing a position counter for \code{stepNext()}, and comparing the current position with the number of available elements as claimed by \code{length()} for \code{eoi()}, respectively. So we only need custom methods for element access: <>= getElem.VecSource <- function(x) list(content = x$content[x$position], uri = NULL) pGetElem.VecSource <- function(x) lapply(x$content, function(y) list(content = y, uri = NULL)) @ \section*{Readers} Readers are functions for extracting textual content and metadata out of elements delivered by a source and for constructing a text document. Each reader must accept following arguments in its signature: \begin{description} \item[elem] a list with the named components \code{content} and \code{uri} (as delivered by a source via \code{getElem()} or \code{pGetElem()}), \item[language] a string giving the language, and \item[id] a character giving a unique identifier for the created text document. \end{description} The element \code{elem} is typically provided by a source whereas the language and the identifier are normally provided by a corpus constructor (for the case that \code{elem\$content} does not give information on these two essential items). In case a reader expects configuration arguments we can use a function generator. A function generator is indicated by inheriting from class \code{FunctionGenerator} and \code{function}. It allows us to process additional arguments, store them in an environment, return a reader function with the well-defined signature described above, and still be able to access the additional arguments via lexical scoping. All corpus constructors in package \pkg{tm} check the reader function for being a function generator and if so apply it to yield the reader with the expected signature. E.g., the reader function \code{readPlain()} is defined as <>= readPlain <- function(elem, language, id) PlainTextDocument(elem$content, id = id, language = language) @ For examples on readers using the function generator please have a look at \code{?readPDF} or \code{?readPDF}. However, for many cases, it is not necessary to define each detailed aspect of how to extend \pkg{tm}. Typical examples are \acronym{XML} files which are very common but can be rather easily handled via standard conforming \acronym{XML} parsers. The aim of the remainder in this document is to give an overview on how simpler, more user-friendly, forms of extension mechanisms can be applied in \pkg{tm}. \section*{Custom Data Formats} A general situation is that you have gathered together some information into a tabular data structure (like a data frame or a list matrix) that suffices to describe documents in a corpus. However, you do not have a distinct file format because you extracted the information out of various resources, e.g., as delivered by \code{readtext()} in package \pkg{readtext}. Now you want to use your information to build a corpus which is recognized by \pkg{tm}. We assume that your information is put together in a data frame. E.g., consider the following example: <>= df <- data.frame(doc_id = c("doc 1" , "doc 2" , "doc 3" ), text = c("content 1", "content 2", "content 3"), title = c("title 1" , "title 2" , "title 3" ), authors = c("author 1" , "author 2" , "author 3" ), topics = c("topic 1" , "topic 2" , "topic 3" ), stringsAsFactors = FALSE) @ We want to map the data frame rows to the relevant entries of a text document. An entry \code{text} in the mapping will be matched to fill the actual content of the text document, \code{doc\_id} will be used as document ID, all other fields will be used as metadata tags. So we can construct a corpus out of the data frame: <<>>= (corpus <- Corpus(DataframeSource(df))) corpus[[1]] meta(corpus[[1]]) @ \section*{Custom XML Sources} Many modern file formats already come in \acronym{XML} format which allows to extract information with any \acronym{XML} conforming parser, e.g., as implemented in \proglang{R} by the \pkg{xml2} package. Now assume we have some custom \acronym{XML} format which we want to access with \pkg{tm}. Then a viable way is to create a custom \acronym{XML} source which can be configured with only a few commands. E.g., have a look at the following example: <>= custom.xml <- system.file("texts", "custom.xml", package = "tm") print(readLines(custom.xml), quote = FALSE) @ As you see there is a top-level tag stating that there is a corpus, and several document tags below. In fact, this structure is very common in \acronym{XML} files found in text mining applications (e.g., both the Reuters-21578 and the Reuters Corpus Volume 1 data sets follow this general scheme). In \pkg{tm} we expect a source to deliver self-contained blocks of information to a reader function, each block containing all information necessary such that the reader can construct a (subclass of a) \code{TextDocument} from it. The \code{XMLSource()} function can now be used to construct a custom \acronym{XML} source. It has three arguments: \begin{description} \item[x] a character giving a uniform resource identifier, \item[parser] a function accepting an \acronym{XML} document (as delivered by \code{read\_xml()} in package \pkg{xml2}) as input and returning a \acronym{XML} elements/nodes (each element/node will then be delivered to the reader as a self-contained block), \item[reader] a reader function capable of turning \acronym{XML} elements/nodes as returned by the parser into a subclass of \code{TextDocument}. \end{description} E.g., a custom source which can cope with our custom \acronym{XML} format could be: <>= mySource <- function(x) XMLSource(x, parser = xml2::xml_children, reader = myXMLReader) @ As you notice in this example we also provide a custom reader function (\code{myXMLReader}). See the next section for details. \section*{Custom XML Readers} As we saw in the previous section we often need a custom reader function to extract information out of \acronym{XML} chunks (typically as delivered by some source). Fortunately, \pkg{tm} provides an easy way to define custom \acronym{XML} reader functions. All you need to do is to provide a so-called \emph{specification}. Let us start with an example which defines a reader function for the file format from the previous section: <>= myXMLReader <- readXML( spec = list(author = list("node", "writer"), content = list("node", "description"), datetimestamp = list("function", function(x) as.POSIXlt(Sys.time(), tz = "GMT")), description = list("node", "@short"), heading = list("node", "caption"), id = list("function", function(x) tempfile()), origin = list("unevaluated", "My private bibliography"), type = list("node", "type")), doc = PlainTextDocument()) @ Formally, \code{readXML()} is the relevant function which constructs an reader. The customization is done via the first argument \code{spec}, the second provides an empty instance of the document which should be returned (augmented with the extracted information out of the \acronym{XML} chunks). The specification must consist of a named list of lists each containing two character vectors. The constructed reader will map each list entry to the content or a metadatum of the text document as specified by the named list entry. Valid names include \code{content} to access the document's content, and character strings which are mapped to metadata entries. Each list entry must consist of two character vectors: the first describes the type of the second argument, and the second is the specification entry. Valid combinations are: \begin{description} \item[\code{type = "node", spec = "XPathExpression"}] the XPath (1.0) expression \code{spec} extracts information out of an \acronym{XML} node (as seen for \code{author}, \code{content}, \code{description}, \code{heading}, and \code{type} in our example specification). \item[\code{type = "function", spec = function(doc) \ldots}] The function \code{spec} is called, passing over the \acronym{XML} document (as delivered by \code{read\_xml()} from package \pkg{xml2}) as first argument (as seen for \code{datetimestamp} and \code{id}). As you notice in our example nobody forces us to actually use the passed over document, instead we can do anything we want (e.g., create a unique character vector via \code{tempfile()} to have a unique identification string). \item[\code{type = "unevaluated", spec = "String"}] the character vector \code{spec} is returned without modification (e.g., \code{origin} in our specification). \end{description} Now that we have all we need to cope with our custom file format, we can apply the source and reader function at any place in \pkg{tm} where a source or reader is expected, respectively. E.g., <<>>= corpus <- VCorpus(mySource(custom.xml)) @ constructs a corpus out of the information in our \acronym{XML} file: <<>>= corpus[[1]] meta(corpus[[1]]) @ \bibliographystyle{abbrvnat} \bibliography{references} \end{document}