[R] tm: Why does adding local metadata take so long?
Richard R. Liu
richard.liu at pueo-owl.ch
Wed Oct 14 00:44:15 CEST 2009
I'm running tm 0.5 on R 2.9.2 on a MacBook Pro 17" unibody (early 2009,
2.93 GHz, 4 GB RAM). I have a directory with 1697 plain text files on
the Mac that I want to analyze with the tm package. I have read the
documents into a corpus, Corpus_3compounds, as follows:
library(tm)

# Assign directory to a character vector
dirName <- "/Volumes/RDR Test Documents/3Compounds/TXT"

# Put the paths of the .txt files in the directory into a vector
Files_3compounds <- dir(dirName,
                        full.names = TRUE,
                        pattern = "_.*\\.txt",
                        ignore.case = TRUE)

# Use that vector to create a DirSource object
Dir_3compounds <- DirSource(dirName,
                            pattern = "_.*\\.txt",
                            ignore.case = TRUE,
                            encoding = "latin1")

# Read the .txt files into a volatile corpus
Corpus_3compounds <- Corpus(Dir_3compounds,
                            readerControl = list(reader = readPlain,
                                                  language = "en",
                                                  load = TRUE))
I have the metadata for these text documents in an Excel table, which
I have read into Metadata_3compounds as follows:
# Read the metadata into a data frame (read.xls() comes from the gdata package)
library(gdata)
Metadata_3compounds <- read.xls("/Volumes/RDR Test Documents/
    sheet = 3, verbose = TRUE, pattern = "Document",
    method = "tab", perl = "perl")
Since the metadata and the text documents in the corpus are not in the
same order, I have to create an index between the two. Basically, the
filename contains the document ID.
# Index of the metadata row for each document in the corpus
iMyMetadata <- match(gsub("^(.*)/_(.*)\\.txt$", "\\2",
                          Files_3compounds, perl = TRUE),
                     Metadata_3compounds$Document.No)
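For illustration, this is what the regular expression extracts from a
file name (the document number 4711 here is made up):

# The pattern drops the directory part and the leading underscore,
# keeping only the document ID in front of ".txt"
gsub("^(.*)/_(.*)\\.txt$", "\\2",
     "/Volumes/RDR Test Documents/3Compounds/TXT/_4711.txt", perl = TRUE)
# [1] "4711"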
The metadata data frame has the following names:

"Document.No" ... "total" "SET" "Title" "Approved.By"
"Comment" "Date.Approved" "Division" "Document.Class"
"Language" "Pages" "Report.No" "Study.Protocol.No"
Now I want to assign this metadata to the local metadata of the
documents in the corpus, for example as follows:
# Transfer metadata to local
meta(Corpus_3compounds, type = "local", tag = "DocId") <-
    Metadata_3compounds$Document.No[iMyMetadata]
I have let this statement run for more than twenty minutes before
deciding to stop it. I just cannot imagine that it should take
anywhere near as long. If I assign the same vector to the indexed
metadata of the corpus instead, it finishes in little more than the
blink of an eye. When I limit the number of documents to five, I can
verify that the code is correct.
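For comparison, the fast indexed assignment I refer to above is essentially
the same statement with type = "indexed" (quoted from memory, so treat it
as approximate):

# Assigning the same vector to the indexed (DMetaData) metadata is fast
meta(Corpus_3compounds, type = "indexed", tag = "DocId") <-
    Metadata_3compounds$Document.No[iMyMetadata]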
QUESTIONS: Is it normal for this operation to take so long on a corpus
of 1697 documents? Is there a quicker way of accomplishing the same
thing (perhaps something along the lines of the loop sketched below)?
I really do want to store the metadata with the documents, i.e., as
local metadata. I am uncertain about the advantages, but I would think
that, if I delete or filter out a document, its metadata is deleted or
filtered out as well. Furthermore, when I cluster the documents or
train a machine learner on them, I could imagine -- but I do not know
for sure -- that it might be easier to use local metadata as a
feature, whereas that might not be so easy with indexed metadata.
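One idea I have not yet tried is to set the local metadata one document
at a time, in the hope of avoiding whatever corpus-wide copying the
statement above triggers. I am not sure this is the right accessor for
tm 0.5, nor whether it would actually be faster, so please take it only
as a sketch of what I have in mind:

# Untested sketch: assign the matched document IDs document by document
DocIds <- Metadata_3compounds$Document.No[iMyMetadata]
for (i in seq_along(Corpus_3compounds)) {
    meta(Corpus_3compounds[[i]], tag = "DocId") <- DocIds[i]
}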