[R] tm: Why does adding local metadata take so long?

Richard R. Liu richard.liu at pueo-owl.ch
Wed Oct 14 00:44:15 CEST 2009


I'm running tm 0.5 on R 2.9.2 on an early-2009 17" unibody MacBook Pro
(2.93 GHz, 4 GB RAM).  I have a directory of 1697 plain text files
that I want to analyze with the tm package.  I have read the
documents into a corpus, Corpus_3compounds, as follows:

# Assign directory to a character vector
dirName <- "/Volumes/RDR Test Documents/3Compounds/TXT"

# Put the paths of the .txt files in the directory into a vector
Files_3compounds <- dir(dirName,
	full.names = TRUE,
	pattern = "_.*\\.txt",
	ignore.case = TRUE)

# Create a DirSource object for the same directory and pattern
Dir_3compounds <- DirSource(dirName,
	pattern = "_.*\\.txt",
	ignore.case = TRUE,
	encoding = "latin1")

# Read the .txt files into a volatile corpus
Corpus_3compounds <- Corpus(Dir_3compounds,
	readerControl = list(reader = readPlain,
		language = "en",
		load = TRUE))

I have the metadata for these text documents in an Excel table, which  
I have read into Metadata_3compounds as follows:

# Read the metadata into a data frame
Metadata_3compounds <- read.xls("/Volumes/RDR Test Documents/3Compounds/3compounds.xls",
	sheet = 3, verbose = TRUE, pattern = "Document",
	method = "tab", perl = "perl")

Since the metadata and the text documents in the corpus are not in the  
same order, I have to create an index between the two.  Basically, the  
filename contains the document ID.

# Index of the metadata for a document in the corpus in Metadata_3compounds
iMyMetadata <- match(gsub("^(.*)/_(.*)\\.txt$", "\\2", Files_3compounds, perl = TRUE),
	Metadata_3compounds$Document.No)
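
To make the indexing step concrete, here is a minimal sketch of the same gsub/match idea on toy data.  The paths, document numbers, and titles below are made up for illustration (the names paths, meta_df, ids, and idx are not part of my actual script), and the sketch assumes, as in my case, that each file name is an underscore followed by the document ID.

```r
# Hypothetical file paths, deliberately not in metadata order
paths <- c("/tmp/TXT/_D003.txt", "/tmp/TXT/_D001.txt", "/tmp/TXT/_D002.txt")

# Hypothetical metadata table standing in for Metadata_3compounds
meta_df <- data.frame(Document.No = c("D001", "D002", "D003"),
	Title = c("first", "second", "third"),
	stringsAsFactors = FALSE)

# Strip the directory, the leading underscore, and the extension
ids <- gsub("^(.*)/_(.*)\\.txt$", "\\2", paths, perl = TRUE)

# For each file, the row of meta_df that holds its metadata
idx <- match(ids, meta_df$Document.No)

# An NA in idx would mean a file without a matching metadata row
stopifnot(!any(is.na(idx)))

# Metadata reordered to follow the order of the files
titles_in_file_order <- meta_df$Title[idx]
```

Here idx comes out as 3, 1, 2, so meta_df$Title[idx] lists the titles in the order of the files rather than in the order of the table.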

The metadata dataframe has the following names:

 [1] "Document.No"               ...
 [5] ...
 [9] "total"                     "SET"                       "CAT1"                      "CAT2"
[13] "Title"                     "Approved.By"               "Author.s."                 "Center"
[17] "Comment"                   "Date.Approved"             "Date.Submitted"            "Department"
[21] "Division"                  "Document.Class"            "Document.Date"             "Document.No.1"
[25] "Language"                  "Pages"                     "Project.ID..Theme.Number." "Rapid.Document"
[29] "Report.No"                 "Study.Protocol.No"         "Submitted.By"              "Substance.ID"

Now I want to assign this metadata to the local metadata of the  
documents in the corpus, for example as follows:

# Transfer metadata to local
meta(Corpus_3compounds, type = "local", tag = "DocId") <-
	Metadata_3compounds$Document.No[iMyMetadata]

I let this statement run for more than twenty minutes before
deciding to stop it.  I cannot imagine that it should take
anywhere near that long.  If I assign the same vector to the indexed
metadata of the corpus instead, it finishes in little more than the
blink of an eye.  When I limit the number of documents to five, I can
verify that the code is correct.

QUESTIONS: Is it normal for this operation to take so long on a corpus  
of 1697 documents?  Is there a quicker way of accomplishing the same  
thing?  I really do want to store the metadata with the document,  
i.e., as local metadata.  I am uncertain about the advantages, but I  
would think that, if I delete or filter out a document, the metadata  
is deleted or filtered as well.  Furthermore, when I cluster the  
documents or train a machine learner on them, I could imagine -- but I  
do not know for sure -- that it might be easier to use local metadata  
as a feature, whereas that might not be so easy with indexed metadata.

Regards,
Richard Liu



