[R] sorting during xtabs? sorting by "individual" order?
Fridolin Wild
fridolin.wild at wu-wien.ac.at
Wed Nov 9 00:03:51 CET 2005
Hey all together,
while refactoring a package (before it is released),
I ran across the following problem.
I have two directories with different text files.
I want to read the first and construct a document-term
matrix from it (every term, i.e. word, in a row, every file in
a column, with occurrence frequencies as the values).
The second directory contains different files. It
needs to be read in to also construct a document-term
matrix -- however, in the same "term-order" to enable
similarity comparisons in a vector space of the
same format.
Let's make a (fake) example:
(1) support function
# directory 1 contains 2 files (F1 & F2):
F1 = c("word4", "word3", "word2")
F2 = c("word1", "word4", "word2")
# directory 2 also contains 2 files (F3 & F4):
F3 = c("word1", "word2", "bla")
F4 = c("word1", "word2", "word3")
# I read in the first directory, file by file,
# and create triples of the format (file, word, 1)
F1tab = sort(table(F1), decreasing = TRUE)
F2tab = sort(table(F2), decreasing = TRUE)
# and create a dataframe
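# as.vector() drops the table class so that Freq is a plain count column
F1frame = data.frame( docs = "F1", terms = names(F1tab),
Freq = as.vector(F1tab), row.names = NULL)
F2frame = data.frame( docs = "F2", terms = names(F2tab),
Freq = as.vector(F2tab), row.names = NULL)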
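The per-file step above could be wrapped in a small helper so that it is written only once. This is just a sketch: the name file_triples and its arguments are made up for illustration, and it assumes every file contains at least one word.

```r
# Hypothetical helper (name and arguments are illustrative only):
# turns the words of one file into (docs, terms, Freq) triples.
file_triples <- function(doc, words) {
  tab <- sort(table(words), decreasing = TRUE)
  data.frame(docs = doc,
             terms = names(tab),
             Freq = as.vector(tab),      # plain counts instead of a table object
             row.names = NULL,
             stringsAsFactors = FALSE)
}

F1 <- c("word4", "word3", "word2")
F2 <- c("word1", "word4", "word2")

# one data frame holding the triples of both files
triples <- rbind(file_triples("F1", F1), file_triples("F2", F2))
```

With a helper like this, the list handed to rbind could be built with lapply over the file names instead of by hand.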
(2) textmatrix function
The data frames of all files are then bound together and
converted with xtabs into a document-term matrix:
dummy = list(F1frame, F2frame)
dtm = t(xtabs(Freq ~ ., data = do.call("rbind", dummy)))
=>
       docs
terms   F1 F2
  word2  1  1
  word3  1  0
  word4  1  1
  word1  0  1
Now, when I want to re-use this to construct another
document-term matrix from files F3 & F4 -- with the same terms
in exactly the same order -- I first need to add
F3clean = F3[F3 %in% rownames(dtm)]
F4clean = F4[F4 %in% rownames(dtm)]
to keep "unwanted" terms from getting into the tabs.
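A small sketch of why the %in% filter suits this step: it keeps repeated words, so the frequencies survive, whereas intersect() would collapse them. The vector vocab stands in for rownames(dtm), and F3x is a made-up variant of F3 with one word repeated.

```r
vocab <- c("word2", "word3", "word4", "word1")   # stands in for rownames(dtm)
F3x   <- c("word1", "word2", "bla", "word2")     # hypothetical file, "word2" occurs twice

F3x_clean <- F3x[F3x %in% vocab]      # drops "bla", keeps both "word2" entries
F3x_set   <- intersect(F3x, vocab)    # would lose the second "word2"

table(F3x_clean)
```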
And here is my problem:
I need to reformat the resulting document-term matrix
(as produced by running step 2 again with F3clean and
F4clean) so that its rows follow the order given by
rownames(dtm) of the first directory.
How can I do this cheaply (the matrices I have to
deal with are usually really big)? Hopefully just
by adding something to the xtabs call?
As an example of what I need: dtm2 should
look exactly like this (doc-order is not important):
=>
       docs
terms   F3 F4
  word2  1  1
  word3  0  1
  word4  0  0
  word1  1  1
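One direction that might work, sketched here under assumptions: xtabs lays out its rows by the levels of the factor, and by default keeps unused levels, so turning the terms into a factor whose levels are the rownames of the first matrix should fix both the order and the missing rows. The vector vocab stands in for rownames(dtm), and triples2 and dtm2 are illustrative names.

```r
vocab <- c("word2", "word3", "word4", "word1")   # stands in for rownames(dtm)

F3 <- c("word1", "word2", "bla")
F4 <- c("word1", "word2", "word3")

# one row per word occurrence; words outside vocab become NA in the factor
triples2 <- data.frame(
  docs  = rep(c("F3", "F4"), times = c(length(F3), length(F4))),
  terms = factor(c(F3, F4), levels = vocab),
  Freq  = 1
)
triples2 <- triples2[!is.na(triples2$terms), ]   # drop the "unwanted" terms

# rows follow the factor levels; unused levels (word4) are kept as zero rows
dtm2 <- xtabs(Freq ~ terms + docs, data = triples2)
dtm2
```

Because the order comes from the factor levels, no reordering of the finished matrix is needed, and the separate %in% step is replaced by the NA filter. Whether this is cheap enough for really big matrices is untested here.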
Can anybody help me?
Best,
Fridolin
--
Fridolin Wild, Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration (WUW),
Augasse 2-6, A-1090 Wien, Austria
fon +43-1-31336-4488, fax +43-1-31336-746