[R] sorting during xtabs? sorting by "individual" order?

Wed Nov 9 00:03:51 CET 2005

Hi all,

while refactoring a package (before it is released),
I ran across the following problem.

I have two directories with different text files.
I want to read the first and construct a document-term
matrix from it (every term, i.e. word, in a row, every file in
a column, with the occurrence frequencies as the values).

The second directory contains different files. It
needs to be read in to also construct a document-term
matrix -- however, in the same "term-order" to enable
similarity comparisons in a vector space of the
same format.
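
To illustrate why the term order matters, here is a minimal sketch. The
`cosine` helper and the toy vectors are made up for this illustration and
are not part of the package; the point is only that a position-wise
comparison is meaningful when position i refers to the same term in both
vectors.

```r
# Hypothetical helper: cosine similarity of two term-frequency vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

terms <- c("word1", "word2", "word3")
docA  <- c(word1 = 1, word2 = 0, word3 = 2)
docB  <- c(word3 = 2, word1 = 1, word2 = 0)  # same counts, different term order

# Compared position by position, the terms do not line up.
misaligned <- cosine(docA, docB)

# Reordering docB by term name first compares like with like.
aligned <- cosine(docA, docB[terms])

print(c(misaligned = misaligned, aligned = aligned))
```

The two documents have identical term counts, yet only the aligned
comparison treats them as the same vector.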

Let's make a (fake) example:

(1) support function

    # directory 1 contains 2 files (F1 & F2):
       F1 = c("word4", "word3", "word2")
       F2 = c("word1", "word4", "word2")

    # directory 2 also contains 2 files (F3 & F4):
       F3 = c("word1", "word2", "bla")
       F4 = c("word1", "word2", "word3")

    # I read in the first directory, file by file, and
    # create triples of the format (file, word, count)

        F1tab = sort(table(F1), decreasing = TRUE)
        F2tab = sort(table(F2), decreasing = TRUE)

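    # and create a data frame; as.vector() keeps Freq a plain count
    # column even where sort() returns a table, and
    # stringsAsFactors = TRUE keeps docs and terms as factors

        F1frame = data.frame( docs = "F1", terms = names(F1tab),
                              Freq = as.vector(F1tab), row.names = NULL,
                              stringsAsFactors = TRUE)
        F2frame = data.frame( docs = "F2", terms = names(F2tab),
                              Freq = as.vector(F2tab), row.names = NULL,
                              stringsAsFactors = TRUE)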

(2) textmatrix function

    The frames of all files are then bound together and
    converted with xtabs into a document-term matrix:

        dummy = list(F1frame, F2frame)
        dtm = t(xtabs(Freq ~ ., data = do.call("rbind", dummy)))

        =>
               docs
        terms   F1 F2
          word2  1  1
          word3  1  0
          word4  1  1
          word1  0  1

    Now, when I want to reuse this to construct another
    document-term matrix from files F3 and F4, with the same terms
    in exactly the same order, I first need to add

        F3clean = F3[F3 %in% rownames(dtm)]
        F4clean = F4[F4 %in% rownames(dtm)]

    to keep "unwanted" terms from getting into the tabs.

    And here is my problem:

    I need to reformat the resulting document-term matrix
    (as produced by running step 2 again with F3clean and
    F4clean) so that its rows match the order of
    rownames(dtm) from the first directory.

    How can I do this cheaply (the matrices I have to
    deal with are usually really big)? Hopefully just
    by adding something to the xtabs call?

    To give an example of what I need: dtm2 should
    look exactly like this (the document order is not important):

        =>
               docs
        terms   F3 F4
          word2  1  1
          word3  0  1
          word4  0  0
          word1  1  1
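
    One possible direction, as a minimal sketch rather than a verified
    solution for large matrices: code the terms as a factor whose levels
    are rownames(dtm) before calling xtabs. This assumes that xtabs keeps
    unused factor levels (its default, drop.unused.levels = FALSE) and
    orders the rows by level order. The helper name `file_frame` and the
    variable `vocab` are made up for this illustration; `vocab` stands in
    for rownames(dtm).

```r
# Term order of the first matrix (stands in for rownames(dtm)).
vocab <- c("word2", "word3", "word4", "word1")

F3 <- c("word1", "word2", "bla")
F4 <- c("word1", "word2", "word3")

# Hypothetical helper: one (docs, terms, Freq) frame per file, with the
# terms coded as a factor on the fixed vocabulary.
file_frame <- function(doc, words, vocab) {
  tab <- table(words[words %in% vocab])   # drop terms not in the vocabulary
  data.frame(docs  = rep(doc, length(tab)),
             terms = factor(names(tab), levels = vocab),
             Freq  = as.vector(tab),
             row.names = NULL)
}

frames <- list(file_frame("F3", F3, vocab), file_frame("F4", F4, vocab))

# Rows follow the factor levels, so terms that never occur in the new
# files still get an all-zero row.
dtm2 <- t(xtabs(Freq ~ ., data = do.call("rbind", frames)))
print(dtm2)
```

    Fixing the levels up front avoids reordering the finished matrix
    afterwards, and terms from the first directory that do not occur in
    the second one are kept as zero rows.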

    Can anybody help me?

Best,
Fridolin

-- 
Fridolin Wild, Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration (WUW),
Augasse 2-6, A-1090 Wien, Austria
fon +43-1-31336-4488, fax +43-1-31336-746