[R-sig-hpc] Handling data with thousands of variables

Sun Jun 26 13:06:14 CEST 2011

On Sun, 2011-06-26 at 12:52 +0200, Håvard Wahl Kongsgård wrote:
> Thanks, I am not very good at describing issues. Your assumptions are
> correct; however, in my case I have 20 000 different keywords and
> 10 000 000 records, so it's a big data set.

OK, so then I would definitely reiterate the suggestion to pre-process
the tuples into factors, so that each of your 20000 keywords will have a
numeric representation.

> I don't use R for programming, but are slots more or less a dictionary
> class(python) in R?

Strictly speaking, 'slots' are the named components of objects in R's
formal S4 class system (accessed with @), but many people (including me)
use the word 'slot' generically to refer to named list elements in
general.  Lists in R can contain elements that hold any other arbitrary
class of object/data, and so are useful for 'mixed type' collections; a
named list behaves roughly like a Python dictionary keyed by name.
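
As a rough illustration of a named list used like a small dictionary,
here is a minimal sketch.  The element names and values are invented for
the example and are not taken from your data.

```r
# A named list holding elements of different types (all values are
# hypothetical, for illustration only).
rec <- list(response = 1.5, keywords = c("alpha", "beta", "gamma"))

rec$response             # look up an element by name with $
rec[["keywords"]]        # or with [[ ]], which also accepts a variable
rec$note <- "any type"   # add a new element of yet another type
names(rec)               # the 'keys' of the list
```

Lookup by name in a plain list is a linear search over the names, so
this is convenient for small collections rather than a hashed
dictionary.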

Given the additional information you've provided, and making the further
assumption that your tuples are always the same length (three keywords):

I would probably first construct the factor levels for your keywords;
see ?factor
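
A minimal sketch of that step, assuming the three keyword positions
arrive as character vectors (kw1, kw2 and kw3 are made-up names and
values).  Building one shared set of levels means a given keyword gets
the same integer code whichever position it appears in.

```r
# Hypothetical raw keyword columns, one per tuple position.
kw1 <- c("apple", "pear", "apple")
kw2 <- c("pear", "fig", "fig")
kw3 <- c("fig", "apple", "pear")

# One shared level set across all three positions.
lev <- sort(unique(c(kw1, kw2, kw3)))

# Each column becomes a factor over the same levels.
f1 <- factor(kw1, levels = lev)
f2 <- factor(kw2, levels = lev)
f3 <- factor(kw3, levels = lev)

# The integer codes are the numeric representation of the keywords.
codes1 <- as.integer(f1)
```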

I would 'flatten' the data into a four-column representation

response,tuple1,tuple2,tuple3

using the integer factor codes for each keyword in the tuple.  This
could be stored using either a data.frame or a matrix.  If you use the
pure numeric representation, a matrix will be faster than a data.frame.
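
The flattening might look roughly like the sketch below.  The input
data.frame 'raw' and its column names are assumptions made for the
example; match() against a shared level vector gives the same integer
codes that factor(x, levels = lev) would.

```r
# Made-up example input: a response plus three keywords per record.
raw <- data.frame(
  response = c(0.2, 1.7, 0.9),
  kw1 = c("apple", "pear", "apple"),
  kw2 = c("pear", "fig", "fig"),
  kw3 = c("fig", "apple", "pear"),
  stringsAsFactors = FALSE
)

# Shared keyword levels across all three tuple positions.
lev <- sort(unique(unlist(raw[c("kw1", "kw2", "kw3")])))

# Four-column numeric matrix: response, tuple1, tuple2, tuple3.
m <- cbind(
  response = raw$response,
  tuple1   = match(raw$kw1, lev),
  tuple2   = match(raw$kw2, lev),
  tuple3   = match(raw$kw3, lev)
)

# The original keyword can be recovered from a code via lev.
lev[m[, "tuple1"]]
```

Keeping 'lev' alongside the matrix is what lets you translate the codes
back to keywords later.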

Then, as discussed previously, if you intend to parallelize, since your
individual records are small, consider how to batch them up for sending
to your worker/compute processes.
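
One way to batch the records, sketched under assumptions: the matrix
'm', the batch size and process_batch() are all placeholders, and
mclapply() from the 'parallel' package is just one possible dispatcher.
mc.cores is set to 1 here so the sketch runs anywhere; you would raise
it on a multi-core Unix-alike.

```r
library(parallel)

# Placeholder data in the four-column layout described above.
set.seed(1)
n <- 1000
m <- cbind(
  response = rnorm(n),
  tuple1   = sample(20, n, replace = TRUE),
  tuple2   = sample(20, n, replace = TRUE),
  tuple3   = sample(20, n, replace = TRUE)
)

# Split the row indices into batches of (at most) batch_size records.
batch_size <- 250
batch_id   <- ceiling(seq_len(n) / batch_size)
batches    <- split(seq_len(n), batch_id)

# Hypothetical per-batch work: here just a sum of the responses.
process_batch <- function(idx) sum(m[idx, "response"])

# One task per batch instead of one task per record.
res   <- mclapply(batches, process_batch, mc.cores = 1L)
total <- sum(unlist(res))
```

The point of the batching is that each worker call handles many rows,
so the per-task communication overhead is paid once per batch rather
than once per record.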

Regards,

   - Brian

-- 
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock