[R-sig-hpc] Handling data with thousands of variables

Brian G. Peterson brian at braverock.com
Sun Jun 26 16:06:20 CEST 2011


On Sun, 2011-06-26 at 14:09 +0200, Håvard Wahl Kongsgård wrote:
> Again, sorry about the bad example. The tuples are not the same length:
> some have 20 objects, others 150...

Some facts about your job:
~ 10 000 000 records
~     20 000 keywords
- each record consists of a combination of
  + a response variable and
  + a structured, string-based tuple of ~20-150 keywords

So, a few more questions, to avoid making further assumptions:

- are the response variables numeric? (integer or floating point?)
- does the order of the tuples matter?
- do you know all the possible keywords?
  (so that they could be encoded with numerical representations)
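
To illustrate what is meant by a numerical representation, here is a
minimal sketch, written in Python only for brevity (in R, match() or
factor() would do the same job). The sample records and the function
names build_vocabulary and encode are made up for illustration; they
are not taken from your data.

```python
# Hypothetical sample: each record is (response, tuple of keywords).
# Tuples may have different lengths, as in the real data.
records = [
    (1.5, ["alpha", "beta", "gamma"]),
    (0.2, ["beta", "delta"]),
]


def build_vocabulary(records):
    """Assign each distinct keyword an integer code, in first-seen order."""
    vocab = {}
    for _, keywords in records:
        for kw in keywords:
            if kw not in vocab:
                vocab[kw] = len(vocab)
    return vocab


def encode(records, vocab):
    """Replace every keyword string by its integer code, keeping tuple order."""
    return [(response, [vocab[kw] for kw in keywords])
            for response, keywords in records]


vocab = build_vocabulary(records)
encoded = encode(records, vocab)
print(vocab)
print(encoded)
```

With ~20 000 distinct keywords, every code fits comfortably in a
32-bit integer, which is generally much cheaper to store and compare
than the original strings. If the order within a tuple turns out not
to matter, the integer codes could also serve as column indices of a
sparse indicator matrix.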

Regards,

  - Brian
-- 
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock


