[R-sig-hpc] Handling data with thousands of variables

Brian G. Peterson brian at braverock.com
Sun Jun 26 12:12:29 CEST 2011


On Sun, 2011-06-26 at 09:07 +0200, Håvard Wahl Kongsgård wrote:
> In machine learning settings it's not uncommon that the data has
> thousands of variables. The same is also the case with genetic
> studies.
> 
> In R what is the best approach for handling such data? Any personal
> experience with handling such data in R?
> 
> For my case the raw data is a response variable and a unstructured
> tuple with string keywords.
> 
> 1341,{"Harry","Larry","Kline"}
> 54232,{"Mary","Kline","Larry"}
> 54232,{"David","Line","Lars"}

You haven't given us a lot of information to go on, and certainly
nothing reproducible.  My response necessarily makes several
assumptions. 

Thousands (or even millions) of records doesn't really seem like an
overly large amount of data to me.

If each of your rows above is a 'record', the first thing I would do is
split the tuple into separate fields and encode the string keywords as
factors that share a common set of levels.  Processing factor levels
should generally be faster than processing strings.  If the length of
the tuple varies, then it may be useful to use a list with slots
$response and $tuple.  If your tuples are immutable and can take only a
finite set of combinations, then perhaps the factors could be
constructed from the entire tuple, rather than from the tuple elements.
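
As a rough sketch of that idea, assuming the raw lines look exactly
like the three quoted above (an integer response, a comma, then a braced
list of double-quoted keywords), something along these lines would work
in base R.  The object names (raw, records, whole) are made up for
illustration, and the parsing would need adjusting for any other input
format.

```r
# Raw lines in the format shown in the question (assumed, not confirmed).
raw <- c('1341,{"Harry","Larry","Kline"}',
         '54232,{"Mary","Kline","Larry"}',
         '54232,{"David","Line","Lars"}')

# Response: everything before the first comma.
response <- as.integer(sub(",.*$", "", raw))

# Tuple: pull out each double-quoted keyword, then strip the quotes.
keywords <- regmatches(raw, gregexpr('"[^"]*"', raw))
keywords <- lapply(keywords, function(k) gsub('"', "", k, fixed = TRUE))

# One shared set of levels, so a given keyword maps to the same
# integer code in every record.
lev <- sort(unique(unlist(keywords)))
tuples <- lapply(keywords, factor, levels = lev)

# A list with slots $response and $tuple also copes with tuples of
# differing lengths.
records <- list(response = response, tuple = tuples)

# Alternative: one factor level per whole tuple.  Keyword order matters
# here; sort each tuple first if it should not.
whole <- factor(vapply(keywords, paste, character(1), collapse = "|"))
```

as.integer(records$tuple[[1]]) then gives the integer codes for the
first record, which are cheaper to compare and tabulate than the
strings themselves.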

If you intend to parallelize some part of your calculation (and that is
why you are writing to the High Performance Computing list), then you
should consider how to chunk up groups of records to send to each
computational worker or node to avoid interprocess communication
overhead.
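
One possible way to do that chunking, sketched with the 'parallel'
package: split the records into one block per worker up front, so that
each worker receives its data once and returns one result, rather than
paying a round trip per record.  The data and process_record() below are
placeholders, not anything from your problem, and the worker count of 2
is arbitrary.

```r
library(parallel)

# Placeholder data: 10 records, each a small vector standing in for a tuple.
records <- lapply(1:10, function(i) seq_len(i))

# Placeholder per-record computation (here: the number of elements).
process_record <- function(rec) length(rec)

# mclapply() forks, which is not available on Windows; fall back to one core.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

# splitIndices() divides 1..n into contiguous blocks, one per chunk.
chunks <- splitIndices(length(records), 2L)
record_chunks <- lapply(chunks, function(idx) records[idx])

# Each task handles a whole chunk and returns a single vector.
res <- mclapply(record_chunks,
                function(chunk) vapply(chunk, process_record, integer(1)),
                mc.cores = n_workers)

# Chunks come back in order, so unlisting restores the record order.
result <- unlist(res)
```

On a multi-node cluster the same chunk list could be handed to
parLapply() or a similar function instead; the point is only that the
unit of work sent to a worker is a group of records, not a single one.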

It is quite possible that my assumptions are incorrect for your problem.
In follow-up emails, perhaps you could consult the posting guide and
provide more details and clarification.

Regards,

   - Brian

-- 
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock


