[R-sig-hpc] Handling data with thousands of variables

Håvard Wahl Kongsgård haavard.kongsgaard at gmail.com
Sun Jun 26 12:52:16 CEST 2011


Thanks, I am not very good at describing issues. Your assumptions are
correct; however, in my case I have 20 000 different keywords and
10 000 000 records, so it's a big data set.

I don't use R for programming, but are slots in R more or less like a
dictionary class in Python?

-Håvard

2011/6/26 Brian G. Peterson <brian at braverock.com>:
> On Sun, 2011-06-26 at 09:07 +0200, Håvard Wahl Kongsgård wrote:
>> In machine learning settings it's not uncommon that the data has
>> thousands of variables. The same is also the case with genetic
>> studies.
>>
>> In R what is the best approach for handling such data? Any personal
>> experience with handling such data in R?
>>
>> For my case the raw data is a response variable and a unstructured
>> tuple with string keywords.
>>
>> 1341,{"Harry","Larry","Kline"}
>> 54232,{"Mary","Kline","Larry"}
>> 54232,{"David","Line","Lars"}
>
> You haven't given us a lot of information to go on, and certainly
> nothing reproducible.  My response necessarily makes several
> assumptions.
>
> Thousands (or even millions) of records doesn't really seem like an
> overly large amount of data to me.
>
> If each of your rows above is a 'record', the first thing I would do is
> split the tuple into separate fields, using common factor levels for the
> string keywords.  Processing factor levels should generally be faster
> than processing strings.  If the length of the tuple is also variable,
> then it may be useful to use a list with slots $response and $tuple.  If
> your tuples are immutable and can only take some finite set of
> combinations, then perhaps the factors could be constructed from the
> entire tuple rather than from the individual tuple elements.
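
A rough sketch of this approach, using the three example records above (the
object names are only illustrative, and it is assumed the keywords come in as
character vectors):

    ## Toy version of the raw data: a response plus a variable-length
    ## keyword tuple per record
    responses <- c(1341, 54232, 54232)
    tuples <- list(c("Harry", "Larry", "Kline"),
                   c("Mary", "Kline", "Larry"),
                   c("David", "Line", "Lars"))

    ## One set of factor levels shared across all records, so each keyword
    ## is stored as an integer code instead of a repeated string
    all_keywords <- sort(unique(unlist(tuples)))
    tuples_f <- lapply(tuples, factor, levels = all_keywords)

    ## One list element per record, with $response and $tuple as suggested
    records <- Map(function(y, x) list(response = y, tuple = x),
                   responses, tuples_f)
    records[[1]]$tuple    # factor with the common keyword levels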
>
> If you intend to parallelize some part of your calculation (and that is
> why you are writing to the High Performance Computing list), then you
> should consider how to chunk up groups of records to send to each
> computational worker or node to avoid interprocess communication
> overhead.
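
For example, continuing the toy records object from the sketch above, the
chunking could look roughly like this with the snow package (the worker count
and the per-record computation are placeholders):

    library(snow)

    ## Start a socket cluster; 4 workers here is just an example
    cl <- makeCluster(4, type = "SOCK")

    ## Split the record indices into one contiguous chunk per worker, so each
    ## worker receives a few large tasks instead of millions of tiny ones
    chunks <- clusterSplit(cl, seq_along(records))

    ## Ship the records to the workers and process each chunk in one call;
    ## counting keywords per record stands in for the real computation
    clusterExport(cl, "records")
    res <- parLapply(cl, chunks, function(idx)
        sapply(records[idx], function(r) length(r$tuple)))

    stopCluster(cl)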
>
> It is not unlikely that my assumptions are incorrect for your problem.
> In follow-up emails perhaps you could consult the posting guide and
> provide more details and clarification.
>
> Regards,
>
>   - Brian
>
> --
> Brian G. Peterson
> http://braverock.com/brian/
> Ph: 773-459-4973
> IM: bgpbraverock
>
>


