Off topic -- large data sets. Was RE: [R] 64 Bit R Background Question

Naji nassar at noos.fr
Tue Feb 15 09:24:00 CET 2005


Hi,


Although I agree those cases are relatively rare in STATISTICAL analysis, you
can encounter them in simulation work (e.g. natural cataclysms, where a 5 meter
difference in the topography can change all the simulations).
Two ideas (in addition to loading several sections):
1- search for duplicate cases and estimate your model with a frequency-weighted
scheme; perhaps you don't have 200 million different 'cases'.
2- take into account the precision/accuracy of your data, your model and the
algorithm used (i.e. no need to use 1 million cases to reach a precision
close to .001 if the gradient, or any other function used, has only
.01 accuracy) ...
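
To illustrate idea 1, here is a minimal sketch (in Python, only to keep it
self-contained). The function names and the toy data are made up for the
example; it simply shows that a statistic computed on the collapsed
(case, frequency) table matches the one computed on the full data.

```python
from collections import Counter

def collapse_duplicates(rows):
    """Collapse identical rows into (row, frequency) pairs.

    Rows must be hashable (e.g. tuples).
    """
    return list(Counter(rows).items())

def weighted_mean(pairs, column):
    """Mean of one column, computed from (row, frequency) pairs."""
    total_weight = sum(freq for _, freq in pairs)
    weighted_sum = sum(row[column] * freq for row, freq in pairs)
    return weighted_sum / total_weight

# Hypothetical toy data: six cases, only three distinct ones.
rows = [(1, 2.0), (1, 2.0), (1, 2.0), (2, 5.0), (2, 5.0), (3, 8.0)]

pairs = collapse_duplicates(rows)                  # three (row, frequency) pairs
full_mean = sum(r[1] for r in rows) / len(rows)    # mean over all six cases
collapsed_mean = weighted_mean(pairs, column=1)    # same mean from three pairs
```

The same idea carries over to model fitting when the routine accepts case
weights: you fit on the distinct cases and pass the frequencies as weights.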

Best regards
Naji

On 14/02/05 18:41, "Berton Gunter" <gunter.berton at gene.com> wrote:

> 
>>> read all 200 million rows a pipe dream no matter what
>>> platform I'm using?
>> 
>> In principle R can handle this with enough memory. However,
>> 200 million 
>> rows and three columns is 4.8Gb of storage, and R usually needs a few
>> times the size of the data for working space.
>> 
>> You would likely be better off not reading the whole data set
>> at once, but 
>> loading sections of it from Oracle as needed.
>> 
>> 
>> -thomas
>> 
> 
> Thomas's comment raises a question:
> 
> Can someone give me an example (perhaps in a private response, since I'm off
> topic here) where one actually needs all cases in a large data set ("large"
> being > 1e6, say) to do a STATISTICAL analysis? By "statistical" I exclude,
> say searching for some particular characteristic like an adverse event in a
> medical or customer repair database, etc. Maybe a definition of
> "statistical" is: anything that cannot be routinely done in a single pass
> database query.
> 
> The reason I ask this is that it seems to me that with millions of cases,
> (careful, perhaps stratified or in some other not completely at random way)
> sampling should always suffice to reduce a dataset to manageable size
> sufficient for the data analysis needs at hand. But my ignorance and naivete
> probably show here.
> 
> Thanks.
> 
> -- Bert
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>



