[Rd] arbitrary size data frame or other structs, curious about issues involved.

Simon Urbanek simon.urbanek at r-project.org
Tue Jun 21 18:26:12 CEST 2011


Mike,

this is all nice, but AFAICS the first part misses the point that there is no 64-bit integer type in the API, so there is simply no alternative at the moment. You just said that you don't like it, but you failed to provide a solution ... As for the second part, the idea is not new and is noble, but AFAIK no one has so far been able to draft a good proposal for what the API would look like. It would be very desirable if someone did, though. (BTW, your link is useless: linking Google searches is pointless, as the results vary by request location, user settings, etc.)
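
(For concreteness, a small sketch of that constraint as seen from the R side; the exact warning text varies by R version:)

    .Machine$integer.max       # 2147483647, i.e. 2^31 - 1: R's only integer type is 32-bit
    .Machine$integer.max + 1L  # NA, with an integer-overflow warning
    as.integer(2^31)           # NA again: 2^31 does not fit in a 32-bit int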

Cheers,
Simon


On Jun 21, 2011, at 6:33 AM, Mike Marchywka wrote:

> Thanks,
> 
> http://cran.r-project.org/doc/manuals/R-ints.html#Future-directions
> 
> Normally I'd take more time to digest these things before commenting, but
> a few things struck me right away. First, using floating point or double
> as a replacement for int strikes me as "going the wrong way": to get
> predictable performance you often try to tell the compiler you have
> ints rather than a floating type, which it is free to "round". This
> is even ignoring any performance issue. The other thing is that scaling
> should not just be a matter of "make everything bigger", as the growth in
> data needs and in computer resources is not uniform.
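> 
> (For reference: a double represents every integer exactly up to 2^53, so
> double-based indexing stays exact well past 2^31; a quick check in R:)
> 
>     .Machine$integer.max  # 2147483647: the 32-bit ceiling in question
>     (2^53 - 1) == 2^53    # FALSE: consecutive integers still distinct
>     2^53 == 2^53 + 1      # TRUE: beyond 2^53 they begin to collide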
> 
> I guess my first thought on these constraints and resource issues
> is to consider a paged data frame, depending on the point at which
> the 32-bit int constraint is imposed. A random-access data structure
> does not always get accessed randomly; often access is purely sequential.
> Further down the road, it would be nice if algorithms were implemented in a
> block mode, or could communicate their access patterns to the data structure,
> or at least tell it to prefetch things that will be needed soon.
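> 
> (One way to sketch that block mode in plain R, assuming a large
> whitespace-separated numeric table on disk; the function name is made
> up for illustration:)
> 
>     col1_mean <- function(path, chunk = 100000L) {
>       con <- file(path, open = "r")
>       on.exit(close(con))
>       total <- 0; n <- 0
>       repeat {  # page through the file 'chunk' rows at a time
>         block <- tryCatch(read.table(con, nrows = chunk),
>                           error = function(e) NULL)  # NULL at end of file
>         if (is.null(block) || nrow(block) == 0L) break
>         total <- total + sum(block[[1]])
>         n <- n + nrow(block)
>       }
>       total / n  # mean of column 1 without ever holding the whole file
>     }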
> 
> I guess I'm thinking mostly along the lines of things I've seen from Intel,
> such as (the first things I could find on their site, as I have not looked
> in detail in quite a while):
> 
> http://www.google.com/search?hl=en&source=hp&q=site%3Aintel.com+performance+optimization
> 
> as once you get past thrashing virtual memory, you'd like to preserve the
> lower-level memory cache hit rates too. These are probably not just niceties,
> at least with VM, as I've personally seen implementation-related speed issues make simple analyses impractical.
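> 
> (The column-major layout of R matrices makes the cache point easy to
> see; timings vary by machine, but the gap is usually clear:)
> 
>     m <- matrix(0, nrow = 4000, ncol = 4000)
>     system.time(for (j in seq_len(ncol(m))) sum(m[, j]))  # contiguous columns
>     system.time(for (i in seq_len(nrow(m))) sum(m[i, ]))  # strided rows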
> 
>> Subject: RE: arbitrary size data frame or other structs, curious about issues involved.
>> From: jayemerson at gmail.com
>> To: marchywka at hotmail.com; r-devel at r-project.org
>> 
>> Mike,
>> 
>> 
>> Neither bigmemory nor ff is a "drop-in" solution -- though useful,
>> they are primarily for data storage and management, allowing
>> convenient access to subsets of the data.  Direct analysis of the full
>> objects via most R functions is not possible.  There are many issues
>> that could be discussed here (and have been, previously), including the
>> use of 32-bit integer indexing.  There is a nice section, "Future
>> Directions", in the R Internals manual that you might want to look at.
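>> 
>> (For example, assuming the bigmemory package: a file-backed big.matrix
>> stays on disk, and only the subsets you index become ordinary R
>> objects that base functions can work on.)
>> 
>>     library(bigmemory)
>>     x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
>>                                backingfile = "demo.bin",
>>                                descriptorfile = "demo.desc")
>>     x[1:5, 1] <- rnorm(5)  # writes go to the file-backed store
>>     mean(x[, 1])           # mean() sees only the extracted column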
>> 
>> Jay
>> 
>> 
>> -------------------------------------  Original message:
>> 
>> We keep getting questions on r-help about memory limits, and
>> I was curious what issues are involved in making
>> common classes like data.frame work with disk and intelligent
>> swapping. That is, sure, you can always rely on the OS for VM,
>> but in theory it should be possible to make a data structure
>> that somehow knows what pieces you will access next and
>> can keep those somewhere fast. Now of course algorithms
>> "should" act locally and be block oriented, but in any case they
>> could communicate upcoming access patterns to the data structures,
>> seeing a few ms into the future and having the
>> right stuff prefetched.
>> 
>> I think things like "bigmemory" exist, but perhaps one
>> issue was that they could not simply drop in for data.frame;
>> or do they already solve all the problems?
>> 
>> Is memory management just a non-issue, or is there something
>> that needs to be done to make large data structures work well?
>> 
>> 
>> -- 
>> John W. Emerson (Jay)
>> Associate Professor of Statistics
>> Department of Statistics
>> Yale University
>> http://www.stat.yale.edu/~jay