[R-SIG-Finance] R Memory Usage

Fri Apr 15 16:36:54 CEST 2011

Since we seem to be top-posting for this thread, I'll continue (ick).

Typically, we store daily, minute, or tick data as xts objects, one per
instrument, much as you would get from getSymbols in quantmod.  (We've
written getSymbols methods, in the FinancialInstrument package, that are
more amenable to disk-based persistent storage of tick data.)

When necessary, we subset, align, and cbind this data into multi-column
xts objects.  About the only thing that is convenient in a data.frame
and inconvenient in xts is data that mixes numeric and text columns.  I
typically still use xts for these if they will be large objects, but you
then have to be aware that (as with a matrix) all your numeric data will
be stored as character data, and you'll need to use as.numeric to
convert it back.  If you only have numeric data, this proviso does not
apply.  You'll find that merge, cbind, and rbind on xts are massively
more efficient than the data.frame equivalents.
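
A minimal sketch of the align-and-bind step and of the character caveat,
assuming the xts package is installed.  The tickers AAA and BBB and all
of the values are made up for illustration.

```r
library(xts)

# Hypothetical daily closes for two made-up tickers with different date ranges.
dates_a <- as.Date("2011-01-03") + 0:4
dates_b <- as.Date("2011-01-05") + 0:4
AAA <- xts(c(10.0, 10.2, 10.1, 10.4, 10.3), order.by = dates_a)
BBB <- xts(c(20.0, 19.8, 20.1, 20.5, 20.2), order.by = dates_b)

# merge() aligns on the time index; join = "inner" keeps only shared dates.
closes <- merge(AAA, BBB, join = "inner")
colnames(closes) <- c("AAA", "BBB")

# Mixed data: one text column forces the whole core to character,
# exactly as it would in a plain matrix.
mixed <- xts(cbind(price = c("10.0", "10.2"), venue = c("X", "Y")),
             order.by = dates_a[1:2])

# Convert the numeric column back before doing arithmetic on it.
prices <- as.numeric(coredata(mixed[, "price"]))
```

With join = "outer" (the default), merge would instead keep every date
and fill the gaps with NA.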

With your example of 800 stocks, I would likely store each stock as a
separate xts object, and subset and bind as necessary for your analysis.
I would probably keep them in an environment, to avoid cluttering the
.GlobalEnv and to make it easier to save/load all your data at once.
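
Roughly, the one-object-per-stock layout could look like the sketch
below.  The environment name "stocks", the three tickers, and the random
returns are all hypothetical; only base R and xts calls are used.

```r
library(xts)

# One environment holds every per-stock xts object.
stocks <- new.env()

set.seed(1)
dates <- as.Date("2011-01-03") + 0:9
for (sym in c("AAA", "BBB", "CCC")) {
  # Made-up daily returns standing in for real data.
  assign(sym, xts(rnorm(length(dates)), order.by = dates), envir = stocks)
}

# Subset the symbols and date window you need, then bind into one object.
wanted <- c("AAA", "CCC")
pieces <- lapply(mget(wanted, envir = stocks),
                 function(x) x["2011-01-05/2011-01-08"])
panel <- do.call(merge, pieces)
colnames(panel) <- wanted

# Save everything in the environment at once, and load it back later.
f <- tempfile(fileext = ".RData")
save(list = ls(stocks), envir = stocks, file = f)
restored <- new.env()
load(f, envir = restored)
```

Only the symbols and dates you ask for are ever combined, so the large
multi-column object exists just for the duration of the analysis.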

We avoid data.frame for any object that doesn't absolutely require the 
mixed types and factor support of data.frame.  It's too inefficient in 
memory and speed for truly large data.

Regards,

   - Brian

-- 
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock

On 04/15/2011 09:19 AM, Elliot Joel Bernstein wrote:
> Jeff -
>
> Thanks for your feedback. I was attempting to use data frame, and that -- specifically the use of the 'merge' function -- seemed to be the root of the problem. I read the xts vignette, and it looks interesting, but it's not clear how I should use it for my data. The example in the vignette (using 'sample_matrix') seems to have several variables ('Open', 'Close', etc.) measured over time for a single stock. How would you handle multiple variables measured on multiple stocks over time? Ideally I think I would like to have multiple matrices contained in the xts object, one for each variable, with rows indexing time and columns indexing stocks (or a 3-D array, with the third dimension indexing the variable).
>
> Thanks.
>
> - Elliot
>
> On Sun, Apr 10, 2011 at 02:14:42PM -0500, Jeffrey Ryan wrote:
>> Elliot,
>>
>> One of the advantages to posting to the finance list is that those of
>> us who work around large data in finance can comment on tools that you
>> use as well.
>>
>> One thing you didn't mention specifically was which packages you are
>> using and maybe examples of specific code you are calling.
>>
>> Within financial time-series, one of the most optimized tools is xts -
>> precisely for the reason of memory management and optimizations for
>> large data.  Using something ad-hoc, for example strings and
>> data.frames, would cause tremendous issues.
>>
>> Another issue would be whether or not you need the full data resident
>> in memory at all times.  R's rds format, or a database, or use of
>> out-of-core objects such as with mmap or indexing, can greatly
>> improve things.
>>
>> If you are able to come to the R/Finance conference in Chicago on the
>> 29th and 30th of this month, you'll have a chance to talk to some of
>> those 'in the trenches' with respect to using R on big data.  And as
>> you point out (as well as Brian) - 800x3000 isn't very large, so your
>> case isn't unique.
>>
>> Would be great to see you later this month in Chicago!  www.RinFinance.com
>>
>> Best,
>> Jeff
>>
>>
>>
>> On Sun, Apr 10, 2011 at 10:49 AM, Elliot Joel Bernstein
>> <elliot.bernstein at fdopartners.com>  wrote:
>>> This is not specifically a finance question, but I'm working with financial
>>> data (daily stock returns), and I suspect many people using R for financial
>>> analysis face similar issues. The basic problem I'm having is that with a
>>> moderately large data set (800 stocks x 11 years), performing a few
>>> operations such as data transformations, fitting regressions, etc., results
>>> in R using an enormous amount of memory -- sometimes upwards of 5GB -- even
>>> after using gc() to try and free some memory up. I've read several posts to
>>> various R mailing lists over the years indicating that R does not release
>>> memory back to the system on certain OSs (64 bit Linux in my case), so I
>>> understand that this is "normal" behavior for R. How do people typically
>>> work around this to do exploratory analysis on large data sets without
>>> having to constantly restart R to free up memory?
>>>
>>> Thanks.
>>>
>>> - Elliot Joel Bernstein
>>>
>>> _______________________________________________
>>> R-SIG-Finance at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-finance
>>> -- Subscriber-posting only. If you want to post, subscribe first.
>>> -- Also note that this is not the r-help list where general R questions should go.
>>>
>>
>>
>>
>> --
>> Jeffrey Ryan
>> jeffrey.ryan at lemnica.com
>>
>> www.lemnica.com
>>
>> R/Finance 2011 April 29th and 30th in Chicago | www.RinFinance.com
>