[R-SIG-Finance] Mode list to mode numerical.... fast..

Dirk Eddelbuettel edd at debian.org
Thu May 1 16:26:49 CEST 2014


On 1 May 2014 at 09:04, Joshua Ulrich wrote:
| On Thu, May 1, 2014 at 8:47 AM, Steve Greiner <sgreiner at factset.com> wrote:
| > Okay, I've had it!!!..   Every time I read in a dataset using something like:
| > returnmatrix = read.csv("S&P.csv", header=TRUE, sep=",")
| >
| > It comes back with "returnmatrix" as mode list.   How can I quickly convert the dataset to mode numeric?   This is pissing me off.  I can do it manually by creating a new matrix and assigning the values of the list to the numeric matrix element by element, but it's time consuming.  What can anybody recommend?
| 
| ?read.csv says it returns a data.frame (which is a list with some
| specific attributes).  If you want to convert it to a matrix, just
| use:
| returnmatrix = as.matrix(read.csv("S&P.csv", header=TRUE, sep=","))
| 
| You don't say exactly what data "S&P.csv" contains... but if it's a
| large matrix, then you can get some fairly substantial performance
| improvement by following the advice in the "Memory usage" section of
| ?read.csv, which says:
| 
| 'read.table' is not the right tool for reading large matrices,
| especially those with many columns: it is designed to read _data
| frames_ which may have columns of very different classes.  Use
| 'scan' instead for matrices.
| 
| So you could try something like:
| 
| column_names = scan("S&P.csv", nlines=1, sep=",", what="")
| returnmatrix = matrix(scan("S&P.csv", skip=1, sep=","),
|   ncol=length(column_names), byrow=TRUE,
|   dimnames=list(NULL, column_names))
| 
| Note nlines=1 rather than n=1 (n=1 would read only the first column
| name), and byrow=TRUE in the matrix() call, since scan returns the
| values row by row while matrix() fills column by column by default.
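
For concreteness, here is a minimal sketch of both routes from the quoted
advice. The temporary file and its three columns (AAA, BBB, CCC) are made
up and merely stand in for "S&P.csv"; it assumes every column is numeric.

```r
# Hypothetical stand-in for "S&P.csv": a small all-numeric file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("AAA,BBB,CCC", "0.01,0.02,0.03", "-0.01,0.00,0.02"), csv_file)

# Route 1: read.csv returns a data.frame, which is a list of columns
df <- read.csv(csv_file, header = TRUE)
mode(df)   # "list"
m1 <- as.matrix(df)
mode(m1)   # "numeric", as long as every column is numeric

# Route 2: scan the header line, then the numbers, and fill by row
column_names <- scan(csv_file, nlines = 1, sep = ",", what = "", quiet = TRUE)
m2 <- matrix(scan(csv_file, skip = 1, sep = ",", quiet = TRUE),
             ncol = length(column_names), byrow = TRUE,
             dimnames = list(NULL, column_names))
```

If any column holds text (dates, tickers), as.matrix gives a character
matrix and the second scan call fails, so drop or convert such columns
first.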

All very good points. But if your file is large (not uncommon in finance),
consider alternatives such as fread in the data.table package, direct
connections to the underlying server or service, or batch jobs that parse
the data once and then store it in binary files (R's RDS format is good).
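
A rough sketch of the parse-once idea, under a few assumptions: the file
names and the two columns are invented for illustration, and fread is only
used when data.table happens to be installed (read.csv is the fallback).

```r
# Hypothetical input and output paths; replace with your own
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
writeLines(c("AAA,BBB", "0.01,0.02", "-0.01,0.03"), csv_file)

# Batch step, run once: parse the text file ...
if (requireNamespace("data.table", quietly = TRUE)) {
  returns <- as.matrix(data.table::fread(csv_file))
} else {
  returns <- as.matrix(read.csv(csv_file, header = TRUE))
}
# ... and store the parsed matrix in binary form
saveRDS(returns, rds_file)

# Later sessions: load the binary file, no text parsing needed
returns_again <- readRDS(rds_file)
```

The object comes back from readRDS as the same numeric matrix, so
downstream code never touches the csv again.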

To me, csv files are a last resort, used chiefly for one-off
explorations. For "production" use one can do much better.

Dirk


-- 
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com


