[Rd] loading multiple CSV files into a single data frame

Simon Urbanek simon.urbanek at r-project.org
Fri May 4 00:09:36 CEST 2012


On May 3, 2012, at 5:40 PM, victor jimenez wrote:

> First of all, thank you for the answers. I did not know about zoo. However,
> it seems that none of the approaches can do exactly what I want (please
> correct me if I am wrong).
> 
> Probably, it was not clear in my original question. The CSV files only
> contain the performance values. The other two columns (ASSOC and SIZE) are
> obtained from the existing values in the directory tree. So, in my opinion,
> none of the proposed solutions would work, unless every single "data.csv"
> file contained all the three columns (ASSOC, SIZE and PERF).
> 
> In my case, my experimentation framework basically outputs a CSV with some
> values read from the processor's performance counters (PMCs). For each
> cache size and associativity I conduct an experiment, creating a CSV file,
> and placing that file into its own directory. I could modify the
> experimentation framework, so that it also outputs the cache size and
> associativity, but that may not be ideal in some circumstances and I also
> have a significant amount of old results and I want to keep using them without
> manually fixing the CSV files.
> 

You don't need to touch the CSV files; simply add the values at load time. Split each file path on "/" and cbind the components to the data as it is read. This is all easily doable in one line ;)

> do.call("rbind",lapply(Sys.glob("*/*/data.csv"),function(d) cbind(read.csv(d),as.data.frame(t(strsplit(d,"/")[[1]])))))
  A B V1 V2       V3
1 1 2  1  a data.csv
2 3 4  1  a data.csv
3 1 2  1  b data.csv
4 3 4  1  b data.csv
5 1 2  2  a data.csv
6 3 4  2  a data.csv
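
In the output above, V1 and V2 carry the two directory names and V3 is just the file name. Below is a minimal sketch of the same idea with named columns. The helper name load_tagged, the throwaway directory tree and the PERF values are made up for illustration, and the sketch assumes exactly two directory levels (ASSOC, then SIZE) above each data.csv.

```r
# Build a small throwaway tree: <root>/<ASSOC>/<SIZE>/data.csv (made-up values).
root <- file.path(tempdir(), "results")
for (assoc in c("direct-mapped", "4-way")) {
  for (size in c(2, 4)) {
    d <- file.path(root, assoc, as.character(size))
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
    write.csv(data.frame(PERF = size / 10),
              file.path(d, "data.csv"), row.names = FALSE)
  }
}

# Hypothetical helper: read every data.csv under root and tag its rows
# with the names of the two directories directly above the file.
load_tagged <- function(root) {
  files <- Sys.glob(file.path(root, "*", "*", "data.csv"))
  pieces <- lapply(files, function(f) {
    size_dir <- dirname(f)                       # .../<ASSOC>/<SIZE>
    data.frame(
      ASSOC = basename(dirname(size_dir)),       # grandparent directory name
      SIZE  = as.numeric(basename(size_dir)),    # parent directory name
      read.csv(f),                               # columns from the file itself
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

dframe <- load_tagged(root)
print(dframe)
```

Taking the names with dirname()/basename() rather than strsplit() keeps the result independent of how deep the root directory itself is.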


> Has anyone else faced such a situation? Any good solutions?
> 
> Thank you,
> Victor
> 
> On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
> 
>> On Thu, May 3, 2012 at 2:07 PM, victor jimenez <betabandido at gmail.com>
>> wrote:
>>> Sometimes I have hundreds of CSV files scattered in a directory tree,
>>> resulting from experiments' executions. For instance, giving an example
>>> from my field, I may want to collect the performance of a processor for
>>> several design parameters such as "cache size" (possible values: 2, 4, 8
>>> and 16) and "cache associativity" (possible values: direct-mapped, 4-way,
>>> fully-associative). The results of all these experiments will be stored in
>>> a directory tree like:
>>> 
>>> results
>>> |-- direct-mapped
>>> |       |-- 2 -- data.csv
>>> |       |-- 4 -- data.csv
>>> |       |-- 8 -- data.csv
>>> |       |-- 16 -- data.csv
>>> |-- 4-way
>>> |       |-- 2 -- data.csv
>>> |       |-- 4 -- data.csv
>>> ...
>>> |-- fully-associative
>>> |       |-- 2 -- data.csv
>>> |       |-- 4 -- data.csv
>>> ...
>>> 
>>> I am developing a package that would allow me to gather all those CSV files
>>> into a single data frame. Currently, I just need to execute the following
>>> statement:
>>> 
>>> dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")
>>> 
>>> and this command returns a data frame containing the columns ASSOC, SIZE
>>> and all the remaining columns inside the CSV files (in my case the
>>> processor performance), effectively loading all the CSV files into a single
>>> data frame. So, I would get something like:
>>> 
>>> ASSOC,          SIZE,  PERF
>>> direct-mapped,     2,   1.4
>>> direct-mapped,     4,   1.6
>>> direct-mapped,     8,   1.7
>>> direct-mapped,    16,   1.7
>>> 4-way,             2,   1.4
>>> 4-way,             4,   1.5
>>> ...
>>> 
>>> I would like to ask whether there is any similar functionality already
>>> implemented in R. If so, there is no need to reinvent the wheel :)
>>> If it is not implemented and the R community believes that this feature
>>> would be useful, I would be glad to contribute my code.
>>> 
>> 
>> If your csv files all have the same columns and represent time series
>> then read.zoo in the zoo package can read multiple csv files in at
>> once using a single read.zoo command producing a single zoo object.
>> 
>> library(zoo)
>> ?read.zoo
>> vignette("zoo-read")
>> 
>> Also see the other zoo vignettes and help files.
>> 
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>> 
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 


