[Rd] loading multiple CSV files into a single data frame
Simon Urbanek
simon.urbanek at r-project.org
Fri May 4 00:09:36 CEST 2012
On May 3, 2012, at 5:40 PM, victor jimenez wrote:
> First of all, thank you for the answers. I did not know about zoo. However,
> it seems that none of the approaches can do exactly what I want (please
> correct me if I am wrong).
>
> Probably, it was not clear in my original question. The CSV files only
> contain the performance values. The other two columns (ASSOC and SIZE) are
> obtained from the existing values in the directory tree. So, in my opinion,
> none of the proposed solutions would work, unless every single "data.csv"
> file contained all the three columns (ASSOC, SIZE and PERF).
>
> In my case, my experimentation framework basically outputs a CSV with some
> values read from the processor's performance counters (PMCs). For each
> cache size and associativity I conduct an experiment, creating a CSV file,
> and placing that file into its own directory. I could modify the
> experimentation framework, so that it also outputs the cache size and
> associativity, but that may not be ideal in some circumstances and I also
> have a significant amount of old results and I want to keep using them without
> manually fixing the CSV files.
>
You don't need to touch the CSV files; simply add the values at load time. This is all easily doable in one line ;)
> do.call("rbind",lapply(Sys.glob("*/*/data.csv"),function(d) cbind(read.csv(d),as.data.frame(t(strsplit(d,"/")[[1]])))))
  A B V1 V2       V3
1 1 2  1  a data.csv
2 3 4  1  a data.csv
3 1 2  1  b data.csv
4 3 4  1  b data.csv
5 1 2  2  a data.csv
6 3 4  2  a data.csv
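
For readability, the same idea can be spelled out with named columns. This is only a sketch under some assumptions: the helper names load_tagged and load_all are made up for illustration, the column names ASSOC and SIZE are taken from the question, and every path is assumed to end in ASSOC/SIZE/data.csv.

```r
# Hypothetical helper: read one CSV file and tag each of its rows with the
# two directory names just above it (assumed layout: ASSOC/SIZE/data.csv).
load_tagged <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  n <- length(parts)
  perf <- read.csv(path)
  data.frame(ASSOC = rep(parts[n - 2], nrow(perf)),
             SIZE  = rep(parts[n - 1], nrow(perf)),
             perf,
             stringsAsFactors = FALSE)
}

# Hypothetical wrapper: find every data.csv two levels below the working
# directory and stack the tagged pieces into one data frame.
load_all <- function(pattern = "*/*/data.csv") {
  files <- Sys.glob(pattern)
  do.call(rbind, lapply(files, load_tagged))
}
```

Calling load_all() from inside the results directory should then give one row per CSV row, with ASSOC and SIZE in front of the columns read from the file. SIZE is left as a character column here; converting it to a number is a separate step.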
> Has anyone else faced such a situation? Any good solutions?
>
> Thank you,
> Victor
>
> On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>
>> On Thu, May 3, 2012 at 2:07 PM, victor jimenez <betabandido at gmail.com>
>> wrote:
>>> Sometimes I have hundreds of CSV files scattered in a directory tree,
>>> resulting from experiments' executions. For instance, giving an example
>>> from my field, I may want to collect the performance of a processor for
>>> several design parameters such as "cache size" (possible values: 2, 4, 8
>>> and 16) and "cache associativity" (possible values: direct-mapped, 4-way,
>>> fully-associative). The results of all these experiments will be stored in
>>> a directory tree like:
>>>
>>> results
>>>   |-- direct-mapped
>>>   |     |-- 2 -- data.csv
>>>   |     |-- 4 -- data.csv
>>>   |     |-- 8 -- data.csv
>>>   |     |-- 16 -- data.csv
>>>   |-- 4-way
>>>   |     |-- 2 -- data.csv
>>>   |     |-- 4 -- data.csv
>>>   ...
>>>   |-- fully-associative
>>>   |     |-- 2 -- data.csv
>>>   |     |-- 4 -- data.csv
>>>   ...
>>>
>>> I am developing a package that would allow me to gather all those CSV
>>> files into a single data frame. Currently, I just need to execute the
>>> following statement:
>>>
>>> dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")
>>>
>>> and this command returns a data frame containing the columns ASSOC, SIZE
>>> and all the remaining columns inside the CSV files (in my case the
>>> processor performance), effectively loading all the CSV files into a
>>> single data frame. So, I would get something like:
>>>
>>> ASSOC, SIZE, PERF
>>> direct-mapped, 2, 1.4
>>> direct-mapped, 4, 1.6
>>> direct-mapped, 8, 1.7
>>> direct-mapped, 16, 1.7
>>> 4-way, 2, 1.4
>>> 4-way, 4, 1.5
>>> ...
>>>
>>> I would like to ask whether there is any similar functionality already
>>> implemented in R. If so, there is no need to reinvent the wheel :)
>>> If it is not implemented and the R community believes that this feature
>>> would be useful, I would be glad to contribute my code.
>>>
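
The template-style call described above could be approximated in base R roughly as follows. This is a sketch and not the poster's package code: the function name gather_sketch is hypothetical, and it assumes that each @NAME@ placeholder occupies a whole path component and that the template contains at least one placeholder.

```r
# Hypothetical sketch of a gather()-like function (not the original package).
gather_sketch <- function(template) {
  # Split the template and find the components of the form @NAME@.
  tparts <- strsplit(template, "/", fixed = TRUE)[[1]]
  is_var <- grepl("^@[^@]+@$", tparts)
  var_names <- gsub("@", "", tparts[is_var], fixed = TRUE)

  # Replace each placeholder with "*" to obtain a glob pattern.
  glob <- paste(ifelse(is_var, "*", tparts), collapse = "/")

  pieces <- lapply(Sys.glob(glob), function(path) {
    pparts <- strsplit(path, "/", fixed = TRUE)[[1]]
    values <- read.csv(path)
    # Repeat each placeholder value once per row of the CSV.
    keys <- lapply(pparts[is_var], rep, times = nrow(values))
    names(keys) <- var_names
    cbind(as.data.frame(keys, stringsAsFactors = FALSE), values)
  })
  if (length(pieces) == 0) return(NULL)

  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  # Turn purely numeric placeholder columns (such as SIZE) into numbers.
  out[var_names] <- lapply(out[var_names], type.convert, as.is = TRUE)
  out
}
```

With the directory tree from the example, gather_sketch("results/@ASSOC@/@SIZE@/data.csv") would be expected to return the ASSOC and SIZE columns followed by whatever columns the CSV files contain.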
>>
>> If your csv files all have the same columns and represent time series
>> then read.zoo in the zoo package can read multiple csv files in at
>> once using a single read.zoo command producing a single zoo object.
>>
>> library(zoo)
>> ?read.zoo
>> vignette("zoo-read")
>>
>> Also see the other zoo vignettes and help files.
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel