[Rd] loading multiple CSV files into a single data frame

Cook, Malcolm MEC at stowers.org
Fri May 4 00:09:03 CEST 2012


Victor,

I understand your setup as follows:

	The first two columns of the desired combined data frame are the names of
the two directory levels directly above each csv file.

	The columns in all the data.csv files are the same, namely, there is only
one column, and it is named PERF.

If so, the following should work (on unix):

do.call(rbind, lapply(Sys.glob('results/*/*/data.csv'), function(path) {
  within(read.csv(path), {
    SIZE  <- basename(dirname(path))
    ASSOC <- basename(dirname(dirname(path)))
  })
}))
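
If the column order matters (ASSOC and SIZE first, then PERF), the same idea can be wrapped in a small function. This is only a sketch: the name gather_csv and its pattern argument are hypothetical, and it assumes every data.csv has a header line and at least one data row. SIZE is left as a character string taken from the directory name.

```r
# Hypothetical helper, not part of base R: stack every CSV matched by a glob
# pattern, prepending the two enclosing directory names as columns.
gather_csv <- function(pattern = "results/*/*/data.csv") {
  paths <- Sys.glob(pattern)
  frames <- lapply(paths, function(path) {
    perf <- read.csv(path)
    # data.frame() recycles the length-one ASSOC and SIZE values across
    # all rows of perf; this fails if perf has zero rows (assumed not to).
    data.frame(
      ASSOC = basename(dirname(dirname(path))),  # e.g. "direct-mapped"
      SIZE  = basename(dirname(path)),           # e.g. "2", kept as character
      perf,
      stringsAsFactors = FALSE
    )
  })
  # Stack the per-file data frames; returns NULL if no file matched.
  do.call(rbind, frames)
}
```

Calling dframe <- gather_csv("results/*/*/data.csv") from the directory that contains results should then give one row per line of each data.csv, with the columns ASSOC, SIZE and PERF in that order.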


On 5/3/12 4:40 PM, "victor jimenez" <betabandido at gmail.com> wrote:

>First of all, thank you for the answers. I did not know about zoo.
>However, it seems that none of the approaches can do exactly what I want
>(please correct me if I am wrong).
>
>My original question was probably not clear. The CSV files only
>contain the performance values. The other two columns (ASSOC and SIZE) are
>taken from the directory names in the tree. So, in my opinion, none of
>the proposed solutions would work unless every single "data.csv" file
>contained all three columns (ASSOC, SIZE and PERF).
>
>In my case, my experimentation framework basically outputs a CSV with some
>values read from the processor's performance counters (PMCs). For each
>cache size and associativity I conduct an experiment, creating a CSV file,
>and placing that file into its own directory. I could modify the
>experimentation framework so that it also outputs the cache size and
>associativity, but that may not be ideal in some circumstances. I also
>have a significant amount of old results, and I want to keep using them
>without manually fixing the CSV files.
>
>Has anyone else faced such a situation? Any good solutions?
>
>Thank you,
>Victor
>
>On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck
><ggrothendieck at gmail.com> wrote:
>
>> On Thu, May 3, 2012 at 2:07 PM, victor jimenez <betabandido at gmail.com>
>> wrote:
>> > Sometimes I have hundreds of CSV files scattered in a directory tree,
>> > resulting from experiments' executions. For instance, giving an
>> > example from my field, I may want to collect the performance of a
>> > processor for several design parameters such as "cache size" (possible
>> > values: 2, 4, 8 and 16) and "cache associativity" (possible values:
>> > direct-mapped, 4-way, fully-associative). The results of all these
>> > experiments will be stored in a directory tree like:
>> >
>> > results
>> >  |-- direct-mapped
>> >  |       |-- 2 -- data.csv
>> >  |       |-- 4 -- data.csv
>> >  |       |-- 8 -- data.csv
>> >  |       |-- 16 -- data.csv
>> >  |-- 4-way
>> >  |       |-- 2 -- data.csv
>> >  |       |-- 4 -- data.csv
>> > ...
>> >  |-- fully-associative
>> >  |       |-- 2 -- data.csv
>> >  |       |-- 4 -- data.csv
>> > ...
>> >
>> > I am developing a package that would allow me to gather all those CSV
>> > files into a single data frame. Currently, I just need to execute the
>> > following statement:
>> >
>> > dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")
>> >
>> > and this command returns a data frame containing the columns ASSOC,
>> > SIZE and all the remaining columns inside the CSV files (in my case the
>> > processor performance), effectively loading all the CSV files into a
>> > single data frame. So, I would get something like:
>> >
>> > ASSOC,         SIZE, PERF
>> > direct-mapped,    2,  1.4
>> > direct-mapped,    4,  1.6
>> > direct-mapped,    8,  1.7
>> > direct-mapped,   16,  1.7
>> > 4-way,            2,  1.4
>> > 4-way,            4,  1.5
>> > ...
>> >
>> > I would like to ask whether there is any similar functionality already
>> > implemented in R. If so, there is no need to reinvent the wheel :)
>> > If it is not implemented and the R community believes that this
>> > feature would be useful, I would be glad to contribute my code.
>> >
>>
>> If your csv files all have the same columns and represent time series
>> then read.zoo in the zoo package can read multiple csv files at
>> once using a single read.zoo command, producing a single zoo object.
>>
>> library(zoo)
>> ?read.zoo
>> vignette("zoo-read")
>>
>> Also see the other zoo vignettes and help files.
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>>


