[Rd] Importing csv files
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Thu Dec 23 17:43:07 CET 2004
Prof Brian Ripley wrote:
> I think we need to know what you mean by `large' and why read.table is
> not fast enough (and hence if some of the planned improvements might be
> all that is needed).
I was referring to the e-mail exchanges on r-help about read.table a few
weeks ago, and to a newer discussion the other day concerning RAM usage
and read.table not knowing the number of rows up front. I believe that
the posters provided some timings and examples.
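
To make the "number of rows up front" point concrete, here is a minimal
Python sketch of a pre-pass over a csv file. It is not part of dataload
or of any existing tool. The function names count_csv_records and
guess_col_classes are hypothetical, and the type guess is deliberately
crude (numeric versus character, based on a sample of rows).

```python
import csv


def count_csv_records(path, has_header=True):
    """Count data records in a CSV file without holding it in memory."""
    with open(path, newline="") as handle:
        # csv.reader copes with quoted fields containing embedded newlines,
        # which a plain line count would overcount. Blank lines are skipped.
        n = sum(1 for row in csv.reader(handle) if row)
    return max(n - 1, 0) if has_header else n


def guess_col_classes(path, sample_rows=1000):
    """Guess "numeric" or "character" per column from the first rows.

    Assumes the first record is a header. Empty strings and "NA" are
    treated as missing and do not affect the guess.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        classes = ["numeric"] * len(header)
        for i, row in enumerate(reader):
            if i >= sample_rows:
                break
            for j, value in enumerate(row[:len(header)]):
                if value in ("", "NA"):
                    continue
                try:
                    float(value)
                except ValueError:
                    classes[j] = "character"
    return dict(zip(header, classes))
```

The two results could then be passed to read.table or read.csv through
their nrows and colClasses arguments, so that R does not have to guess
the size and column types while reading. Whether that is enough to
close the speed gap would need the timings asked for above.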
>
> Could you make some examples available for profiling?
>
> It seems to me that there are some delicate licensing issues in
> distributing a product that writes .rda format except under GPL. See,
> for example, the GPL FAQ.
My understanding is that David is not distributing dataload any more,
though I would not like to discourage commercial vendors (such as
providers of Stat/Transfer and DBMSCOPY) from providing .rda output as
an option. I assume that new code written under GPL would not be a
problem. -Frank
>
> On Thu, 23 Dec 2004, Frank E Harrell Jr wrote:
>
>> There is a recurring need for importing large csv files quickly.
>> David Baird's dataload is a standalone program that will directly
>> create .rda files from .csv (it also handles many other conversions).
>> Unfortunately dataload is no longer publicly available because of some
>> kind of relationship with Stat/Transfer. The idea is a good one,
>> though. I wonder if anyone would volunteer to replicate the csv->rda
>> standalone functionality or to provide some Perl or Python tools for
>> making creation of .rda files somewhat easy outside of R.
>>
>> As an aside, I routinely see 30-fold reductions in file sizes for .rda
>> files (made with save(..., compress=TRUE)) compared with the size of
>> SAS binary datasets. And load() times are fast.
>>
>> It's been a great year for R. Let me take this opportunity to thank
>> the R leaders for a fantastic job that gives immeasurable benefits to
>> the community.
>
>
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University