[Rd] Importing csv files
Prof Brian Ripley
ripley at stats.ox.ac.uk
Thu Dec 23 18:31:37 CET 2004
On Thu, 23 Dec 2004, Frank E Harrell Jr wrote:
> Prof Brian Ripley wrote:
>> I think we need to know what you mean by `large' and why read.table is not
>> fast enough (and hence if some of the planned improvements might be all
>> that is needed).
>
> I was referring to the e-mail exchanges on r-help about read.table a few
> weeks ago, then there was a new discussion the other day concerning RAM usage
> and read.table not knowing the number of rows up front. I believe that the
> posters provided some timings and examples.
I have yet to see any examples which used read.table competently and were
slow (although the RAM usage could be higher than some people expected).
Unless people have followed _all_ the hints in the Data manual, I don't
think there is anything to discuss.
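
For concreteness, here is a minimal sketch of the kind of call those hints point towards. It assumes the hints in question include declaring the column types with colClasses, giving nrows, and switching off comment scanning with comment.char = "". The file and its column names (id, x, grp) are made up for illustration and are not taken from any poster's data.

```r
# Build a small CSV in a temporary file so the sketch is self-contained.
# The columns id, x and grp are hypothetical.
n <- 1000L
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id  = seq_len(n),
             x   = rnorm(n),
             grp = sample(c("a", "b"), n, replace = TRUE)),
  csv_path, row.names = FALSE
)

# Read it back, telling read.table as much as possible up front:
#   colClasses   - declare the column types so they need not be guessed
#   nrows        - give the (maximum) number of data rows in advance
#   comment.char - "" turns off scanning for comment characters
dat <- read.table(csv_path, header = TRUE, sep = ",", quote = "\"",
                  colClasses = c("integer", "numeric", "character"),
                  nrows = n, comment.char = "")

str(dat)
```

Whether this is fast enough for a given `large' file is exactly what timings on a real example would have to show.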
There is an issue with reading factors with just a few unique values, but
that is one of the things being worked on.
>> Could you make some examples available for profiling?
Anyone who actually has a problem, then?
>> It seems to me that there are some delicate licensing issues in
>> distributing a product that writes .rda format except under GPL. See, for
>> example, the GPL FAQ.
>
> My understanding is that David is not distributing dataload any more, though
> I would not like to discourage commercial vendors (such as providers of
> Stat/Transfer and DBMSCOPY) from providing .rda output as an option. I
> assume that new code written under GPL would not be a problem. -Frank
I said `except under GPL'. I am not trying to discourage anyone, just
pointing out that GPL has far-ranging implications that are often
overlooked.
>> On Thu, 23 Dec 2004, Frank E Harrell Jr wrote:
>>
>>> There is a recurring need for importing large csv files quickly. David
>>> Baird's dataload is a standalone program that will directly create .rda
>>> files from .csv (it also handles many other conversions). Unfortunately
>>> dataload is no longer publicly available because of some kind of
>>> relationship with Stat/Transfer. The idea is a good one, though. I
>>> wonder if anyone would volunteer to replicate the csv->rda standalone
>>> functionality or to provide some Perl or Python tools for making creation
>>> of .rda files somewhat easy outside of R.
>>>
>>> As an aside, I routinely see 30-fold reductions in file sizes for .rda
>>> files (made with save(..., compress=TRUE)) compared with the size of SAS
>>> binary datasets. And load( ) times are fast.
>>>
>>> It's been a great year for R. Let me take this opportunity to thank the R
>>> leaders for a fantastic job that gives immeasurable benefits to the
>>> community.
It's certainly been a great year for people to complain about R, R-help
.... We say

    R is a collaborative project with many contributors.

but it seems to me much less than it used to be.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595