[R] the large dataset problem
Ben Bolker
bolker at ufl.edu
Mon Jul 30 18:42:59 CEST 2007
Eric Doviak <edoviak <at> earthlink.net> writes:
>
> Dear useRs,
>
> I recently began a job at a very large and heavily bureaucratic organization.
> We're setting up a research office and statistical analysis will form the
> backbone of our work. We'll be working with large datasets such as the SIPP
> as well as our own administrative data.

We need to know more about what you need to do with those
large data sets in order to help -- giving some specific
examples would be useful. In many situations you can set up a database
connection or use Perl to select carefully and only load the
observations/variables you need into R, but it's hard to make
completely general suggestions.
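
As a minimal sketch of the "only load what you need" idea, using base R alone rather than a database or Perl: read.csv() skips any column whose entry in colClasses is "NULL". The file and its column names (id, income, notes) are invented here purely for illustration.

```r
# Hypothetical example file, created only so the sketch runs on its own
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id     = 1:5,
                     income = c(10, 20, NA, 40, 50),
                     notes  = letters[1:5]),
          tmp, row.names = FALSE)

# One colClasses entry per column in the file; "NULL" means
# "do not read this column at all"
dat <- read.csv(tmp, colClasses = c("integer", "numeric", "NULL"))

names(dat)   # only the columns that were kept
```

The nrows argument of read.csv() can be combined with this to limit the number of observations read as well. Whether this is enough, or whether a real database connection is needed, depends on how large the files are and what you want to do with them.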
I'm not sure what the purpose of your code to read a few
lines of a data file and write it to a CSV file is ... ?
"Vectorizing" your code means figuring out a way to tell R
to do what you want as a single 'vector' operation. For
example, to remove NAs from a vector you could do this:
  newvec = numeric(0)
  for (i in seq(along=oldvec)) {
    if (!is.na(oldvec[i])) newvec = c(newvec,oldvec[i])
  }
but this would be incredibly slow, because c() copies the
whole of newvec every time an element is added;

  newvec = oldvec[!is.na(oldvec)]

or

  newvec = na.omit(oldvec)

would be far faster.
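
To see the difference on your own machine, here is a rough sketch that times both approaches on made-up data. The vector length, the number of NAs, and the helper name loop_version are arbitrary choices for illustration, and the timings will vary from system to system.

```r
set.seed(1)
oldvec <- rnorm(20000)
oldvec[sample(length(oldvec), 2000)] <- NA   # 2000 distinct positions set to NA

# The loop from above, wrapped in a (hypothetical) helper function
loop_version <- function(v) {
  newvec <- numeric(0)
  for (i in seq(along = v)) {
    if (!is.na(v[i])) newvec <- c(newvec, v[i])
  }
  newvec
}

t_loop <- system.time(a <- loop_version(oldvec))
t_vec  <- system.time(b <- oldvec[!is.na(oldvec)])

identical(a, b)   # both approaches keep the same values in the same order
t_loop
t_vec
```

Note that na.omit() gives the same values but also attaches an "na.action" attribute recording which elements were dropped, so its result is not identical() to the indexed version unless the attributes are stripped, for example with as.vector().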