[R] the large dataset problem

Peter Dalgaard p.dalgaard at biostat.ku.dk
Tue Jul 31 11:57:21 CEST 2007


(Ted Harding) wrote:
> On 30-Jul-07 11:40:47, Eric Doviak wrote:
>   
>> [...]
>>     
>
> Sympathies for the constraints you are operating in!
>
>   
>> The "Introduction to R" manual suggests modifying input files with
>> Perl. Any tips on how to get started? Would Perl Data Language (PDL) be
>> a good choice?  http://pdl.perl.org/index_en.html
>>     
>
> I've not used SIPP files, but it seems that they are available in
> "delimited" format, including CSV.
>
> For extracting a subset of fields (especially when large datasets may
> stretch RAM resources) I would use awk rather than perl, since it
> is a much lighter program, transparent to code for, efficient, and
> it will do that job.
>
> On a Linux/Unix system (see below), say I wanted to extract fields
> 1, 1000, 1275, .... , 5678 from a CSV file. Then the 'awk' line
> that would do it would look like
>
> awk '
>  BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," ... $(5678)}
> ' < sippfile.csv > newdata.csv
>
> Awk reads one line at a time, and does with it what you tell it to do.
>   
....

Yes, but notice that there are also options within R. If you use a 
carefully constructed colClasses= argument to 
read.table()/read.csv()/etc. or a what= argument to scan(), you don't read 
more columns than you ask for. The basic trick is to use "NULL" for each 
of the columns that you do NOT want, and preferably "numeric", 
"character", or whatever for those that you do want (NA lets read.table do 
its usual trickery of guessing the type from the contents). However...
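A minimal sketch of both approaches (the file name, column count, and 
types below are assumptions for illustration, not from the SIPP files):

 ## Keep only columns 1 and 3 of a hypothetical 5-column CSV.
 cc <- rep("NULL", 5)                       # drop everything by default
 cc[c(1, 3)] <- c("numeric", "character")   # keep these two, with explicit types
 d1 <- read.csv("sippfile.csv", colClasses = cc)

 ## The scan() equivalent: in the what= list, NULL skips a field,
 ## numeric(0)/character(0) give the mode of a kept field.
 w <- list(numeric(0), NULL, character(0), NULL, NULL)
 d2 <- scan("sippfile.csv", what = w, sep = ",", skip = 1)

Either way, the skipped columns are never stored, which is what keeps 
the memory footprint down.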
>   
>> I wrote a script which loads large datasets a few lines at a time,
>> writes the dozen or so variables of interest to a CSV file, removes
>> the loaded data and then (via a "for" loop) loads the next few lines
>> .... I managed to get it to work with one of the SIPP core files,
>> but it's SLOOOOW.
>>     
>
> See above ...
>
>   
Looking at the actual data files and data dictionaries (we're talking 
about http://www.bls.census.gov/sipp_ftp.html, right?), it looks like 
SIPP files are in a fixed-width format, which suggests that you might 
want to employ read.fwf(). If you want to get really smart about it, 
extract the 'D' fields from the dictionary files.

Try this

 dict <- readLines("ftp://www.sipp.census.gov/pub/sipp/2004/l04puw1d.txt")
 D.lines <- grep("^D ", dict)
 vdict <- read.table(con <- textConnection(dict[D.lines])); close(con)
 head(vdict)

A little bit of further fiddling and you have the list of field widths 
and variable names to feed to read.fwf(). Just subset the name list and 
set the field width negative for those variables that you wish to skip. 
Extracting value labels from the 'V' fields looks like it could be done 
too, but requires more thinking, especially where they straddle multiple 
lines (but hey, it's your job, not mine...)
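The fiddling might look roughly like this. Assumptions (check against 
the dictionary itself): each 'D' line parses as "D <name> <width> <start>", 
the variable names and the data file name below are made up, and 
read.fwf() treats a negative width as "skip this many characters".

 ## Continue from the vdict data frame built above.
 names(vdict) <- c("rectype", "varname", "width", "start")
 keep <- c("ssuid", "epppnum", "tpearn")   # hypothetical variables of interest
 widths <- vdict$width
 drop <- !(vdict$varname %in% keep)
 widths[drop] <- -widths[drop]             # negative width = skip that field
 dat <- read.fwf("l04puw1.dat", widths = widths,
                 col.names = vdict$varname[!drop])

Since read.fwf() never stores the skipped fields, this should cope with 
a wide file far better than reading everything and subsetting afterwards.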

    -Peter D.



More information about the R-help mailing list