[R] Creating a custom connection to read from multiple files
Tomas Kalibera
kalibera at nenya.ms.mff.cuni.cz
Fri Jan 21 14:30:21 CET 2005
Hello Andy,
thanks for your examples. I rewrote everything to use matrices and
lapply/sapply with rbind calls instead of for-loops and appends, and it
really helped. Reading the files one by one and concatenating them is now
even faster than concatenating them on disk: that 8 MB table is read in
3.5 seconds.
Tomas
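
For illustration, the pattern described above could look roughly like the
following. This is only a sketch, not the code from the original scripts:
the helper name read_stacked, its n_col argument, and the temporary file
names are hypothetical, and it assumes whitespace-delimited, all-numeric
files with a known column count.

```r
## Hypothetical helper: read numeric files and stack them into one matrix.
read_stacked <- function(files, n_col) {
  ## One matrix per file; scan() returns a plain numeric vector.
  mats <- lapply(files, function(f)
    matrix(scan(f, quiet = TRUE), ncol = n_col, byrow = TRUE))
  ## A single rbind() call over the whole list, instead of growing
  ## the result step by step inside a for loop.
  do.call(rbind, mats)
}

## Small self-contained demo using temporary files.
tmpdir <- tempfile("stack")
dir.create(tmpdir)
files <- file.path(tmpdir, sprintf("f%02d.txt", 1:3))
x <- matrix(as.numeric(1:20), nrow = 5, ncol = 4)
for (f in files) write(t(x), file = f, ncolumns = ncol(x))

dat <- read_stacked(files, n_col = 4)
dim(dat)   # 3 files x 5 rows each, 4 columns

## Clean up the temporary files.
unlink(tmpdir, recursive = TRUE)
```

The design choice is to build the complete list of pieces first and combine
them once at the end, so the result is not copied again on every iteration.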
>>rbind is vectorized so you are using it (way) suboptimally.
>
>Here's an example:
>
>> ## Create a 500 x 100 data matrix.
>> x <- matrix(rnorm(5e4), 500, 100)
>> ## Generate 50 filenames.
>> fname <- paste("f", formatC(1:50, width=2, flag="0"), ".txt", sep="")
>> ## Write the data to files 50 times.
>> for (f in fname) write(t(x), file=f, ncol=ncol(x))
>>
>> ## Read the files into a list of data frames.
>> system.time(datList <- lapply(fname, read.table, header=FALSE),
>+ gcFirst=TRUE)
>[1] 11.91 0.05 12.33 NA NA
>
>
>> ## Specify colClasses to speed up.
>> system.time(datList <- lapply(fname, read.table,
>+ colClasses=rep("numeric", 100)),
>+ gcFirst=TRUE)
>[1] 10.69 0.07 10.79 NA NA
>
>
>> ## Stack them together.
>> system.time(dat <- do.call("rbind", datList), gcFirst=TRUE)
>[1] 5.34 0.09 5.45 NA NA
>
>
>>
>> ## Use matrices instead of data frames.
>> system.time(datList <- lapply(fname,
>+ function(f) matrix(scan(f), ncol=100, byrow=TRUE)), gcFirst=TRUE)
>Read 50000 items
>...
>Read 50000 items
>[1] 9.49 0.08 15.06 NA NA
>
>
>> system.time(dat <- do.call("rbind", datList), gcFirst=TRUE)
>[1] 0.09 0.03 0.12 NA NA
>
>
>> ## Clean up the files.
>> unlink(fname)
>
>A couple of points:
>
>- Usually specifying colClasses will make read.table() quite a bit
> faster, even though it's only marginally faster here. Look back
> in the list archive to see examples.
>
>- If your data files are all numerics (as in this example),
> storing them in matrices will be much more efficient. Note
> the difference in rbind()ing the 50 data frames and 50
> matrices (5.34 seconds vs. 0.09!). rbind.data.frame()
> needs to ensure that the resulting data frame has unique
> rownames (a requirement for a legit data frame), and
> that's probably taking a big chunk of the time.
>
>Andy
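
The two points in the quoted message can be checked on a small scale with a
sketch like the one below. It does not reproduce the timings above, which
depend on the machine and the R version; it only shows that rbind() on data
frames produces a result with unique row names, that rbind() on plain
matrices carries no row names here, and that colClasses fixes the column
types up front. All object names are made up for the illustration.

```r
## Three data frames and three matrices with identical contents.
x <- matrix(rnorm(40), nrow = 10, ncol = 4)
dfList  <- replicate(3, as.data.frame(x), simplify = FALSE)
matList <- replicate(3, x, simplify = FALSE)

dfAll  <- do.call(rbind, dfList)
matAll <- do.call(rbind, matList)

## The stacked data frame has unique row names ...
anyDuplicated(rownames(dfAll))   # 0 means no duplicates
## ... whereas the stacked matrix has no row names at all.
is.null(rownames(matAll))

## colClasses tells read.table() the column types in advance,
## so it does not have to work them out from the text.
f <- tempfile(fileext = ".txt")
write(t(x), file = f, ncolumns = ncol(x))
d <- read.table(f, header = FALSE, colClasses = rep("numeric", ncol(x)))
sapply(d, class)

## Clean up the temporary file.
unlink(f)
```

To compare speeds on real data, each do.call() line can be wrapped in
system.time() as in the quoted transcript.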