[R] Can I improve the efficiency of my scan() command?
Pierre Kleiber
pkleiber at honlab.nmfs.hawaii.edu
Sat Apr 12 00:07:49 CEST 2003
Ko-Kang Kevin Wang wrote:
> Hi,
>
> Suppose I use the following codes to read in a data set.
>
> ###############################################
>
>>rating <- scan("../Data/Rating.csv",
>
> + what = list(
> + usage = "",
> + mileage = 0,
> + sex = "",
> + excess = "",
> + ncd = "",
> + primage = "",
> + minage = "",
> + drivers = "",
> + district = "",
> + cargroup = "",
> + car.age = 0,
> + wsclms = "",
[...]
>
> #########################################################################
>
> It worked all right, but I'm just wondering if there is a more efficient
> way (it takes about 10 minutes to run the above script for my 300,000 x
> 25 CSV file)?
>
> For example, the CSV file has 25 columns but I don't need 3 of them (6, 7,
> and 22). What I have done is to scan them in anyway, convert the list
> into a data frame, and then remove those 3 columns. I just wonder if it is
> possible to simply ignore them in scan() to make the process faster?
>
It might not make a lot of difference in your case, where you are
reading many fields and ignoring only a few, but if you want to read
a few fields out of many, it helps to preprocess the input file with a
utility such as awk. The following, for example, picks up fields 1, 2,
and 4:
> con <- pipe("awk -F , '{print $1,$2,$4}' ../Data/Rating.csv")
> rating <- scan(con, what = list(
+   usage = "",
+   mileage = 0,
+   excess = ""),
+   quiet = TRUE, skip = 1)
> close(con)
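The same field selection could also be done with cut instead of awk
(assuming, as above, a plain comma-separated file with no quoted fields
containing commas); since cut keeps the commas, scan() then needs
sep = ",", for instance:
> con <- pipe("cut -d, -f1,2,4 ../Data/Rating.csv")
> rating <- scan(con, what = list(
+   usage = "",
+   mileage = 0,
+   excess = ""),
+   sep = ",", quiet = TRUE, skip = 1)
> close(con)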
I do this sort of thing a lot using various utilities; so I've defined
the following function to take care of opening and closing the
connection:
scanpipe <- function(x, ...) {
  con <- pipe(x)          # open a connection to the shell command
  out <- scan(con, ...)   # any further arguments are passed on to scan()
  close(con)
  out
}
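With that, the awk example above could be written, for instance, as:
rating <- scanpipe("awk -F , '{print $1,$2,$4}' ../Data/Rating.csv",
                   what = list(usage = "", mileage = 0, excess = ""),
                   quiet = TRUE, skip = 1)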
--
-----------------------------------------------------------------
Pierre Kleiber Email: pkleiber at honlab.nmfs.hawaii.edu
Fishery Biologist Tel: 808 983-5399/737-7544
NOAA FISHERIES - Honolulu Laboratory Fax: 808 983-2902
2570 Dole St., Honolulu, HI 96822-2396
-----------------------------------------------------------------