[R] Can I improve the efficiency of my scan() command?
Ko-Kang Kevin Wang
kwan022 at stat.auckland.ac.nz
Fri Apr 11 23:23:12 CEST 2003
Hi,
Suppose I use the following code to read in a data set:
###############################################
> rating <- scan("../Data/Rating.csv",
+ what = list(
+ usage = "",
+ mileage = 0,
+ sex = "",
+ excess = "",
+ ncd = "",
+ primage = "",
+ minage = "",
+ drivers = "",
+ district = "",
+ cargroup = "",
+ car.age = 0,
+ wsclms = "",
+ adclms = "",
+ ftclms = "",
+ pdclms = "",
+ piclms = "",
+ adincur = 0,
+ pdincur = 0,
+ wsincur = 0,
+ ftincur = 0,
+ piincur = 0,
+ record = 0,
+ days = 0,
+ minagen = 0,
+ primagen = 0),
+ sep=",", quiet = TRUE, skip = 1)
> rating.df <- as.data.frame(rating)
> rating.df <- rating.df[, c(-6, -7, -22)]
> attach(rating.df)
> summary(rating.df)
 usage       mileage         sex        excess       ncd        drivers
 S :125788   Min.   :  288   F: 82208   0  :  4744   0:   880   1:100791
 SB: 12581   1st Qu.: 5000   M:217792   100:161311   1:  2819   2:175100
 SC:161524   Median : 8000              75 :133945   2:  5245   3: 19146
 ST:   107   Mean   : 7640                           3:  5230   4:  4156
             3rd Qu.:10000                           4:285826   5:   515
             Max.   :40000                                      6:    69
                                                                7:   223
 district        cargroup        car.age          wsclms     adclms
 6      :59053   8      :44524   Min.   :-1.000   0:294521   0:292852
 5      :57113   6      :39171   1st Qu.: 4.000   1:  5267   1:  6720
 7      :51166   9      :38965   Median : 7.000   2:   201   2:   405
 4      :50643   7      :35139   Mean   : 7.234   3:    11   3:    23
 3      :33041   10     :31091   3rd Qu.:10.000
 8      :16437   5      :27456   Max.   :30.000
 (Other):32547   (Other):83654
 ftclms     pdclms     piclms     adincur            pdincur
 0:298661    :281056    :281056   Min.   :    0.00   Min.   : -4985.2
 1:  1316   0: 15277   0: 18131   1st Qu.:    0.00   1st Qu.:     0.0
 2:    22   1:  3587   1:   809   Median :    0.00   Median :     0.0
 3:     1   2:    79   2:     4   Mean   :   21.25   Mean   :   225.4
            3:     1              3rd Qu.:    0.00   3rd Qu.:     0.0
                                  Max.   :13779.55   Max.   : 25050.0
                                                     NA's   :281056.0
 wsincur           ftincur             piincur            days
 Min.   :   0.00   Min.   :    0.000   Min.   :     0.0   Min.   :  0.0
 1st Qu.:   0.00   1st Qu.:    0.000   1st Qu.:     0.0   1st Qu.:123.0
 Median :   0.00   Median :    0.000   Median :     0.0   Median :340.0
 Mean   :   2.07   Mean   :    5.183   Mean   :   345.8   Mean   :248.7
 3rd Qu.:   0.00   3rd Qu.:    0.000   3rd Qu.:     0.0   3rd Qu.:364.0
 Max.   :2004.64   Max.   :25082.910   Max.   :484550.1   Max.   :365.0
                                       NA's   :281056.0
 minagen         primagen
 Min.   :17.00   Min.   :17.00
 1st Qu.:41.00   1st Qu.:43.00
 Median :56.00   Median :53.00
 Mean   :63.81   Mean   :53.25
 3rd Qu.:99.00   3rd Qu.:64.00
 Max.   :99.00   Max.   :93.00
#########################################################################
It works, but I'm wondering whether there is a more efficient way: the
script above takes about 10 minutes to run on my 300,000 x 25 CSV file.
For example, the CSV file has 25 columns, but I don't need 3 of them (6, 7,
and 22). At the moment I scan them in anyway, convert the list into a data
frame, and then remove the 3 columns. Is it possible to simply ignore them
in scan() to make the process faster?
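
To make the question concrete, here is a minimal sketch of what skipping fields at read time could look like. It assumes a current version of R, in which a NULL component in the `what` list tells scan() to skip that field and a colClasses entry of "NULL" does the same for read.csv(). The four-column file, the temporary path and the names `small` and `small2` are made up for illustration; they are not the real Rating.csv.

```r
# Hypothetical miniature of the CSV file: a header line plus two records.
csv_lines <- c("usage,mileage,sex,record",
               "S,5000,F,1",
               "SC,8000,M,2")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# A NULL component in `what` makes scan() read past that field without
# storing it; here the 4th field ("record") is skipped.
small <- scan(tmp,
              what = list(usage = "", mileage = 0, sex = "", record = NULL),
              sep = ",", quiet = TRUE, skip = 1)

# The skipped field comes back as a NULL placeholder; drop it before
# converting the list to a data frame.
small <- small[!sapply(small, is.null)]
small.df <- as.data.frame(small)

# Alternative sketch: read.csv() with explicit column classes, where the
# string "NULL" marks a column to be skipped.
small2.df <- read.csv(tmp,
                      colClasses = c("character", "numeric",
                                     "character", "NULL"))
```

For the 25-column file the same idea would mean writing `primage = NULL`, `minage = NULL` and `record = NULL` in the existing `what` list. Whether this noticeably reduces the 10 minute run time would have to be measured, for example with system.time().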
--
Cheers,
Kevin
------------------------------------------------------------------------------
/* Time is the greatest teacher, unfortunately it kills its students */
--
Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599
x88475 (City)
x88480 (Tamaki)