[R] Can I improve the efficiency of my scan() command?

Fri Apr 11 23:23:12 CEST 2003

Hi,

Suppose I use the following code to read in a data set.

###############################################
> rating <- scan("../Data/Rating.csv",
+                what = list(
+                  usage = "",
+                  mileage = 0,
+                  sex = "",
+                  excess = "",
+                  ncd = "",
+                  primage = "",
+                  minage = "",
+                  drivers = "",
+                  district = "",
+                  cargroup = "",
+                  car.age = 0,
+                  wsclms = "",
+                  adclms = "",
+                  ftclms = "",
+                  pdclms = "",
+                  piclms = "",
+                  adincur = 0,
+                  pdincur = 0,
+                  wsincur = 0,
+                  ftincur = 0,
+                  piincur = 0,
+                  record = 0,
+                  days = 0,
+                  minagen = 0,
+                  primagen = 0),
+                sep=",", quiet = TRUE, skip = 1)
> rating.df <- as.data.frame(rating)
> rating.df <- rating.df[, c(-6, -7, -22)]
> attach(rating.df)
> summary(rating.df)
 usage          mileage      sex        excess       ncd        drivers   
 S :125788   Min.   :  288   F: 82208   0  :  4744   0:   880   1:100791  
 SB: 12581   1st Qu.: 5000   M:217792   100:161311   1:  2819   2:175100  
 SC:161524   Median : 8000              75 :133945   2:  5245   3: 19146  
 ST:   107   Mean   : 7640                           3:  5230   4:  4156  
             3rd Qu.:10000                           4:285826   5:   515  
             Max.   :40000                                      6:    69  
                                                                7:   223  
    district        cargroup        car.age       wsclms     adclms    
 6      :59053   8      :44524   Min.   :-1.000   0:294521   0:292852  
 5      :57113   6      :39171   1st Qu.: 4.000   1:  5267   1:  6720  
 7      :51166   9      :38965   Median : 7.000   2:   201   2:   405  
 4      :50643   7      :35139   Mean   : 7.234   3:    11   3:    23  
 3      :33041   10     :31091   3rd Qu.:10.000                        
 8      :16437   5      :27456   Max.   :30.000                        
 (Other):32547   (Other):83654                                         
 ftclms     pdclms     piclms        adincur            pdincur        
 0:298661    :281056    :281056   Min.   :    0.00   Min.   : -4985.2  
 1:  1316   0: 15277   0: 18131   1st Qu.:    0.00   1st Qu.:     0.0  
 2:    22   1:  3587   1:   809   Median :    0.00   Median :     0.0  
 3:     1   2:    79   2:     4   Mean   :   21.25   Mean   :   225.4  
            3:     1              3rd Qu.:    0.00   3rd Qu.:     0.0  
                                  Max.   :13779.55   Max.   : 25050.0  
                                                     NA's   :281056.0  
    wsincur           ftincur             piincur              days      
 Min.   :   0.00   Min.   :    0.000   Min.   :     0.0   Min.   :  0.0  
 1st Qu.:   0.00   1st Qu.:    0.000   1st Qu.:     0.0   1st Qu.:123.0  
 Median :   0.00   Median :    0.000   Median :     0.0   Median :340.0  
 Mean   :   2.07   Mean   :    5.183   Mean   :   345.8   Mean   :248.7  
 3rd Qu.:   0.00   3rd Qu.:    0.000   3rd Qu.:     0.0   3rd Qu.:364.0  
 Max.   :2004.64   Max.   :25082.910   Max.   :484550.1   Max.   :365.0  
                                       NA's   :281056.0                  
    minagen         primagen    
 Min.   :17.00   Min.   :17.00  
 1st Qu.:41.00   1st Qu.:43.00  
 Median :56.00   Median :53.00  
 Mean   :63.81   Mean   :53.25  
 3rd Qu.:99.00   3rd Qu.:64.00  
 Max.   :99.00   Max.   :93.00  

#########################################################################

This works, but I wonder whether there is a more efficient way: the 
script above takes about 10 minutes to run on my 300,000 x 25 CSV file.

For example, the CSV file has 25 columns, but I don't need 3 of them (6, 7, 
and 22).  At present I scan them in anyway, convert the list into a data 
frame, and then remove the 3 columns.  Is it possible to simply skip them 
in scan() to make the process faster?
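
For reference, a minimal sketch of one possible approach. The help page 
for scan() says that a NULL component in the `what` list causes that 
field to be skipped. The tiny CSV file, its column names, and the object 
names demo and demo.df below are made up for illustration (this is not 
Rating.csv), and whether skipping fields shortens the run time noticeably 
on the real file is untested here.

```r
# Write a tiny illustrative CSV file (hypothetical data, 4 columns).
tmp <- tempfile(fileext = ".csv")
writeLines(c("usage,mileage,sex,record",
             "S,5000,F,1",
             "SC,8000,M,2"), tmp)

# A NULL component in `what` asks scan() to skip that field:
# here the third column ("sex") is not read.
demo <- scan(tmp,
             what = list(usage = "", mileage = 0, sex = NULL, record = 0),
             sep = ",", skip = 1, quiet = TRUE)

# Drop any NULL placeholders left in the list, then build the data frame.
demo <- demo[!sapply(demo, is.null)]
demo.df <- as.data.frame(demo)
str(demo.df)

unlink(tmp)
```

Applied to the call above, this would mean writing primage = NULL, 
minage = NULL and record = NULL in `what`, after which the 
rating.df[, c(-6, -7, -22)] step should no longer be needed.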

-- 
Cheers,

Kevin

------------------------------------------------------------------------------
/* Time is the greatest teacher, unfortunately it kills its students */

--
Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599
    x88475 (City)
    x88480 (Tamaki)