[R] how to import such data to R?

Sat Oct 15 18:43:37 CEST 2005

On Sat, 2005-10-15 at 23:54 +0800, ronggui wrote:
> It seems my last post was not sent successfully, so I am posting again.
> 
> -------------
> the data file has such structure:
> 
>      1992       6245         49          .          .         20          1
>         0          0   8.739536          0          .          .          .
>         .          .          .          .          .            "alabama"
>         .          0          .
>      1993       7677         58          .          .         15          1
>         0          0   8.945984          1          .          0   .2064476
>        -5          0          .          0   8.739536            "alabama"
>         9          0          0
>      1992      13327         57         36         58         16          0
>         0          0   9.497547          0         47          .          .
>         .          .          .          0          .            "arizona"
>         .          0          .
>      1993      19860         57         36         58         16          1
>         1          0   9.896463          1         47          0   .3989162
>         0          1          0          1   9.497547            "arizona"
>         0          1          1
>      1992      10422         37         28         58         20          0
>         0          0   9.251675          0         43          .          .
>         .          .          .         -1          .      "arizona state"
>         .          0          .
> 
> ------snip-----
> 
> the data description is:
> 
> variable names:
> 
> year      apps      top25     ver500    mth500    stufac    bowl      btitle   
> finfour   lapps     d93       avg500    cfinfour  clapps    cstufac   cbowl    
> cavg500   cbtitle   lapps_1   school    ctop25    bball     cbball    
> 
>   Obs:   118
> 
>   1. year                     1992 or 1993
>   2. apps                     # applics for admission
>   3. top25                    perc frosh class in 25th high sch percen
>   4. ver500                   perc frosh >= 500 on verbal SAT
>   5. mth500                   perc frosh >= 500 on math SAT
>   6. stufac                   student-faculty ratio
>   7. bowl                     = 1 if bowl game in prev year
>   8. btitle                   = 1 if men's cnf chmps prev year
>   9. finfour                  = 1 if men's final 4 prev year
>  10. lapps                    log(apps)
>  11. d93                      =1 if year = 1993
>  12. avg500                   (ver500+mth500)/2
>  13. cfinfour                 change in finfour
>  14. clapps                   change in lapps
>  15. cstufac                  change in stufac
>  16. cbowl                    change in bowl
>  17. cavg500                  change in avg500
>  18. cbtitle                  change in btitle
>  19. lapps_1                  lapps lagged
>  20. school                   university name
>  21. ctop25                   change in top25
>  22. bball                    =1 if btitle or finfour
>  23. cbball                   change in bball
> 
> 
> So every four lines represent one case, and some variables are numeric and some are character.
> I thought scan() could read it in, but it seems somewhat tricky because of the mixed variable types. Any suggestions?

There may be an easier way, but here is one possible approach:

First, use scan() to read in the data. Set the 'what' argument to a list
of atomic data types, based upon your specs above. Also, set the
'na.strings' argument to '.'.

This will read the multiple lines for each record into a single
record, based upon there being 23 elements per record. That count comes
from 'length(what)'. Note also the 'multi.line' argument in scan(), which
defaults to TRUE and is what allows a record to span several lines.

data <- scan("data.txt", 
             what = c(rep(list(numeric(0)), 19), 
                      list(character(0)), 
                      rep(list(numeric(0)), 3)), 
             na.strings = ".")
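
As a quick check of the call above, here is a minimal sketch that feeds
two records from the posted sample to scan() through its 'text' argument
instead of a file. The names 'txt', 'fields' and 'rec' are made up for
this illustration, and 'text =' assumes a reasonably recent version of R.

```r
# Two four-line records copied from the sample data in the question.
txt <- '
     1992       6245         49          .          .         20          1
        0          0   8.739536          0          .          .          .
        .          .          .          .          .            "alabama"
        .          0          .
     1992      10422         37         28         58         20          0
        0          0   9.251675          0         43          .          .
        .          .          .         -1          .      "arizona state"
        .          0          .
'

# Template for one record: 19 numeric fields, 1 character field, 3 numeric.
fields <- c(rep(list(numeric(0)), 19),
            list(character(0)),
            rep(list(numeric(0)), 3))

# scan() fills the 23 fields in order, continuing across line breaks.
rec <- scan(text = txt, what = fields, na.strings = ".", quiet = TRUE)

length(rec)   # one list element per field
rec[[20]]     # the quoted school names, embedded space kept
rec[[4]]      # ver500: "." in the first record is read as NA
```

The quoted "arizona state" stays a single field because scan() honours
double quotes when fields are separated by white space.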

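'data' is now a list of values, where each list element is a proper
column from your original data file. Now use as.data.frame(), which will
take each list element and turn it into a column in a data frame,
preserving the numeric types. (In versions of R before 4.0.0 the
character column becomes a factor unless you pass
'stringsAsFactors = FALSE'.)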

data <- as.data.frame(data)

Now, read in the column names for the data frame from a text file,
containing your field names above, and set the data frame column names
to these.

Names <- scan("names.txt", what = character(0))
names(data) <- Names
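
A possible variation, sketched here under the assumption that you would
rather not keep a separate names file: if the 'what' list itself is
named, scan() returns a list carrying those names, so the data frame is
labelled in one pass. The names 'vars', 'fields', 'txt' and 'dat' are
hypothetical, and only one record from the sample is used.

```r
# Variable names in file order, taken from the description in the question.
vars <- c("year", "apps", "top25", "ver500", "mth500", "stufac", "bowl",
          "btitle", "finfour", "lapps", "d93", "avg500", "cfinfour",
          "clapps", "cstufac", "cbowl", "cavg500", "cbtitle", "lapps_1",
          "school", "ctop25", "bball", "cbball")

# Named template: every field numeric except 'school', which is character.
fields <- lapply(vars, function(v) {
  if (v == "school") character(0) else numeric(0)
})
names(fields) <- vars

# One four-line record (alabama, 1993) copied from the sample data.
txt <- '
     1993       7677         58          .          .         15          1
        0          0   8.945984          1          .          0   .2064476
       -5          0          .          0   8.739536            "alabama"
        9          0          0
'

# The names on 'fields' carry over to the list that scan() returns.
rec <- scan(text = txt, what = fields, na.strings = ".", quiet = TRUE)

# stringsAsFactors = FALSE keeps 'school' as character in any R version.
dat <- as.data.frame(rec, stringsAsFactors = FALSE)
str(dat)
```

With a real file, replace 'text = txt' with the file name, as in the
scan() call earlier in this message.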

Now review the contents of 'data':

> data
  year  apps top25 ver500 mth500 stufac bowl btitle finfour    lapps
1 1992  6245    49     NA     NA     20    1      0       0 8.739536
2 1993  7677    58     NA     NA     15    1      0       0 8.945984
3 1992 13327    57     36     58     16    0      0       0 9.497547
4 1993 19860    57     36     58     16    1      1       0 9.896463
5 1992 10422    37     28     58     20    0      0       0 9.251675
  d93 avg500 cfinfour    clapps cstufac cbowl cavg500 cbtitle  lapps_1
1   0     NA       NA        NA      NA    NA      NA      NA       NA
2   1     NA        0 0.2064476      -5     0      NA       0 8.739536
3   0     47       NA        NA      NA    NA      NA       0       NA
4   1     47        0 0.3989162       0     1       0       1 9.497547
5   0     43       NA        NA      NA    NA      NA      -1       NA
         school ctop25 bball cbball
1       alabama     NA     0     NA
2       alabama      9     0      0
3       arizona     NA     0     NA
4       arizona      0     1      1
5 arizona state     NA     0     NA

HTH,

Marc Schwartz