[R] how to import such data to R?
Marc Schwartz
MSchwartz at mn.rr.com
Sat Oct 15 18:43:37 CEST 2005
On Sat, 2005-10-15 at 23:54 +0800, ronggui wrote:
> It seems my last post was not sent successfully, so I am posting again.
>
> -------------
> the data file has such structure:
>
> 1992 6245 49 . . 20 1
> 0 0 8.739536 0 . . .
> . . . . . "alabama"
> . 0 .
> 1993 7677 58 . . 15 1
> 0 0 8.945984 1 . 0 .2064476
> -5 0 . 0 8.739536 "alabama"
> 9 0 0
> 1992 13327 57 36 58 16 0
> 0 0 9.497547 0 47 . .
> . . . 0 . "arizona"
> . 0 .
> 1993 19860 57 36 58 16 1
> 1 0 9.896463 1 47 0 .3989162
> 0 1 0 1 9.497547 "arizona"
> 0 1 1
> 1992 10422 37 28 58 20 0
> 0 0 9.251675 0 43 . .
> . . . -1 . "arizona state"
> . 0 .
>
> ------snip-----
>
> the data description is:
>
> variable names:
>
> year apps top25 ver500 mth500 stufac bowl btitle
> finfour lapps d93 avg500 cfinfour clapps cstufac cbowl
> cavg500 cbtitle lapps_1 school ctop25 bball cbball
>
> Obs: 118
>
> 1. year 1992 or 1993
> 2. apps # applics for admission
> 3. top25 perc frosh class in 25th high sch percen
> 4. ver500 perc frosh >= 500 on verbal SAT
> 5. mth500 perc frosh >= 500 on math SAT
> 6. stufac student-faculty ratio
> 7. bowl = 1 if bowl game in prev year
> 8. btitle = 1 if men's cnf chmps prev year
> 9. finfour = 1 if men's final 4 prev year
> 10. lapps log(apps)
> 11. d93 =1 if year = 1993
> 12. avg500 (ver500+mth500)/2
> 13. cfinfour change in finfour
> 14. clapps change in lapps
> 15. cstufac change in stufac
> 16. cbowl change in bowl
> 17. cavg500 change in avg500
> 18. cbtitle change in btitle
> 19. lapps_1 lapps lagged
> 20. school university name
> 21. ctop25 change in top25
> 22. bball =1 if btitle or finfour
> 23. cbball change in bball
>
>
> So each group of four lines represents one case, and some variables are numeric and some are character.
> I thought scan() could read it in, but it seems somewhat tricky because of the mixed variable types. Any suggestions?
There may be an easier way, but here is one possible approach:
First, use scan() to read in the data. Set the 'what' argument to a list
of atomic data types, based upon your specs above. Also, set the
'na.strings' argument to '.'.
This will read the multiple lines of each record into a single record,
because scan() fills 'length(what)' = 23 fields per record regardless of
line breaks. Note also the 'multi.line' argument in scan().
data <- scan("data.txt",
             what = c(rep(list(numeric(0)), 19),
                      list(character(0)),
                      rep(list(numeric(0)), 3)),
             na.strings = ".")
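To see the mechanism in isolation, here is a minimal sketch on made-up data. It assumes a version of R whose scan() accepts the 'text' argument, and the three fields (id, score, name) are hypothetical, not from the file above. Each record is split over two lines, a '.' marks a missing value, and a quoted name contains a space.

```r
# Hypothetical three-field records (id, score, name), each split over two lines.
txt <- '1 .
"alpha"
2 3.5
"beta gamma"'

# One list element per field; scan() keeps reading across line breaks
# until all three fields of a record are filled.
rec <- scan(text = txt,
            what = list(id = numeric(0),
                        score = numeric(0),
                        name = character(0)),
            na.strings = ".",
            quiet = TRUE)

str(rec)
```

The quoted "beta gamma" should come through as a single character value, which is the same behaviour that keeps "arizona state" together in the real file.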
'data' is now a list of values, where each list element is a proper
column from your original data file. Now use as.data.frame(), which will
take each list element and turn it into a column in a data frame,
preserving the data types.
data <- as.data.frame(data)
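As a small illustration of the list-to-data-frame step, the sketch below uses a hand-made two-column list (hypothetical, not the full 23 fields). Passing stringsAsFactors = FALSE is an assumption added here: it keeps the school names as character regardless of the R version's default, whereas older versions of R would otherwise convert them to a factor.

```r
# A hand-made list standing in for the result of scan(): one element per column.
cols <- list(year   = c(1992, 1993),
             school = c("alabama", "alabama"))

# Each list element becomes one column; stringsAsFactors = FALSE keeps
# character data as character rather than converting it to a factor.
df <- as.data.frame(cols, stringsAsFactors = FALSE)

sapply(df, class)
```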
Now, read in the column names for the data frame from a text file,
containing your field names above, and set the data frame column names
to these.
Names <- scan("names.txt", what = character(0))
names(data) <- Names
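A possible variation, sketched here on a hypothetical three-field subset, is to name the elements of the 'what' list before calling scan(), so the names carry through and no separate renaming step is needed. setNames() is from the stats package, which is attached by default.

```r
# Hypothetical subset of the field names, in file order.
field_names <- c("year", "apps", "school")

# Attach the names to the 'what' list; scan() returns a list with the same names.
what <- setNames(list(numeric(0), numeric(0), character(0)), field_names)

rec <- scan(text = '1992 6245 "alabama"', what = what, quiet = TRUE)

names(rec)
```

Either way, it is worth checking that the number of names equals the number of fields (23 for the real file) before assigning them.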
Now review the contents of 'data':
> data
  year  apps top25 ver500 mth500 stufac bowl btitle finfour    lapps
1 1992  6245    49     NA     NA     20    1      0       0 8.739536
2 1993  7677    58     NA     NA     15    1      0       0 8.945984
3 1992 13327    57     36     58     16    0      0       0 9.497547
4 1993 19860    57     36     58     16    1      1       0 9.896463
5 1992 10422    37     28     58     20    0      0       0 9.251675
  d93 avg500 cfinfour    clapps cstufac cbowl cavg500 cbtitle  lapps_1
1   0     NA       NA        NA      NA    NA      NA      NA       NA
2   1     NA        0 0.2064476      -5     0      NA       0 8.739536
3   0     47       NA        NA      NA    NA      NA       0       NA
4   1     47        0 0.3989162       0     1       0       1 9.497547
5   0     43       NA        NA      NA    NA      NA      -1       NA
         school ctop25 bball cbball
1       alabama     NA     0     NA
2       alabama      9     0      0
3       arizona     NA     0     NA
4       arizona      0     1      1
5 arizona state     NA     0     NA
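One way to confirm that the fields landed in the right columns is to test them against the definitions in the data description. The sketch below re-enters the first two records by hand (only the columns needed) and checks that lapps is log(apps) and that d93 flags 1993. The names chk, lapps_ok and d93_ok are made up for this illustration.

```r
# First two records from the post, entered by hand (only the columns needed).
chk <- data.frame(year  = c(1992, 1993),
                  apps  = c(6245, 7677),
                  lapps = c(8.739536, 8.945984),
                  d93   = c(0, 1))

# lapps is documented as log(apps); the file stores it rounded, so compare
# with a tolerance rather than for exact equality.
lapps_ok <- isTRUE(all.equal(log(chk$apps), chk$lapps, tolerance = 1e-6))

# d93 is documented as 1 when year is 1993.
d93_ok <- all(chk$d93 == as.numeric(chk$year == 1993))

c(lapps_ok = lapps_ok, d93_ok = d93_ok)
```

If a field had been shifted by one position during the import, checks like these would be likely to fail.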
HTH,
Marc Schwartz