[R] Ever see a stata import problem like this?
Thomas Lumley
tlumley at u.washington.edu
Wed Sep 22 03:13:55 CEST 2004
On Tue, 21 Sep 2004, Paul Johnson wrote:
> Greetings Everybody:
>
> I generated a 1.2MB dta file based on the general social survey with Stata8
> for linux. The file can be re-opened with Stata, but when I bring it into R,
> it says all the values are missing for most of the variables.
You need read.dta("morgen.dta", convert.factors=FALSE)
You have variables with labels for some, but not all, of their values.
When these are converted to R factors you lose the unlabelled values. R
does not have a data type that is sometimes labelled and sometimes
numeric.
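
A small base-R illustration of the point above may help. The codes 98 and 99 and their labels are invented for this sketch and are not taken from morgen.dta; it only shows what happens when a factor is built from a partial set of labels.

```r
# Hypothetical variable: hours worked, where only the two special codes
# (98, 99) carry value labels. Codes and labels are made up for this sketch.
hrs <- c(40, 35, 98, 60, 99)

# Build a factor whose levels are only the labelled codes. Every value
# without a label falls outside the level set and becomes NA.
hrs_factor <- factor(hrs, levels = c(98, 99), labels = c("DK", "NA"))

table(hrs_factor, useNA = "ifany")

# Number of unlabelled (i.e. real numeric) values that were lost.
n_lost <- sum(is.na(hrs_factor))
n_lost
```

Note that the level named "NA" is an ordinary label here, not a missing value; only the three unlabelled entries are counted as missing.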
When you use convert.factors=FALSE the label information is still read in
and returned as an attribute of the data frame, so you can set individual
variables to be factors.
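
As a sketch of that last step: the helper below, label_one(), is hypothetical and not part of the foreign package. It assumes the attribute names "val.labels" and "label.table" that the read.dta documentation describes; whether they are present may depend on the version of foreign and of the Stata file. The toy data frame and its labels are invented so that the example runs without the original file.

```r
# Turn one variable of a read.dta(convert.factors = FALSE) result into a
# factor, using the value labels stored as attributes of the data frame.
# label_one() is a hypothetical helper written for this example.
label_one <- function(df, var) {
  label_sets <- attr(df, "label.table")   # list of named vectors: label -> code
  set_names  <- attr(df, "val.labels")    # one label-set name per column
  if (is.null(label_sets) || is.null(set_names)) return(df)

  set_name <- set_names[match(var, names(df))]
  if (is.na(set_name) || !nzchar(set_name)) return(df)  # variable has no labels

  labs <- label_sets[[set_name]]
  if (is.null(labs)) return(df)                         # label set not stored

  df[[var]] <- factor(df[[var]], levels = labs, labels = names(labs))
  df
}

# Toy stand-in for a read.dta() result; values and labels are invented.
toy <- data.frame(hrs1 = c(40, 35, 98), income = c(1, 2, 2))
attr(toy, "val.labels")  <- c("hrs1", "income")
attr(toy, "label.table") <- list(hrs1   = c(DK = 98),
                                 income = c(LOW = 1, HIGH = 2))

toy <- label_one(toy, "income")   # fully labelled: safe to make a factor
str(toy)                          # hrs1 stays numeric, income is a factor

# With the file from the original post (needs the foreign package):
#   library(foreign)
#   myDat <- read.dta("morgen.dta", convert.factors = FALSE)
#   myDat <- label_one(myDat, "income")
```

Converting only the variables whose values are all labelled (income here) keeps partially labelled ones such as hrs1 numeric, so no data are lost.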
-thomas
>
> This dataset is called "morgen.dta" and I dropped a copy online in case you
> are interested
>
> http://www.ku.edu/~pauljohn/R/morgen.dta
>
> looks like this to R (I tried various options on the read.dta command):
>
>> myDat <- read.dta("morgen.dta")
>> summary(myDat)
>      CASEID            year            id           hrs1         hrs2
> Min.   :   19721   Min.   :1972   Min.   :   1   NAP :    0   NAP :    0
> 1st Qu.: 1983475   1st Qu.:1978   1st Qu.: 445   DK  :    0   DK  :    0
> Median : 1996808   Median :1987   Median : 905   NA  :    0   NA  :    0
> Mean   : 9963040   Mean   :1986   Mean   : 990   NA's:40933   NA's:40933
> 3rd Qu.:19872187   3rd Qu.:1994   3rd Qu.:1358
> Max.   :20002817   Max.   :2000   Max.   :3247
>
>     prestige         agewed         age          educ         paeduc
> DK,NA,NAP:    0   NAP :    0   DK  :    0   NAP :    0   NAP :    0
> NA's     :40933   DK  :    0   NA  :    0   DK  :    0   DK  :    0
>                   NA  :    0   NA's:40933   NA  :    0   NA  :    0
>                   NA's:40933                NA's:40933   NA's:40933
>
>
>
>   maeduc        speduc             income
> NAP :    0   NAP :    0   $25000 OR MORE:14525
> DK  :    0   DK  :    0   $10000 - 14999: 5022
> NA  :    0   NA  :    0   $15000 - 19999: 3869
> NA's:40933   NA's:40933   $20000 - 24999: 3664
>                           REFUSED       : 1877
>                           (Other)       : 8523
>                           NA's          : 3453
>>
>
>
> Here's what Stata sees when I load the same thing:
>
> summarize, detail
>
> Case identification number
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 197432 19721
> 5% 199649 19722
> 10% 1974116 19723 Obs 40933
> 25% 1983475 19724 Sum of Wgt. 40933
>
> 50% 1996808 Mean 9963040
> Largest Std. Dev. 9006352
> 75% 1.99e+07 2.00e+07
> 90% 2.00e+07 2.00e+07 Variance 8.11e+13
> 95% 2.00e+07 2.00e+07 Skewness .18931
> 99% 2.00e+07 2.00e+07 Kurtosis 1.045409
>
> GSS YEAR FOR THIS RESPONDENT
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 1972 1972
> 5% 1973 1972
> 10% 1974 1972 Obs 40933
> 25% 1978 1972 Sum of Wgt. 40933
>
> 50% 1987 Mean 1986.421
> Largest Std. Dev. 8.61136
> 75% 1994 2000
> 90% 1998 2000 Variance 74.15552
> 95% 2000 2000 Skewness -.0789223
> 99% 2000 2000 Kurtosis 1.799939
>
> RESPONDENT ID NUMBER
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 18 1
> 5% 89 1
> 10% 178 1 Obs 40933
> 25% 445 1 Sum of Wgt. 40933
>
> 50% 905 Mean 989.9129
> Largest Std. Dev. 689.0596
> 75% 1358 3244
> 90% 2027 3245 Variance 474803.2
> 95% 2437 3246 Skewness .8359211
> 99% 2867 3247 Kurtosis 3.311248
>
> NUMBER OF HOURS WORKED LAST WEEK
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 6 0
> 5% 15 0
> 10% 21 0 Obs 23279
> 25% 37 0 Sum of Wgt. 23279
>
> 50% 40 Mean 41.05206
> Largest Std. Dev. 13.95931
> 75% 48 89
> 90% 60 89 Variance 194.8624
> 95% 65 89 Skewness .195045
> 99% 82 89 Kurtosis 4.448998
>
> NUMBER OF HOURS USUALLY WORK A WEEK
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 4 0
> 5% 15 0
> 10% 20 1 Obs 774
> 25% 38 2 Sum of Wgt. 774
>
> 50% 40 Mean 39.79199
> Largest Std. Dev. 13.43383
> 75% 45 89
> 90% 55 89 Variance 180.4677
> 95% 60 89 Skewness -.0002332
> 99% 80 89 Kurtosis 5.009869
>
> RS OCCUPATIONAL PRESTIGE SCORE (1970)
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 14 12
> 5% 17 12
> 10% 20 12 Obs 24267
> 25% 30 12 Sum of Wgt. 24267
>
> 50% 39 Mean 39.35645
> Largest Std. Dev. 14.03712
> 75% 48 82
> 90% 60 82 Variance 197.0407
> 95% 62 82 Skewness .2927414
> 99% 76 82 Kurtosis 2.775553
>
> AGE WHEN FIRST MARRIED
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 15 12
> 5% 17 12
> 10% 17 12 Obs 25382
> 25% 19 12 Sum of Wgt. 25382
>
> 50% 21 Mean 22.09609
> Largest Std. Dev. 4.813944
> 75% 24 63
> 90% 28 68 Variance 23.17405
> 95% 31 73 Skewness 2.002265
> 99% 39 73 Kurtosis 11.28279
>
> AGE OF RESPONDENT
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 19 18
> 5% 21 18
> 10% 24 18 Obs 40790
> 25% 30 18 Sum of Wgt. 40790
>
> 50% 42 Mean 45.14798
> Largest Std. Dev. 17.53519
> 75% 58 89
> 90% 71 89 Variance 307.4828
> 95% 77 89 Skewness .4774907
> 99% 86 89 Kurtosis 2.239618
>
> HIGHEST YEAR OF SCHOOL COMPLETED
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 3 0
> 5% 7 0
> 10% 8 0 Obs 40806
> 25% 11 0 Sum of Wgt. 40806
>
> 50% 12 Mean 12.48152
> Largest Std. Dev. 3.176226
> 75% 14 20
> 90% 16 20 Variance 10.08841
> 95% 18 20 Skewness -.3389303
> 99% 20 20 Kurtosis 3.960311
>
> HIGHEST YEAR SCHOOL COMPLETED, FATHER
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 0 0
> 5% 3 0
> 10% 4 0 Obs 29347
> 25% 8 0 Sum of Wgt. 29347
>
> 50% 11 Mean 10.20994
> Largest Std. Dev. 4.342143
> 75% 12 20
> 90% 16 20 Variance 18.85421
> 95% 17 20 Skewness -.1628909
> 99% 20 20 Kurtosis 2.826482
>
> HIGHEST YEAR SCHOOL COMPLETED, MOTHER
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 0 0
> 5% 3 0
> 10% 6 0 Obs 34151
> 25% 8 0 Sum of Wgt. 34151
>
> 50% 12 Mean 10.41478
> Largest Std. Dev. 3.709352
> 75% 12 20
> 90% 14 20 Variance 13.75929
> 95% 16 20 Skewness -.6324499
> 99% 18 20 Kurtosis 3.605715
>
> HIGHEST YEAR SCHOOL COMPLETED, SPOUSE
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 4 0
> 5% 7 0
> 10% 8 0 Obs 22780
> 25% 12 0 Sum of Wgt. 22780
>
> 50% 12 Mean 12.53095
> Largest Std. Dev. 3.103418
> 75% 14 20
> 90% 16 20 Variance 9.631203
> 95% 18 20 Skewness -.287755
> 99% 20 20 Kurtosis 4.051822
>
> TOTAL FAMILY INCOME
> -------------------------------------------------------------
> Percentiles Smallest
> 1% 1 1
> 5% 3 1
> 10% 5 1 Obs 37480
> 25% 9 1 Sum of Wgt. 37480
>
> 50% 11 Mean 9.75619
> Largest Std. Dev. 2.994967
> 75% 12 13
> 90% 12 13 Variance 8.969825
> 95% 13 13 Skewness -1.29205
> 99% 13 13 Kurtosis 3.759778
>
> .
>
>
> --
> Paul E. Johnson email: pauljohn at ku.edu
> Dept. of Political Science http://lark.cc.ku.edu/~pauljohn
> 1541 Lilac Lane, Rm 504
> University of Kansas Office: (785) 864-9086
> Lawrence, Kansas 66044-3177 FAX: (785) 864-5700
>
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle