[R] Ever see a stata import problem like this?

Thomas Lumley tlumley at u.washington.edu
Wed Sep 22 03:13:55 CEST 2004


On Tue, 21 Sep 2004, Paul Johnson wrote:

> Greetings Everybody:
>
> I generated a 1.2MB dta file based on the general social survey with Stata8 
> for linux. The file can be re-opened with Stata, but when I bring it into R, 
> it says all the values are missing for most of the variables.

You need read.dta("morgen.dta", convert.factors=FALSE)

You have variables with labels for some, but not all, of their values. 
When these are converted to R factors you lose the unlabelled values.  R 
does not have a data type that is sometimes labelled and sometimes 
numeric.
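
To illustrate the effect described above, here is a minimal sketch in base R. The data values, codes, and labels are hypothetical (they are not taken from morgen.dta); the point is only that a factor built from a partial label table turns every unlabelled value into a missing value.

```r
# Hypothetical data: hours worked, where only the special codes carry labels.
hrs <- c(40, 37, 98, 45, 99, -1)

# Hypothetical value-label table (code -> label), as a Stata file might define it.
codes  <- c(-1, 98, 99)
labels <- c("NAP", "DK", "NA")

# Converting to a factor keeps only values that match a labelled code;
# every unlabelled value (here 40, 37, 45) becomes a missing value.
hrs_factor <- factor(hrs, levels = codes, labels = labels)

print(table(hrs_factor, useNA = "ifany"))

# Count of real observations lost in the conversion.
n_lost <- sum(is.na(hrs_factor))
print(n_lost)
```

Note that the level whose label is the string "NA" is an ordinary factor level; it is distinct from the missing values counted under "NA's" in summary().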

When you use convert.factors=FALSE the label information is still read in 
and returned as an attribute of the data frame, so you can set individual 
variables to be factors.
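
A rough sketch of that last step follows. It assumes the label information is stored the way the foreign package documents it: a "label.table" attribute (a list of named code vectors) and a "val.labels" attribute (the label-table name for each column, in column order). To keep the example self-contained, the data frame below is a hypothetical stand-in that mimics that structure rather than a real read.dta() result, and label_variable() is a helper name made up for illustration.

```r
# Hypothetical stand-in for:
#   library(foreign)
#   myDat <- read.dta("morgen.dta", convert.factors = FALSE)
myDat <- data.frame(hrs1 = c(40, 37, 45, 60),
                    income = c(12L, 9L, 12L, 10L))
attr(myDat, "val.labels") <- c("hrs1", "income")
attr(myDat, "label.table") <- list(
  hrs1   = c(NAP = -1L, DK = 98L, "NA" = 99L),       # partial labels only
  income = c("$10000 - 14999" = 9L,
             "$15000 - 19999" = 10L,
             "$25000 OR MORE" = 12L)                 # every value labelled
)

# Hypothetical helper: convert one variable to a factor using its label table.
label_variable <- function(df, var) {
  table_name <- attr(df, "val.labels")[match(var, names(df))]
  if (is.na(table_name) || !nzchar(table_name)) {
    return(df[[var]])                                # no label table: leave as is
  }
  tab <- attr(df, "label.table")[[table_name]]
  factor(df[[var]], levels = tab, labels = names(tab))
}

# Convert only the fully labelled variable; hrs1 stays numeric.
myDat$income <- label_variable(myDat, "income")
print(summary(myDat))
```

Converting only the variables whose values are all labelled keeps the numeric variables (hours, age, education) usable as numbers.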

 	-thomas


>
> This dataset is called "morgen.dta" and I dropped a copy online in case you 
> are interested
>
> http://www.ku.edu/~pauljohn/R/morgen.dta
>
> looks like this to R (I tried various options on the read.dta command):
>
>> myDat <- read.dta("morgen.dta")
>> summary(myDat)
>     CASEID              year            id         hrs1         hrs2
> Min.   :   19721   Min.   :1972   Min.   :   1   NAP :    0   NAP :    0
> 1st Qu.: 1983475   1st Qu.:1978   1st Qu.: 445   DK  :    0   DK  :    0
> Median : 1996808   Median :1987   Median : 905   NA  :    0   NA  :    0
> Mean   : 9963040   Mean   :1986   Mean   : 990   NA's:40933   NA's:40933
> 3rd Qu.:19872187   3rd Qu.:1994   3rd Qu.:1358
> Max.   :20002817   Max.   :2000   Max.   :3247
>
>      prestige      agewed        age          educ        paeduc
> DK,NA,NAP:    0   NAP :    0   DK  :    0   NAP :    0   NAP :    0
> NA's     :40933   DK  :    0   NA  :    0   DK  :    0   DK  :    0
>                   NA  :    0   NA's:40933   NA  :    0   NA  :    0
>                   NA's:40933                NA's:40933   NA's:40933
>
>
>
>  maeduc       speduc                 income
> NAP :    0   NAP :    0   $25000 OR MORE:14525
> DK  :    0   DK  :    0   $10000 - 14999: 5022
> NA  :    0   NA  :    0   $15000 - 19999: 3869
> NA's:40933   NA's:40933   $20000 - 24999: 3664
>                           REFUSED       : 1877
>                           (Other)       : 8523
>                           NA's          : 3453
>>
>
>
> Here's what Stata sees when I load the same thing:
>
> summarize, detail
>
>                 Case identification number
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%       197432          19721
> 5%       199649          19722
> 10%      1974116          19723       Obs               40933
> 25%      1983475          19724       Sum of Wgt.       40933
>
> 50%      1996808                      Mean            9963040
>                        Largest       Std. Dev.       9006352
> 75%     1.99e+07       2.00e+07
> 90%     2.00e+07       2.00e+07       Variance       8.11e+13
> 95%     2.00e+07       2.00e+07       Skewness         .18931
> 99%     2.00e+07       2.00e+07       Kurtosis       1.045409
>
>                GSS YEAR FOR THIS RESPONDENT
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%         1972           1972
> 5%         1973           1972
> 10%         1974           1972       Obs               40933
> 25%         1978           1972       Sum of Wgt.       40933
>
> 50%         1987                      Mean           1986.421
>                        Largest       Std. Dev.       8.61136
> 75%         1994           2000
> 90%         1998           2000       Variance       74.15552
> 95%         2000           2000       Skewness      -.0789223
> 99%         2000           2000       Kurtosis       1.799939
>
>                    RESPONDENT ID NUMBER
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%           18              1
> 5%           89              1
> 10%          178              1       Obs               40933
> 25%          445              1       Sum of Wgt.       40933
>
> 50%          905                      Mean           989.9129
>                        Largest       Std. Dev.      689.0596
> 75%         1358           3244
> 90%         2027           3245       Variance       474803.2
> 95%         2437           3246       Skewness       .8359211
> 99%         2867           3247       Kurtosis       3.311248
>
>              NUMBER OF HOURS WORKED LAST WEEK
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%            6              0
> 5%           15              0
> 10%           21              0       Obs               23279
> 25%           37              0       Sum of Wgt.       23279
>
> 50%           40                      Mean           41.05206
>                        Largest       Std. Dev.      13.95931
> 75%           48             89
> 90%           60             89       Variance       194.8624
> 95%           65             89       Skewness        .195045
> 99%           82             89       Kurtosis       4.448998
>
>             NUMBER OF HOURS USUALLY WORK A WEEK
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%            4              0
> 5%           15              0
> 10%           20              1       Obs                 774
> 25%           38              2       Sum of Wgt.         774
>
> 50%           40                      Mean           39.79199
>                        Largest       Std. Dev.      13.43383
> 75%           45             89
> 90%           55             89       Variance       180.4677
> 95%           60             89       Skewness      -.0002332
> 99%           80             89       Kurtosis       5.009869
>
>           RS OCCUPATIONAL PRESTIGE SCORE  (1970)
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%           14             12
> 5%           17             12
> 10%           20             12       Obs               24267
> 25%           30             12       Sum of Wgt.       24267
>
> 50%           39                      Mean           39.35645
>                        Largest       Std. Dev.      14.03712
> 75%           48             82
> 90%           60             82       Variance       197.0407
> 95%           62             82       Skewness       .2927414
> 99%           76             82       Kurtosis       2.775553
>
>                   AGE WHEN FIRST MARRIED
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%           15             12
> 5%           17             12
> 10%           17             12       Obs               25382
> 25%           19             12       Sum of Wgt.       25382
>
> 50%           21                      Mean           22.09609
>                        Largest       Std. Dev.      4.813944
> 75%           24             63
> 90%           28             68       Variance       23.17405
> 95%           31             73       Skewness       2.002265
> 99%           39             73       Kurtosis       11.28279
>
>                      AGE OF RESPONDENT
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%           19             18
> 5%           21             18
> 10%           24             18       Obs               40790
> 25%           30             18       Sum of Wgt.       40790
>
> 50%           42                      Mean           45.14798
>                        Largest       Std. Dev.      17.53519
> 75%           58             89
> 90%           71             89       Variance       307.4828
> 95%           77             89       Skewness       .4774907
> 99%           86             89       Kurtosis       2.239618
>
>              HIGHEST YEAR OF SCHOOL COMPLETED
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%            3              0
> 5%            7              0
> 10%            8              0       Obs               40806
> 25%           11              0       Sum of Wgt.       40806
>
> 50%           12                      Mean           12.48152
>                        Largest       Std. Dev.      3.176226
> 75%           14             20
> 90%           16             20       Variance       10.08841
> 95%           18             20       Skewness      -.3389303
> 99%           20             20       Kurtosis       3.960311
>
>            HIGHEST YEAR SCHOOL COMPLETED, FATHER
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%            0              0
> 5%            3              0
> 10%            4              0       Obs               29347
> 25%            8              0       Sum of Wgt.       29347
>
> 50%           11                      Mean           10.20994
>                        Largest       Std. Dev.      4.342143
> 75%           12             20
> 90%           16             20       Variance       18.85421
> 95%           17             20       Skewness      -.1628909
> 99%           20             20       Kurtosis       2.826482
>
>            HIGHEST YEAR SCHOOL COMPLETED, MOTHER
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%            0              0
> 5%            3              0
> 10%            6              0       Obs               34151
> 25%            8              0       Sum of Wgt.       34151
>
> 50%           12                      Mean           10.41478
>                        Largest       Std. Dev.      3.709352
> 75%           12             20
> 90%           14             20       Variance       13.75929
> 95%           16             20       Skewness      -.6324499
> 99%           18             20       Kurtosis       3.605715
>
>            HIGHEST YEAR SCHOOL COMPLETED, SPOUSE
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%            4              0
> 5%            7              0
> 10%            8              0       Obs               22780
> 25%           12              0       Sum of Wgt.       22780
>
> 50%           12                      Mean           12.53095
>                        Largest       Std. Dev.      3.103418
> 75%           14             20
> 90%           16             20       Variance       9.631203
> 95%           18             20       Skewness       -.287755
> 99%           20             20       Kurtosis       4.051822
>
>                     TOTAL FAMILY INCOME
> -------------------------------------------------------------
>      Percentiles      Smallest
> 1%            1              1
> 5%            3              1
> 10%            5              1       Obs               37480
> 25%            9              1       Sum of Wgt.       37480
>
> 50%           11                      Mean            9.75619
>                        Largest       Std. Dev.      2.994967
> 75%           12             13
> 90%           12             13       Variance       8.969825
> 95%           13             13       Skewness       -1.29205
> 99%           13             13       Kurtosis       3.759778
>
> .
>
>
> -- 
> Paul E. Johnson                       email: pauljohn at ku.edu
> Dept. of Political Science            http://lark.cc.ku.edu/~pauljohn
> 1541 Lilac Lane, Rm 504
> University of Kansas                  Office: (785) 864-9086
> Lawrence, Kansas 66044-3177           FAX: (785) 864-5700
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



