[R] A file with extension .sdb in a codebook section of a large database from a survey?

Marc Schwartz marc_schwartz at me.com
Thu Mar 25 22:38:52 CET 2010


On Mar 25, 2010, at 3:54 PM, Douglas Bates wrote:

> The TIMSS2007 database http://timss.bc.edu/TIMSS2007/idb_ug.html seems
> to provide "both kinds" of universal data formats - either SPSS saved
> data sets or SAS saved data sets.  (Yes, I am being sarcastic.)
> These, of course, are accompanied by massive codebooks explaining the
> nature of each of the fields in the data sets.  The T07_Codebooks.zip
> file available at that site contains .pdf files and .sdb files, which
> seem to contain the information from the codebooks in some kind of
> binary format.  Does anyone know where that format is defined?  I
> imagine I could reverse-engineer it but would prefer not to do so.
> 
> I would like to use part of this dataset as an example of a very large
> hierarchically structured data set for analysis in lme4.


Doug,

According to the User Guide (bottom of page 110), they are "standard Dbase" files.  I tried reading one of them as-is with read.dbf() from the 'foreign' package, but that did not work.  However, if you rename the extension from .sdb to .dbf, the file can then be read with read.dbf():

# rename ACGTMSM4.sdb to ACGTMSM4.dbf

> str(read.dbf("ACGTMSM4.dbf"))
'data.frame':	116 obs. of  28 variables:
 $ FIELD_NAME: Factor w/ 116 levels "AC4GAPAD","AC4GAPCH",..: 104 108 73 26 23 51 50 44 25 41 ...
 $ FIELD_TYPE: Factor w/ 2 levels "C","N": 2 2 2 2 1 1 1 1 2 2 ...
 $ FIELD_LEN : int  5 4 5 5 1 1 1 1 3 2 ...
 $ FIELD_DEC : int  0 0 0 0 0 0 0 0 0 0 ...
 $ FIELD_LABL: Factor w/ 116 levels "COUNTRY ID","EXPLICIT STRATUM CODE",..: 1 98 74 75 76 71 70 48 47 72 ...
 $ QUEST_LOC : Factor w/ 106 levels "COUNTRY","DATE",..: 1 8 9 10 11 12 13 14 15 16 ...
 $ MISSING   : Factor w/ 6 levels "9","99","999",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ NOTAPPL   : Factor w/ 6 levels "8","98","998",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ DEFAULT   : Factor w/ 4 levels "7","97","997",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ FIELD_VALI: Factor w/ 105 levels ".T.","(AC4GAPAD>=0.AND.AC4GAPAD<=97).OR.AC4GAPAD=999.OR.AC4GAPAD=998",..: 1 14 13 10 30 54 53 47 9 11 ...
 $ FIELD_CODE: Factor w/ 41 levels "0 TO 10 PERCENT:1;11 TO 25 PERCENT:2;26 TO 50 PERCENT:3;MORE THAN 50 PERCENT:4;omitted:9;not admin.:8;",..: 21 22 14 11 28 1 1 29 10 12 ...
 $ FIELD_EDIT: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ FIELD_CARR: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ORDER_SCRN: int  1 2 3 4 5 6 7 8 9 10 ...
 $ ORDER_FILE: int  1 2 3 4 5 6 7 8 9 10 ...
 $ COMMENT1  : Factor w/ 4 levels "Released in TIMSS 2003 as acdgpsc",..: NA NA NA NA NA NA NA NA NA NA ...
 $ MEAS_CLASS: Factor w/ 6 levels "B","BD","D","DERI",..: 6 6 1 1 1 1 1 1 1 1 ...
 $ IDBOOK    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ FMT       : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ DUMMY     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ VALID_VAL : Factor w/ 7 levels ".T.","1;2;","1;2;3;",..: 1 NA NA NA 6 4 4 4 NA NA ...
 $ MIN_MAX   : Factor w/ 8 levels "0;11000","0;1200",..: NA 6 1 2 NA NA NA NA 7 8 ...
 $ FILTER_VAR: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ FILTER_CND: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ CONFIRMED : logi  NA NA NA NA NA NA ...
 $ SASPG1    : Factor w/ 6 levels "B","BD","DPC",..: 5 5 1 1 1 1 1 1 1 1 ...
 $ SASPG2    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ SASPG3    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "data_types")= chr  "C" "C" "N" "N" ...
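
If you would rather not rename the files by hand, the same step can be scripted. The following is only a sketch: read_sdb() is a hypothetical helper name, and the small demo file is made up so that the code runs without the TIMSS download. It copies the .sdb file to a temporary file with a .dbf extension and passes that to read.dbf().

```r
library(foreign)  # provides read.dbf() and write.dbf()

# Hypothetical helper: copy an .sdb file to a temporary .dbf name and read it.
# Extra arguments (e.g. as.is = TRUE) are passed on to read.dbf().
read_sdb <- function(path, ...) {
  tmp <- tempfile(fileext = ".dbf")
  on.exit(unlink(tmp))
  if (!file.copy(path, tmp)) stop("could not copy ", path)
  read.dbf(tmp, ...)
}

# Made-up stand-in for a codebook file (values are illustrative only)
demo <- data.frame(FIELD_NAME = c("AC4GAPAD", "AC4GAPCH"),
                   FIELD_LEN  = c(3L, 3L),
                   stringsAsFactors = FALSE)
dbf_path <- tempfile(fileext = ".dbf")
write.dbf(demo, dbf_path)
sdb_path <- sub("\\.dbf$", ".sdb", dbf_path)
file.rename(dbf_path, sdb_path)

# as.is = TRUE keeps character fields as character rather than factor
codebook <- read_sdb(sdb_path, as.is = TRUE)
str(codebook)
```

With the real files, something like read_sdb("ACGTMSM4.sdb", as.is = TRUE) should be the equivalent call, and the original .sdb file is left untouched.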


However, there were warnings:

> warnings()
Warning messages:
1: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
2: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
3: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
...
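
My understanding, which I have not verified against the source, is that read.dbf() stores NA wherever it finds such an unexpected value in a logical field; the all-NA CONFIRMED column in the str() output would be consistent with that. A rough way to see which logical columns are affected is to compute the share of NA values in each. The data frame below is a made-up imitation of a few columns, not the real file.

```r
# Toy data frame imitating a few columns of the codebook (values are made up)
cb <- data.frame(FIELD_EDIT = c(TRUE, TRUE, TRUE),
                 FIELD_CARR = c(FALSE, FALSE, FALSE),
                 CONFIRMED  = c(NA, NA, NA),
                 ORDER_SCRN = 1:3)

# Restrict to logical columns, then compute the proportion of NA in each
is_lgl   <- vapply(cb, is.logical, logical(1))
na_share <- colMeans(is.na(cb[is_lgl]))

# Logical columns containing at least one NA
na_share[na_share > 0]
```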


The content of the above data frame does seem to correspond to the PDF file content.
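
The FIELD_CODE entries look like "label:code;" pairs, so they could be split into a lookup table of value labels. This is a sketch only: parse_field_code() is a hypothetical helper, and it assumes every entry follows the pattern of the one level visible in the str() output, with no semicolons inside labels.

```r
# Hypothetical helper: split a "LABEL:code;LABEL:code;" string into a table.
parse_field_code <- function(x) {
  pairs <- strsplit(x, ";", fixed = TRUE)[[1]]
  pairs <- pairs[nzchar(pairs)]             # drop any empty pieces
  data.frame(code  = sub("^.*:", "", pairs),      # text after the last colon
             label = sub(":[^:]*$", "", pairs),   # text before the last colon
             stringsAsFactors = FALSE)
}

# The one FIELD_CODE level shown in full in the str() output
field_code <- paste0("0 TO 10 PERCENT:1;11 TO 25 PERCENT:2;",
                     "26 TO 50 PERCENT:3;MORE THAN 50 PERCENT:4;",
                     "omitted:9;not admin.:8;")
codes <- parse_field_code(field_code)
codes
```

The codes are kept as character strings here; whether they should be converted to numbers depends on the FIELD_TYPE of the variable.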

HTH,

Marc Schwartz


