[R] Datasets for "The Statistical Sleuth"

Sun Oct 25 09:24:40 CET 2009

On Sun, Oct 25, 2009 at 5:48 AM, Yihui Xie <xieyihui at gmail.com> wrote:
> Hi everyone,
>
> I wonder if there already exist any R packages containing all the
> data sets for the book "The Statistical Sleuth"
> (http://www.proaxis.com/~panorama/home.htm; also available at StatLib
> http://lib.stat.cmu.edu/datasets/sleuth).
>
> I'm writing an R package with a friend for one of our stat courses
> where SAS is the main tool being used. As time is limited and half
> of the semester has gone, we want to finish the package ASAP before
> a biased (my personal feeling) impression of R takes hold. It
> will save us some time (especially the time on writing R
> documentation) if anyone has already done the work of packing up all
> the data sets. Thanks a lot!

 You should be able to read the SPSS versions of the data files using
'read.spss' from the "foreign" package. I've just read in all the .sav
files from the 2nd edition data sets with no errors.

 Probably all you then need to do is convert them to data frames and
save them as a .RData file which your students can "attach". Actually
it's turning out quicker for me to do this than to tell you how :)

 Get spss.exe, unzip it to create a load of .sav files, install
the 'foreign' package if you don't have it already, then do this in R:

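library(foreign)
e <- new.env()
# "\\.sav$" matches only names ending in ".sav"; in a bare ".sav"
# the "." is a regex wildcard that matches any character
for (f in list.files(pattern = "\\.sav$")) {
  name <- sub("\\.sav$", "", f)
  # to.data.frame = TRUE makes read.spss return a data frame directly
  data <- read.spss(f, to.data.frame = TRUE)
  assign(name, data, envir = e)
}
# write every object in e to a single .RData file
save(list = ls(e), envir = e, file = "statsleuth.RData")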

Then, to test, start a new R session and do:

 > attach("statsleuth.RData")
 > summary(ex1611)
          COUNTRY      PCTCATH         P2PRATIO       PCTINDIG
 Argentina    : 1   Min.   : 1.20   Min.   : 0.9   Min.   : 13.00
 Australia    : 1   1st Qu.:28.60   1st Qu.: 1.8   1st Qu.: 58.50
 Bolivia      : 1   Median :82.10   Median : 3.8   Median : 76.00
 Brazil       : 1   Mean   :63.74   Mean   : 5.1   Mean   : 70.53
 Chile        : 1   3rd Qu.:95.50   3rd Qu.: 8.3   3rd Qu.: 92.00
 Ecuador      : 1   Max.   :97.60   Max.   :11.9   Max.   :100.00
 (Other)      :15                                  NA's   :  2.00

 > ls("file:statsleuth.RData")
  [1] "case0101" "case0102" "case0201" "case0202" "case0301" "case0302"
  [7] "case0401" "case0402" "case0501" "case0502" "case0601" "case0602"
 [13] "case0701" "case0702" "case0801" "case0802" "case0901" "case0902"
[etc etc etc etc]
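
 If you want to try the attach step without the Sleuth files, here is a
minimal sketch of the same save/attach round trip using a toy data
frame. The object name toy0101, the temporary file and the environment
e are made up for illustration; nothing here touches statsleuth.RData.

```r
# Toy stand-in for one converted data set (hypothetical name and values)
e <- new.env()
assign("toy0101", data.frame(x = 1:3, y = c("a", "b", "c")), envir = e)

# Save everything in e to a temporary .RData file
rdata <- tempfile(fileext = ".RData")
save(list = ls(e), envir = e, file = rdata)

# attach() puts the file on the search path under the name "file:<path>"
attach(rdata)
pos_name <- paste0("file:", rdata)
attached <- ls(pos_name)          # names of the objects loaded from the file
print(attached)
print(summary(get("toy0101", pos = pos_name)))

# Take the entry off the search path again when finished
detach(pos_name, character.only = TRUE)
```

 Attaching keeps the data sets out of the students' workspace, so a
stray rm(list=ls()) does not delete them.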

 My only worry is whether all the data sets convert to data frames
okay, and nothing is lost in the conversion. It's possible that the SPSS
files carry other metadata that gets dropped. I'd
suggest you check all 140 data sets first...
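
 A rough sketch of what such a check could look like: it walks over
every object in an environment and reports whether it is a data frame,
its dimensions and how many NAs it holds. check_datasets() is a
hypothetical helper I am making up here, not part of any package, and
the demo objects are toy data. It only catches structural problems; it
cannot tell you whether SPSS metadata went missing.

```r
# Summarise every object in an environment so odd conversions stand out.
# check_datasets() is a hypothetical helper, not part of any package.
check_datasets <- function(env) {
  nms <- ls(env)
  rows <- lapply(nms, function(nm) {
    obj <- get(nm, envir = env)
    is_df <- is.data.frame(obj)
    data.frame(
      name   = nm,
      is_df  = is_df,
      n_rows = if (is_df) nrow(obj) else NA_integer_,
      n_cols = if (is_df) ncol(obj) else NA_integer_,
      n_na   = if (is_df) sum(is.na(obj)) else NA_integer_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy example: one proper data frame and one object that is not
demo <- new.env()
assign("good", data.frame(a = c(1, NA, 3), b = c("x", "y", "z")), envir = demo)
assign("bad", list(a = 1:2, b = 1:3), envir = demo)
report <- check_datasets(demo)
print(report)

# Real use (assumes statsleuth.RData exists in the working directory):
# e <- new.env(); load("statsleuth.RData", envir = e); check_datasets(e)
```

 Anything with is_df FALSE, zero rows or an unexpectedly large NA count
would be worth opening by hand and comparing against the book.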

Barry