StatLib---Datasets Archive

If you have an interesting dataset, or collection of data from a book, please consider submitting the data. To submit a dataset, please see the submission guidelines, available via the message "send submissions from general".

Some of the entries are shar archives. If you don't know how to deal with a shar archive, send the message "send shar from general" for instructions.

---------------------------------------------------------------------------

The datasets archive currently contains:

alr --> This file contains data from Applied Linear Regression, 2nd Edition, by Sanford Weisberg, John Wiley, 1985 (sandy@umnstat.stat.umn.edu) (36808 bytes)

Andrews --> Data for the book DATA by Andrews and Herzberg. Available by FTP, gopher, WWW, but not e-mail.

balloon --> A data set consisting of 2001 observations of radiation, taken from a balloon. The data contain a trend and outliers. Source: Laurie Davies (mata00@de0hrz1a.BITNET) (43k) [5/Feb/93]

baseball --> Data on the salaries of North American Major League Baseball players. The dataset has performance and salary information on players during the 1986 season. This was the 1988 ASA Graphics Section Poster Session dataset, organised by Lorraine Denby. There are two files to retrieve:
    baseball.data --> A shar archive of the data and helpful information including a description of the data, pitcher, hitter, and team statistics (54448 bytes)
    baseball.corr --> A set of differences from the published data set (in Unix diff format)

biomed --> I was able to find the old 1982 "biomedical dataset" generated by Larry Cox. It consists of two groups. These give observation number, blood id number, age, date, and four blood measurements. I don't really remember the instructions for analysis, although I seem to recall that the idea was to figure out if some of the blood measurements that were less difficult to obtain were as good at distinguishing carriers from normals as the more difficult measurements.
Unfortunately, I don't remember which measurement is which. There are two files to retrieve:
    biomed.desc --> A short description of the data and a reference (1457 bytes)
    biomed.data --> A shar archive containing the data for carriers and normals. (7843 bytes)

bolts --> Data from an experiment on the effects of machine adjustments on the time to count bolts. Data appear as the STATS (Issue 10) Challenge. Submitted by W. Robert Stephenson (wrstephe@iastate.edu). [8/Nov/93] (5k)

boston --> The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. (51256 bytes)

cars --> This was the 1983 ASA Data Exposition dataset. The dataset was collected by Ernesto Ramos and David Donoho and dealt with automobiles. I don't remember the instructions for analysis. Data on mpg, cylinders, displacement, etc. (8 variables) for 406 different cars. The dataset includes the names of the cars. The data are in one file:
    cars.data --> A shar archive containing files with a description of the cars data, the names of the cars, and the cars data itself. (33438 bytes)
    cars.desc --> The original instructions for this exposition. (6206 bytes)

cloud --> These data are those collected in a cloud-seeding experiment in Tasmania. The rainfalls are period rainfalls in inches. TE and TW are the east and west target areas respectively, while NC, SC and NWC are the corresponding rainfalls in the north, south and north-west control areas respectively. S = seeded, U = unseeded. Submitted by Alan Miller (alan@dmsmelb.mel.dms.CSIRO.AU) [4/May/94] (7 kbytes)

chscase --> A collection of the data sets used in the book "A Casebook for a First Course in Statistics and Data Analysis," by Samprit Chatterjee, Mark S. Handcock and Jeffrey S. Simonoff, John Wiley and Sons, New York, 1995.
Submitted by Samprit Chatterjee (schatterjee@stern.nyu.edu), Mark Handcock (mhandcock@stern.nyu.edu) and Jeff Simonoff (jsimonoff@stern.nyu.edu). (321 kbytes)

csb --> See the separate csb collection for data from the book "Case Studies in Biometry".

detroit --> Data on annual homicides in Detroit, 1961-73, from Gunst & Mason's book `Regression Analysis and its Application', Marcel Dekker. Contains data on 14 relevant variables collected by J.C. Fisher. (alan@dmsmelb.mel.dms.csiro.au) [10/Feb/92] (3357 bytes)

diggle --> Data sets from Diggle, P.J. (1990). Time Series: A Biostatistical Introduction. Oxford University Press. Submitted by Peter Diggle (maa026@central1.lancaster.ac.uk) (35800 bytes)

econdata --> Directions for obtaining a large collection of economic data from the University of Maryland. [6/Nov/92] (22kb)

fienberg --> The data from Fienberg's "The Analysis of Cross-Classified Data", in a form that can easily be read into Glim (or easily read by a human). [25/Sept/91] (mikem@stat.cmu.edu) (14398 bytes).

fraser-river --> Time series of monthly flows for the Fraser River at Hope, B.C. A. Ian McLeod [26/April/93] (10 kbytes)

humandevel --> United Nations Development Program, Human Development Index. A nation's HDI is composed of life expectancy, adult literacy and Gross National Product per capita. Information on 130 countries plus documentation. (arnold@stat.ncsu.edu (Tim Arnold)) [31/Oct/91] (10031 bytes).

irish.ed --> Longitudinal educational transition data set for a sample of 500 Irish students, with 4 independent variables (sex, verbal reasoning score, father's occupation, type of school). Submitted by Adrian E. Raftery, [20/Dec/93] (13 kbytes)

lmpavw --> Time series used in "Long-Memory Processes, the Allan Variance and Wavelets" by D. B. Percival and P. Guttorp, a chapter in "Wavelets in Geophysics", edited by E. Foufoula-Georgiou and P.
Kumar, Academic Press, 1994. This "time" series was collected by Mike Gregg, Applied Physics Laboratory, University of Washington, and is a measurement of vertical shear (in units of 1/seconds) versus depth (in units of meters) in the ocean. The role of "time" in this series is thus played by depth. Permission has been obtained to redistribute this data. Questions concerning this series should be sent to Don Percival (dbp@apl.washington.edu). [6/Feb/94] (62 kbytes)

longley --> The infamous Longley data, "An appraisal of least-squares programs from the point of view of the user", JASA, 62(1967) p819-841. (therneau@mayo.edu) (1301 bytes)

newton_hema --> Data on fluctuating proportions of marked cells in marrow from heterozygous Safari cats--from a study of early hematopoiesis. Michael Newton (newton@stat.wisc.edu) [8/Nov/93] (5k)

nonlin --> The data sets from Bates and Watts (1988) "Nonlinear Regression Analysis and Its Applications", Wiley. They are in S dump format as data frames. (If you don't know what a data frame is, don't worry. Just consider them to be lists. Data frames are described in a book on "Statistical Modelling in S".) (bates@stat.wisc.edu) [7/Feb/90] (19851 bytes)

pbc --> The data set found in appendix D of Fleming and Harrington, Counting Processes and Survival Analysis, Wiley, 1991. Submitted by therneau@Mayo.EDU (Terry Therneau), [25/Jul/94] (36 kbytes)

places --> Data taken from the Places Rated Almanac, giving the ratings on 9 composite variables of 329 locations. (From an ASA data exposition, 1986) The data are in one file:
    places.data --> A shar archive of three files which document the data, present the data itself, and provide a key to the actual places used. (27720 bytes)

pollen --> Synthetic dataset about the geometric features of pollen grains. There are 3848 observations on 5 variables. From the 1986 ASA Data Exposition dataset, made up by David Coleman of RCA Labs. The data are in one file:
    pollen.data --> A shar archive of 9 files.
    The first file gives a short description of the data, then there are 8 data files, each with 481 observations. (205954 bytes)

pollution --> This is the pollution data so loved by writers of papers on ridge regression. Source: McDonald, G.C. and Schwing, R.C. (1973) 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, vol.15, 463-482. (8540 bytes)

profb --> Scores and point spreads for all NFL games in the 1989-91 seasons. Contributed by Robin Lock (rlock@stlawu.bitnet) [15/Sept/92] (27733 bytes)

rabe --> This file contains data from Regression Analysis By Example, 2nd Edition, by Samprit Chatterjee and Bertram Price, John Wiley, 1991. (schatter@stern.nyu.edu) [6/Feb/92] (40309 bytes)

rir --> This file contains data from Residuals and Influence in Regression, R. Dennis Cook and Sanford Weisberg, Chapman and Hall, 1982. (sandy@umnstat.stat.umn.edu) (5206 bytes). [Updated 25/May/93]

riverflow --> Datasets mentioned in "Parsimony, Model Adequacy and Periodic Correlation in Time Series Forecasting", ISI Review, A.I. McLeod (1992, to appear). Submitted by A. Ian McLeod (aim@stats.uwo.ca). Time series data. A shar archive. [22/Jan/92] (294052 bytes).

sapa --> Time series used in "Spectral Analysis for Physical Applications" by D. B. Percival and A. T. Walden, Cambridge University Press, 1993. (dbp@apl.washington.edu) [4/Nov/92] (50788 bytes)

saubts --> Two ocean wave time series used in "Spectral Analysis of Univariate and Bivariate Time Series" by D. B. Percival, Chapter 11 of "Statistical Methods for Physical Science," edited by J. L. Stanford and S. B. Vardeman, Academic Press, 1993. (dbp@apl.washington.edu) [14/Apr/93] (47 kbytes)

ships --> Ship damage data, from "Generalized Linear Models" by McCullagh and Nelder, section 6.3.2, page 137. (therneau@mayo.edu) (1709 bytes)

sleep --> Data from which conclusions were drawn in the article "Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D.
(1976), _Science_, November 12, vol. 194, pp. 732-734. Includes brain and body weight, life span, gestation time, time sleeping, and predation and danger indices for 62 mammals. Submitted by Roger Johnson (rjohnson@carleton.edu) [27/Jul/94] (8k)

socmob --> Social Mobility (US, 1973). Two four-way 17x17x2x2 contingency tables: Father's occupation, Son's occupation (first and current), family structure, race. Submitted by Timothy J. Biblarz (biblarz@uscvm.bitnet). [corrected 25/Jan/93]

stanford --> Two versions of the Stanford Heart Transplant Data, one from "The Statistical Analysis of Failure Time Data" by Kalbfleisch and Prentice, Appendix I, pages 230-232, the other from the original paper by Crowley and Hu. (therneau@mayo.edu) (15003 bytes) [Corrected, 8/Mar/93]

tumor --> Tumor recurrence data for patients with bladder cancer. Taken from Wei, Lin and Weissfeld, JASA 1989, p 1067. From: therneau@mayo.edu (Terry Therneau) [23/Mar/93]

veteran --> Veteran's Administration Lung Cancer Trial. Taken from Kalbfleisch and Prentice, pages 223-224 (therneau@mayo.edu) (8249 bytes)

visualizing.data --> This shar file contains 25 data sets from the book Visualizing Data published by Hobart Press (books@hobart.com) and written by William S. Cleveland (wsc@research.att.com). There is also a README file so there are 26 files in all. Each of the 25 files has the data in an ascii table format. The name of each data file is the name of the data set used in the book. To find the description of the data set in the book look under the entry "data, name" in the index. For example, one data set is barley. To find the description of barley, look in the index under the entry "data, barley". The S archive of StatLib has a file created by S that contains the data sets in a format that makes it easy to read them into S.
(536 kbytes) [12/Nov/93][17/Oct/94]

wind --> Daily average wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland (Haslett and Raftery, Applied Statistics 1989). There is a LARGE amount of data. Please be sure you want it before you ask for it!! There are two entries to obtain:
    wind.desc --> A short description of the data (815 bytes)
    wind.data --> The data (532494 bytes).

witmer --> A shar archive of data from the book Data Analysis: An Introduction (1992), Prentice Hall, by Jeff Witmer. Submitted by Jeff Witmer (fwitmer@ocvaxa.cc.oberlin.edu) [28/Jun/94] (29 kbytes)

wseries --> These data tell whether or not the home team won for each game played in all World Series prior to 1994. The data appear as the STATS Challenge for Issue 11. Submitted by Jeff Witmer (fwitmer@ocvaxa.cc.oberlin.edu) [20/Mar/94] (3 kbytes)

submissions --> Information on how to submit data to this archive.

---------------------------------------------------------------------------

Other Sources

For WWW (Mosaic, Netscape, etc.) users, here is a set of links to other sources of data. These sources are not kept on StatLib, and we have no control over them. If you find a link is consistently not working, let us know.

EconData --> Several hundred thousand economic time series, produced by the U.S. Government and distributed by the government in a variety of formats and media, have been put into a standard, highly efficient, easy-to-use form for personal computers.

Oceanographic & Earth Science Data --> From Scripps Institution of Oceanography Library

The Data Zoo --> California coastal data collection

Journal of Statistics Education Information Service --> Also has some data

---------------------------------------------------------------------------

Credit where credit is due

If you use an algorithm, dataset, or other information from StatLib, please acknowledge both StatLib and the original contributor of the material.

---------------------------------------------------------------------------

Last edited on Tue Nov 8 1994 by Mike Meyer