StatLib---Datasets Archive

If you have an interesting dataset, or collection of data from a book,
please consider submitting the data.

To submit a dataset, please see the submission guidelines, which you can
obtain by sending the message

  send submissions from general

Some of the entries are shar archives.  If you don't know how to deal with
a shar archive, send the message

  send shar from general

for instructions.
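
Both requests above have the same shape, "send <item> from <library>".  As a
minimal sketch only, the Python helper below builds such request lines.  The
function name request_line is hypothetical and not part of StatLib, the
code does not send any mail, and the library name "datasets" used in the
last call is an assumption that this page does not state.

```python
def request_line(item: str, library: str = "general") -> str:
    """Build one request line of the form 'send <item> from <library>'."""
    item = item.strip()
    library = library.strip()
    if not item or not library:
        raise ValueError("item and library must be non-empty")
    # A request is a single line of space-separated words, so neither
    # name may itself contain whitespace.
    if any(ch.isspace() for ch in item + library):
        raise ValueError("item and library must not contain whitespace")
    return f"send {item} from {library}"


if __name__ == "__main__":
    # The two requests quoted verbatim in this page.
    print(request_line("shar"))
    print(request_line("submissions"))
    # ASSUMPTION: the library name 'datasets' is illustrative only.
    print(request_line("biomed.desc", "datasets"))
```

Rejecting names that contain whitespace keeps each request on one
unambiguous line.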

---------------------------------------------------------------------------

The datasets archive currently contains:

	 alr --> This file contains data from Applied Linear Regression,
	    2nd Edition, by Sanford Weisberg, John Wiley, 1985
	    (sandy@umnstat.stat.umn.edu) (36808 bytes)
	 Andrews --> The data for the book DATA by Andrews and Herzberg.
	    Available by FTP, gopher, WWW, but not e-mail.
	 balloon --> A data set consisting of 2001 observations of
	    radiation, taken from a balloon.  The data contain a trend and
	    outliers.  Source: Laurie Davies (mata00@de0hrz1a.BITNET) (43k)
	    [5/Feb/93]
	 baseball --> Data on the salaries of North American Major League
	    Baseball players.  The dataset has performance and salary
	    information on players during the 1986 season.  This was the
	    1988 ASA Graphics Section Poster Session dataset, organised by
	    Lorraine Denby.  There are two files to retrieve:
		 baseball.data --> consists of a shar archive of the data
		    and helpful information including a description of the
		    data, pitcher, hitter, and team statistics (54448
		    bytes)
		 baseball.corr --> A set of differences from the published
		    data set (in Unix diff format)

	 biomed --> I was able to find the old 1982 "biomedical dataset"
	    generated by Larry Cox.  It consists of two groups.  These give
	    observation number, blood id number, age, date, and four blood
	    measurements.  I don't really remember the instructions for
	    analysis, although I seem to recall that the idea was to figure
	    out if some of the blood measurements that were less difficult
	    to obtain were as good at distinguishing carriers from normals
	    as the more difficult measurements.  Unfortunately, I don't
	    remember which measurement is which.  There are two files to
	    retrieve:
		 biomed.desc --> a short description of the data and a
		    reference (1457 bytes)
		 biomed.data --> A shar archive containing the data for
		    carriers and normals.  (7843 bytes)

	 bolts --> Data from an experiment on the effects of machine
	    adjustments on the time to count bolts.  Data appear as the
	    STATS (Issue 10) Challenge.  Submitted by W.  Robert Stephenson
	    (wrstephe@iastate.edu).  [8/Nov/93] (5k)
	 boston --> The Boston house-price data of Harrison, D.  and
	    Rubinfeld, D.L.  'Hedonic prices and the demand for clean air',
	    J.  Environ.  Economics & Management, vol.5, 81-102, 1978.
	    Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...',
	    Wiley, 1980.  (51256 bytes)
	 cars --> This was the 1983 ASA Data Exposition dataset.  The
	    dataset was collected by Ernesto Ramos and David Donoho and
	    dealt with automobiles.  I don't remember the instructions for
	    analysis.  Data on mpg, cylinders, displacement, etc.  (8
	    variables) for 406 different cars.  The dataset includes the
	    names of the cars.  There are two files to retrieve:
		 cars.data --> A shar archive containing files with a
		    description of the cars data, the names of the cars, and
		    the cars data itself.  (33438 bytes)
		 cars.desc --> The original instructions for this
		    exposition.  (6206 bytes)

	 cloud --> These data are those collected in a cloud-seeding
	    experiment in Tasmania.  The rainfalls are period rainfalls in
	    inches.  TE and TW are the east and west target areas
	    respectively, while NC, SC and NWC are the corresponding
	    rainfalls in the north, south and north-west control areas
	    respectively.  S = seeded, U = unseeded.  Submitted by Alan
	    Miller (alan@dmsmelb.mel.dms.CSIRO.AU) [4/May/94] (7 kbytes)
	 chscase --> A collection of the data sets used in the book "A
	    Casebook for a First Course in Statistics and Data Analysis,"
	    by Samprit Chatterjee, Mark S.  Handcock and Jeffrey S.
	    Simonoff, John Wiley and Sons, New York, 1995.  Submitted by
	    Samprit Chatterjee (schatterjee@stern.nyu.edu), Mark Handcock
	    (mhandcock@stern.nyu.edu) and Jeff Simonoff
	    (jsimonoff@stern.nyu.edu).  (321 kbytes)
	 csb --> See the separate csb collection for data from the book
	    "Case Studies in Biometry".
	 detroit --> Data on annual homicides in Detroit, 1961-73, from
	    Gunst & Mason's book `Regression Analysis and its Application',
	    Marcel Dekker.  Contains data on 14 relevant variables
	    collected by J.C.  Fisher.  (alan@dmsmelb.mel.dms.csiro.au)
	    [10/Feb/92] (3357 bytes)
	 diggle --> Data-sets from Diggle, P.J.  (1990).  Time Series : A
	    Biostatistical Introduction.  Oxford University Press.
	    Submitted by Peter Diggle, (maa026@central1.lancaster.ac.uk)
	    (35800 bytes)
	 econdata --> Directions for obtaining a large collection of economic
	    data from the University of Maryland.  [6/Nov/92] (22kb)
	 fienberg --> The data from Fienberg's "The Analysis of
	    Cross-Classified Data", in a form that can easily be read into
	    Glim (or easily read by a human).  [25/Sept/91]
	    (mikem@stat.cmu.edu) (14398 bytes).
	 fraser-river --> Time series of monthly flows for the Fraser River
	    at Hope, B.C.  A.  Ian McLeod  [26/April/93]
	    (10 kbytes)
	 humandevel --> United Nations Development Program, Human
	    Development Index.  A nation's HDI is composed of life
	    expectancy, adult literacy and Gross National Product per
	    capita.  Information on 130 countries plus documentation.
	    (arnold@stat.ncsu.edu (Tim Arnold)) [31/Oct/91] (10031 bytes).
	 irish.ed --> Longitudinal educational transition data set for a
	    sample of 500 Irish students, with 4 independent variables
	    (sex, verbal reasoning score, father's occupation, type of
	    school).  Submitted by Adrian E.  Raftery
	    [20/Dec/93] (13 kbytes)
	 lmpavw --> time series used in "Long-Memory Processes, the Allan
	    Variance and Wavelets" by D.  B.  Percival and P.  Guttorp, a
	    chapter in "Wavelets in Geophysics", edited by E.
	    Foufoula-Georgiou and P.  Kumar, Academic Press, 1994.  This
	    "time" series was collected by Mike Gregg, Applied Physics
	    Laboratory, University of Washington, and is a measurement of
	    vertical shear (in units of 1/seconds) versus depth (in units
	    of meters) in the ocean.  The role of "time" in this series is
	    thus played by depth.  Permission has been obtained to
	    redistribute this data.  Questions concerning this series
	    should be sent to Don Percival (dbp@apl.washington.edu).
	    [6/Feb/94] (62 kbytes)
	 longley --> The infamous Longley data, "An appraisal of
	    least-squares programs from the point of view of the user",
	    JASA, 62(1967) p819-841.  (therneau@mayo.edu) (1301 bytes)
	 newton_hema --> Data on fluctuating proportions of marked cells in
	    marrow from heterozygous Safari cats--from a study of early
	    hematopoiesis.  Michael Newton (newton@stat.wisc.edu)
	    [8/Nov/93] (5k)
	 nonlin --> The data sets from Bates and Watts (1988) "Nonlinear
	    Regression Analysis and Its Applications", Wiley.  They are in
	    S dump format as data frames.  (If you don't know what a data
	    frame is, don't worry.  Just consider them to be lists.  Data
	    frames are described in a book on "Statistical Modelling in S".)
	    (bates@stat.wisc.edu) [7/Feb/90] (19851 bytes)
	 pbc --> The data set found in appendix D of Fleming and
	    Harrington, Counting Processes and Survival Analysis, Wiley,
	    1991.  Submitted by therneau@Mayo.EDU (Terry Therneau),
	    [25/Jul/94] (36 kbytes)
	 places --> Data taken from the Places Rated Almanac, giving the
	    ratings on 9 composite variables of 329 locations.  (From an
	    ASA data exposition, 1986) The data are in one file:
		 places.data --> A shar archive of three files which
		    document the data, present the data itself, and provide
		    a key to the actual places used.  (27720 bytes)

	 pollen --> Synthetic dataset about the geometric features of
	    pollen grains.  There are 3848 observations on 5 variables.
	    From the 1986 ASA Data Exposition dataset, made up by David
	    Coleman of RCA Labs.  The data are in one file:
		 pollen.data --> A shar archive of 9 files.  The first file
		    gives a short description of the data, then there are 8
		    data files, each with 481 observations.  (205954 bytes)

	 pollution --> This is the pollution data so loved by writers of
	    papers on ridge regression.  Source: McDonald, G.C.  and
	    Schwing, R.C.  (1973) 'Instabilities of regression estimates
	    relating air pollution to mortality', Technometrics, vol.15,
	    463-482.  (8540 bytes)
	 profb --> Scores and point spreads for all NFL games in the
	    1989-91 seasons.  Contributed by Robin Lock
	    (rlock@stlawu.bitnet) [15/Sept/92] (27733 bytes)
	 rabe --> This file contains data from Regression Analysis By
	    Example, 2nd Edition, by Samprit Chatterjee and Bertram Price,
	    John Wiley, 1991.  (schatter@stern.nyu.edu) [6/Feb/92] (40309
	    bytes)
	 rir --> This file contains data from Residuals and Influence in
	    Regression, R.  Dennis Cook and Sanford Weisberg, Chapman and
	    Hall, 1982.  (sandy@umnstat.stat.umn.edu) (5206 bytes).
	    [Updated 25/May/93]
	 riverflow --> Datasets mentioned in "Parsimony, Model Adequacy and
	    Periodic Correlation in Time Series Forecasting", ISI Review,
	    A.I.  McLeod (1992, to appear).  Submitted by A.Ian McLeod
	    (aim@stats.uwo.ca).  Time series data.  A shar archive.
	    [22/Jan/92] (294052 bytes).
	 sapa --> time series used in "Spectral Analysis for Physical
	    Applications" by D.  B.  Percival and A.  T.  Walden, Cambridge
	    University Press, 1993.  (dbp@apl.washington.edu)
	    [4/Nov/92] (50788 bytes)
	 saubts --> Two ocean wave time series used in "Spectral Analysis
	    of Univariate and Bivariate Time Series" by D.  B.  Percival,
	    Chapter 11 of "Statistical Methods for Physical Science,"
	    edited by J.  L.  Stanford and S.  B.  Vardeman, Academic
	    Press, 1993.  (dbp@apl.washington.edu) [14/Apr/93] (47 kbytes)
	 ships --> Ship damage data, from "Generalized Linear Models" by
	    McCullagh and Nelder, section 6.3.2, page 137.
	    (therneau@mayo.edu) (1709 bytes)
	 sleep --> Data from which conclusions were drawn in the article
	    "Sleep in Mammals: Ecological and Constitutional Correlates" by
	    Allison, T.  and Cicchetti, D.  (1976), _Science_, November 12,
	    vol.  194, pp.  732-734.  Includes brain and body weight, life
	    span, gestation time, time sleeping, and predation and danger
	    indices for 62 mammals.  Submitted by Roger Johnson
	    (rjohnson@carleton.edu) [27/Jul/94] (8k)
	 socmob --> Social Mobility (US, 1973).  Two four-way 17x17x2x2
	    contingency tables: Father's occupation, Son's occupation
	    (first and current), family structure, race.  Submitted by
	    Timothy J.  Biblarz (biblarz@uscvm.bitnet).  [corrected
	    25/Jan/93]
	 stanford --> Two versions of the Stanford Heart Transplant Data,
	    one from "The Statistical Analysis of Failure Time Data" by
	    Kalbfleisch and Prentice, Appendix I, pages 230-232, the other
	    from the original paper by Crowley and Hu.  (therneau@mayo.edu)
	    (15003 bytes) [Corrected, 8/Mar/93]
	 tumor --> Tumor recurrence data for patients with bladder cancer.
	    Taken from Wei, Lin and Weissfeld, JASA 1989, p 1067.  From:
	    therneau@mayo.edu (Terry Therneau) [23/Mar/93]
	 veteran --> Veteran's Administration Lung Cancer Trial, taken from
	    Kalbfleisch and Prentice, pages 223-224 (therneau@mayo.edu)
	    (8249 bytes)
	 visualizing.data --> This shar file contains 25 data sets from the
	    book Visualizing Data published by Hobart Press
	    (books@hobart.com) and written by William S.  Cleveland
	    (wsc@research.att.com).  There is also a README file so there
	    are 26 files in all.  Each of the 25 files has the data in an
	    ascii table format.  The name of each data file is the name of
	    the data set used in the book.  To find the description of the
	    data set in the book look under the entry "data, name" in the
	    index.  For example, one data set is barley.  To find the
	    description of barley, look in the index under the entry "data,
	    barley".  The S archive of Statlib has a file created by S that
	    contains the data sets in a format that makes it easy to read
	    them into S.  (536 kbytes) [12/Nov/93][17/Oct/94]
	 wind --> daily average wind speeds for 1961-1978 at 12 synoptic
	    meteorological stations in the Republic of Ireland (Haslett and
	    Raftery, Applied Statistics 1989).  There is a LARGE amount of
	    data.  Please be sure you want it before you ask for it!! There
	    are two entries to obtain.
		 wind.desc --> A short description of the data (815 bytes)
		 wind.data --> The data (532494 bytes).

	 witmer --> A shar archive of data from the book Data Analysis: An
	    Introduction (1992), Prentice Hall, by Jeff Witmer.  Submitted by
	    Jeff Witmer (fwitmer@ocvaxa.cc.oberlin.edu) [28/Jun/94] (29
	    kbytes)
	 wseries --> These data tell whether or not the home team won for
	    each game played in all World Series prior to 1994.  The data
	    appear as the STATS Challenge for Issue 11.  Submitted by Jeff
	    Witmer (fwitmer@ocvaxa.cc.oberlin.edu) [20/Mar/94] (3 kbytes)
	 submissions --> Information on how to submit data to this archive.

---------------------------------------------------------------------------

Other Sources

For WWW (Mosaic, Netscape, etc.) users, here is a set of links to other
sources of data.  These sources are not kept on StatLib, and we have no
control over them.  If you find a link is consistently not working, let us
know.

	 EconData --> Several hundred thousand economic time series,
	    produced by the U.S.  Government and distributed by the
	    government in a variety of formats and media, have been put
	    into a standard, highly efficient, easy-to-use form for
	    personal computers.
	 Oceanographic & Earth Science Data --> From Scripps
	    Institution of Oceanography Library
	 The Data Zoo --> California coastal data collection
	 Journal of Statistics Education Information Service --> Also has
	    some data

---------------------------------------------------------------------------

Credit where credit is due

If you use an algorithm, dataset, or other information from StatLib, please
acknowledge both StatLib and the original contributor of the material.

---------------------------------------------------------------------------

Last edited on Tue Nov 8 1994 by Mike Meyer