[Rd] input for R-intro (PR#988)

pauljohn@ku.edu pauljohn@ku.edu
Mon, 18 Jun 2001 21:03:10 +0200 (MET DST)


I asked Prof. Brian Ripley if I might submit some text/input for the
R-intro writers, and he said this is the place to do it.

I was on a plane a couple of weeks ago reading R-intro as I planned a
course, and I made these notes, some of which I think will help my
students if you incorporate them.  I've attached to this message a raw
text file I wrote in Emacs, "Rintrochanges.txt".

I have written documents like R-intro before and understand it is
difficult to please everybody, and I thank you for your effort on the
project.

Paul Johnson
Assoc. Professor
Dept. of Political Science
University of Kansas

R-intro comments/enhancements  Part I.

I think something more verbose like this would help under
"Ordered and Unordered factors"

Texts on research design emphasize that variables have
different levels of measurement. Typically, there are three levels.
If a measurement is taken with a precise scale (one on which it is
meaningful to say new_value = a + b*old_value, as with readings of a
thermometer), it is considered an "interval" level variable.  Many
statistical models were originally designed for interval level
variables.  If a measurement is not so precise, but only provides an
ordered categorization of observations, such as "low", "medium", and
"high", it is called an ordinal variable.  If a measurement is
categorical and implies no sense of ordering, such as "European",
"Asian", or "South American", then it is typically called a nominal
variable.  Assigning the values of "1", "2", and "3" to these
observations has no substantive significance, as we might as well
number them "3", "2", "1".

Many statistical models are designed for interval level data, but with
appropriate recoding of input, they can also be made to work with
nominal and ordinal data.  In the "old days" of statistical computing,
users were forced to create a lot of indicator (or "dummy") variables,
thereby converting information from nominal or ordinal measurements
into variables that have only values 0 and 1.  For example, to
represent the information about the home continents of survey
respondents, we might create two dummy variables, one for Asia (coded
1 if respondent is from Asia, 0 otherwise) and one for South America
(likewise 0 and 1).  A respondent for whom we find both of these dummy
variables equal to 0 is sure to be from Europe, so we don't usually
need to create three dummy variables.  Doing so would be redundant.
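
The hand-made indicator variables described above might look like the
following.  This is only a sketch: the data and the variable names
(continent, asia, south_america, europe) are made up for illustration.

```r
# Hypothetical survey data: home continent of five respondents.
continent <- c("European", "Asian", "South American", "Asian", "European")

# Hand-made indicator ("dummy") variables, coded 1 or 0.
asia          <- as.numeric(continent == "Asian")
south_america <- as.numeric(continent == "South American")

# A respondent with 0 on both indicators must be European,
# so a third dummy carries no new information.
europe <- 1 - asia - south_america

data.frame(continent, asia, south_america, europe)
```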

With a modern language like R, much of the drudgery of creating the
indicator variables is handled automatically if the user declares the
variables properly.  In R, and S before it, the term "factor" is used
to refer to a variable that is not measured at the interval level. An
"unordered factor" is a nominal (or categorical) variable, while an
"ordered factor" is the term for an ordinal measurement.  Many
statistical procedures in R have built-in procedures for handling
factors.
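
A short sketch of declaring the two kinds of factor may help here.  The
data are invented, and model.matrix() is used only to show the
indicator columns that R builds by itself under its default settings.

```r
# Hypothetical data, for illustration only.
continent <- c("European", "Asian", "South American", "Asian", "European")
rating    <- c("low", "high", "medium", "low", "high")

# Unordered factor: a nominal variable.
continent_f <- factor(continent,
                      levels = c("European", "Asian", "South American"))

# Ordered factor: an ordinal variable, low < medium < high.
rating_f <- factor(rating, levels = c("low", "medium", "high"),
                   ordered = TRUE)

levels(continent_f)
is.ordered(continent_f)   # FALSE
is.ordered(rating_f)      # TRUE
rating_f < "high"         # comparisons make sense only for ordered factors

# model.matrix() shows the indicator columns R creates from a factor.
# With the default treatment contrasts, the first level ("European")
# is the baseline and gets no column of its own.
model.matrix(~ continent_f)
```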

Beyond handling coding of noninterval variables, factors play a vital
role in many R procedures.  Procedures can be designed to do a chore
for each value of a factor.  Many procedures will in fact insist that
a variable be designated as a factor.
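
As one example of "a chore for each value of a factor", tapply() can
compute a summary separately for each level.  The income figures below
are invented for the sketch.

```r
# Hypothetical data: an income figure and a continent for each respondent.
continent <- factor(c("European", "Asian", "South American",
                      "Asian", "European"))
income <- c(40, 30, 20, 50, 60)

# tapply() applies a function separately for each level of the factor.
means <- tapply(income, continent, mean)
means

# table() counts the observations at each level.
table(continent)
```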

 ------------------


In the section "Named arguments and defaults", I am left confused
about a few things, and this confusion has bothered me for a while.
Can you do more to explain which arguments must be specified, which
are optional, and under what conditions it is necessary to include the
parameter= when calling a function?

1.  When I use a function, and I look in its help page, how do I know
which input variables must be declared and which can be ignored?  This
is quite mysterious.  When invocations leave out some arguments, how
does R "sift through" the ones that are provided to know which is
which?
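
An example along the following lines might answer the question.  As far
as I understand it, R first matches arguments by exact name, then by
unique partial name, and finally by position.  The function show_value
and its arguments are hypothetical, written only for this sketch.

```r
# A hypothetical function: 'x' has no default, so it must be supplied;
# 'digits' and 'verbose' have defaults, so they may be omitted.
show_value <- function(x, digits = 2, verbose = FALSE) {
  out <- round(x, digits)
  if (verbose) cat("rounded to", digits, "digits\n")
  out
}

show_value(3.14159)                  # only the required argument
show_value(3.14159, 3)               # matched by position: digits = 3
show_value(3.14159, verbose = TRUE)  # named, so 'digits' can be skipped
show_value(digits = 4, x = 3.14159)  # named arguments may come in any order
show_value(3.14159, dig = 1)         # a unique prefix of the name also matches

# args() and formals() show which arguments carry defaults.
args(show_value)
names(formals(show_value))
```

In a Usage listing, an argument written as name = value has a default
and can be left out; one written as a bare name usually has to be
supplied.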


For example, consider the page on "update.packages".  

Under Usage, I find:
update.packages(lib.loc = .lib.loc, CRAN = getOption("CRAN"),
                contriburl = contrib.url(CRAN),
                method = "auto", instlib = NULL,
                ask=TRUE, available=NULL, destdir=NULL)

I've been using computers a long time, and I use this function every two months or so.  Between times I forget, come back to this page, and still this one leaves me totally confused.  

After fiddling a while, I find this works on Linux:
>update.packages(lib="/usr/lib/R/library",CRAN="http://lib.stat.cmu.edu/R/CRAN")

but what bothered me the most is the getOption("CRAN") default for CRAN.  I tried many variations on that and never figured out how to make it work.
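
My best guess at how such a default is meant to work is sketched below.
It assumes that getOption("CRAN") just looks up a global setting stored
earlier with options().  The function show_mirror is hypothetical and
stands in for update.packages so that nothing is downloaded.

```r
# options() stores a named global setting; getOption() reads it back.
# The option name "CRAN" is the one from the Usage text quoted above.
old <- options(CRAN = "http://lib.stat.cmu.edu/R/CRAN")
getOption("CRAN")

# A hypothetical function whose default is looked up the same way.
show_mirror <- function(CRAN = getOption("CRAN")) {
  CRAN
}

show_mirror()                              # uses the stored option
show_mirror("http://cran.r-project.org")   # an explicit value overrides it

options(old)  # restore the previous setting
```

If that reading is right, getOption("CRAN") is never typed as the value
of the argument; one either sets the option once or passes a URL.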


Maybe any clarifications could be related to the section "Getting help with functions and features"

In this section, I think it would be nice to have a paragraph about
"how to read a help page."  Some things are not obvious to all new
users.  Some things in help pages are tough to describe... Maybe I'd
say

How to read a help page.

If you find a help page, notice that it is made up of different parts.

1. "Description". These are usually terse, so read every word
   carefully.

2. Look for the "Value" heading.  That tells you what this procedure is
   going to return when you call it. If this procedure doesn't return
   what you need, you should find out immediately.

3. Under "Usage", most help pages list a number of possible commands
   that can be used.  These are not commands to be typed
   literally. Many of them contain abstract references which provide
   information about the nature of the variables that might be
   entered.  

   Here is a simple case.  The coefficients help page has this:

Usage

coef(object, ...)
coefficients(object, ...)

   Clearly, coef and coefficients are going to do the same thing, and you must supply the appropriate object.

4. Under "Arguments," the help page indicates what inputs this
   procedure wants.  The inputs are described line-by-line, and often
   the comments indicate whether parameters are optional or what their
   default value might be.

   Many help pages have the infamous "...", which is something like
   "fill this in appropriately, depending on whatever context you
   find."  In some situations this is easy to understand.  Many help
   pages will explicitly say that "..." refers to any options that
   might be used with a particular kind of object, and in that case
   one must find the help page for that other object. For example, the
   dev.copy help page clearly indicates that "..." refers to any
   options that might be used with the particular device in question.

   Other help pages are not so informative.  Consider the output from ?coefficients:

Arguments

 object
       an object for which the extraction of model coefficients is meaningful.
 ...
       other arguments.

Here it is clear you need to give a statistical object to have
coefficients extracted.  The "..." is a mystery, and most R users will
try to use the coefficients command without worrying about it, and
then backtrack if an error occurs.  And, if an error occurs, there may
be no recourse but to read the source code for this command or ask
about it on the r-help email list.
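
A small example could make the coef/coefficients and "..." points
concrete.  The sketch below assumes the cars data set that ships with
R.  The wrapper my_mean is hypothetical and only shows arguments being
passed through "..." to another function.

```r
# 'cars' is a small data set that ships with R.
fit <- lm(dist ~ speed, data = cars)

# coef() and coefficients() return the same named vector.
coef(fit)
identical(coef(fit), coefficients(fit))   # TRUE

# A hypothetical wrapper: anything matched by "..." is handed on,
# unchanged, to mean().
my_mean <- function(x, ...) {
  mean(x, ...)
}

my_mean(c(1, NA, 3))                # NA, because of the missing value
my_mean(c(1, NA, 3), na.rm = TRUE)  # 2; na.rm travelled through "..."
```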
