[R-sig-teaching] I need your thoughts on teaching with R

Mon Mar 30 23:12:17 CEST 2009

On Wed, Mar 11, 2009 at 9:52 PM, Derek Ogle <DOgle at northland.edu> wrote:
> Andrew,
>
> I teach an intro statistics course to "science" students and some "general education" students and a "biometry" course to "natural resources" and a handful of other students at Northland College.  I have been using R in both of these statistics courses, and in my fisheries science course, for four or five years now.  Below are my answers to your questions.  I would be happy to expand on these if you needed me to (though now coming back to re-read I see that I have typed quite a bit).
>
>
>> 1) What are the instructional decisions that a person needs to make if they are going to be teaching statistics using R?
>
> In general, I don't really think that these decisions are unique to R.  No matter the software I believe that an instructor, especially of an introductory class, has to make a decision of whether learning the software is one of the outcomes of the course or not.  When I taught with other software (Minitab) I chose not to have learning the software as an outcome.  However, when I began using R I decided that at least some understanding of the software should be an outcome because I felt that knowing R was adding value to the student.  I believe that this added value was especially important in my upper-level courses so it became important to me to make sure that students in the intro class were gaining some knowledge of the software (R).
>
> Once an instructor chooses to use R, I believe they must decide whether to use one of the GUIs, whether to use an external editor, which "graphics system" (base, ggplot2, lattice) to use, or whether to use package-specific or base functions.
>
>
>> 2) What decisions have you yourself made and what were your reasons?
>
> Of the latter items mentioned above I chose NOT to use a GUI.  I am familiar with RCMDR, for example, but, personally, I think the power of R rests in the command line.  I do use TINN-R as an external editor because I like the ability for students to save their commands and recycle them for future problems.  I have not seen any "cost" to the student of using TINN-R (it is simple to learn).  I chose to use base graphics because of the simplicity (in my mind) of their functions (for doing the basic graphics needed in most intro classes).
>
> I have also written a package of R functions that streamlines some of the base R functions.  For example, I have written a Subset() function that combines the base subset() and drop.levels() so that I don't have to explain to students the subtleties of why subset() does not drop the level from the list of possible levels for a factor variable after subsetting.  I have also written a function that can be used to provide a graphical display of probability calculations on a suite of probability distributions (motivated by a post by Dr. Bates on this list last year).  I did not want to create a large number of special purpose functions so I attempted to judiciously choose functions that simplified complexities or subtleties that I did not want students to be concerned with or that provided specific pedagogical advantages.  <BTW, my package is surely not up to the standards of other package developers but if anyone is interested it is available at www.rforge.net/NCStats.  A newer version using namespace will be up there when my semester is over in April.>  I also use some of the functions in the TeachingDemos package.
>
> Finally, I made the conscious decision of not using the phrases "R programming" or "R coding."  It is my experience that many students do not consider themselves capable of "computer programming."  I explain that the functions are simply replacements for menu'd commands but that they can be saved and reproduced.  At most, I refer to "R scripts" but never "R programs."
>
>
>> 3) How do you teach with R? Do you have sessions on R and other sessions where content is taught? Is the computing fully integrated with the content? Or some combination?
>
> At Northland, I teach two 2-hour sessions a week for the intro class.  Generally, I use some portion of this time for a traditional lecture, some portion to teach "doing statistics with R", and then some time for the students to work independently with R and ask me questions.  The traditional lecture and the "doing statistics with R" sections follow each other closely so I would consider them to be "integrated."  In the "doing statistics with R" section I have found it better to provide students with a handout (easily accomplished by the instructor with Sweave()) rather than demonstrating with the computer or letting them type commands into the computer.  The two main advantages to this are that it is easier to keep students in roughly the same spot (if they are typing things in themselves then invariably someone forgot a comma, or misspelled, etc, and you spend all of your time troubleshooting individual students while others sit there or start surfing the net) and it lets the students see a correct set of commands on which they can take detailed notes (if they are typing in commands themselves they are just typing and not thinking and not taking notes that they can return to).
>
> In the upper-level course, I fully integrate R into the notes and lecture.
>
> You can see what I do, if interested, because all of my class materials ("book", handouts, lecture slides) are available at www.ncfaculty.net/dogle/ (follow the intro stats, biometry, or fisheries science "buttons").
>
>
>> 4) If you have the heterogeneous group of students (some going on to program in R, others just trying to get through, etc.) how do we deal with this? Do we need to have different types of assignments and materials for the different students?
> Again, I don't see this as a specific issue with using R -- i.e., even if we did all calculations by hand there would be heterogeneity amongst those that will continue with other statistics courses and those "just trying to get through."
>
>
>> 5) A few more comments.
> When I first started teaching with R I was just learning the program myself.  I was terrible at teaching with it (and the students hated it) but I believed that it was the correct software to use.  Now, five years later, I am a much better teacher of statistics and R, and most of my students either "enjoy" R (I have seen students become more accepting of typing in a command line rather than a GUI -- I believe that this is due to the amount of texting and chatting that they do relative to students from just a few years ago) or are, at least, accepting of the idea of "R".  I even more firmly believe that this is the proper software for students to learn.  Be firm in your commitment to R if you choose to teach with it but also be patient with yourself and your students during the first few years.
>
> I do spend some time trying to teach students the "lexicon of R" as this makes communication about specific functions easier.  For example, my students will know about constructor and extractor functions, named and positional arguments, objects, assignment operators, vectors, data.frames, etc.  This has been a "cost" of using R in the sense that it does take time and, thus, "something else" had to go from my curriculum.
>
> I also think that R is especially beneficial for my biometry course which focuses on regression (simple, multiple, and indicator variable), anova (one- and two-way), and logistic regression because the vast majority of these topics use a common set of functions -- lm(), anova(), summary(), confint(), etc.  In addition, I have added a few functions (in my NCStats package) to make fitted line plots, residual plots, and to compare all slopes.  Thus, students can learn a vast number of topics with very few commands in R.  The same can be said in the intro class for t.test() (one-sample, two-sample, matched-pairs) and chisq.test() (goodness-of-fit and general chi-square).  This efficiency is very convenient and powerful.
>
> Finally, I have found R to be useful because students "like" free software, open-source concepts (at least at my environmental liberal arts college), and being on the "cutting edge."  The more students like about your class, the more likely they are to learn.

I agree with what Derek, my neighbor to the north, has said.

I teach introductory engineering statistics using R and have done so
for several years, although I am never completely satisfied with how R
blends with the text in such a course.  I have tried using a standard
introductory engineering text, specifically Devore's "Probability and
Statistics for Engineering and the Sciences", supplemented with
material on R (see the Devore6 package on CRAN which John Verzani
updated for the 7th edition to Devore7), Peter Dalgaard's
"Introductory Statistics with R" and now Cohen and Cohen's "Statistics
and Data with R".  I have also looked at "Probability and Statistics
with R" by Ugarte et al.

With the exception of Peter's book I found myself fighting the text.
That is, I found myself saying "the text presents this material this
way but it is unnecessary and confusing.  Do things this other way."

In the case of Peter's book I could agree with his presentation but
the book is clearly oriented toward biostatistics and has little
coverage of probability.  It came about as a supplement to another
text used in a course and reads like that so it has to be supplemented
extensively, especially if your audience is not from medical fields.

I would dearly love to see an approach to teaching statistics that
takes advantage of the graphical and computational capabilities of R
to remove redundant topics from the typical introductory course.
Sadly the last two texts I list (Cohen and Cohen, 2008; Ugarte et al.,
2008) do exactly the opposite.  Instead of using R to simplify an
approach to statistics they complicate an introductory course by
adding page after page of confusing R code.

What do I mean by simplify?  There are many topics in an introductory
statistics course that are ingrained in the curriculum but really are
there for the sake of approximation or computational simplification.
How many introductory texts still describe how to approximate a
"difficult" distribution by a "simpler" distribution (hypergeometric
by binomial, binomial by Poisson or Gaussian, etc.)?  When you can
calculate the exact probability why do you want to waste time teaching
an approximation and rules like "when np > 5 ..."?  Even a basic
graphical presentation, the histogram, is outmoded.  The purpose of
the histogram is to give us a picture of the density.  Why not use a
density plot for this?  There is a great advantage in that you can
easily overlay density plots from different groups, not to mention the
fact that it shows a smooth approximation to the density.  In the past
we used histograms because it was comparatively simple to choose bins
and count the observations in the bins then produce a bar chart.  We
can do better than that now.
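
As a minimal sketch of the point about approximations (the numbers
below are made up for illustration and are not taken from any of the
texts mentioned), the exact probabilities are one function call away,
so they can be put side by side with the traditional approximations:

```r
# Hypothetical example: X ~ Binomial(20, 0.3); find P(X <= 4).
n <- 20
p <- 0.3

# Exact binomial probability
exact <- pbinom(4, size = n, prob = p)

# Normal approximation with a continuity correction
approx <- pnorm(4.5, mean = n * p, sd = sqrt(n * p * (1 - p)))

print(c(exact = exact, normal_approx = approx))

# Hypothetical example: draw 5 items without replacement from a lot of
# 10 defective and 40 good items; find P(at most 1 defective).
exact_h  <- phyper(1, m = 10, n = 40, k = 5)      # exact hypergeometric
approx_h <- pbinom(1, size = 5, prob = 10 / 50)   # binomial approximation

print(c(exact = exact_h, binomial_approx = approx_h))
```

The approximations land close to the exact values here, but the exact
call is no harder to type, which is the argument for dropping the
rules of thumb.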

Think carefully about the graphics.  Deepayan Sarkar (lattice) and
Hadley Wickham (ggplot2) have provided powerful techniques for
exploring data.  Students should benefit from that if they can do so
without needing to learn many, many details of the language.
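
The earlier remark about overlaying density plots from different
groups might look like the following in lattice.  This is only a
sketch; the built-in iris data set is used as a stand-in for class
data, and the axis label is my own choice.

```r
library(lattice)  # recommended package shipped with standard R

# Overlay kernel density estimates of sepal length, one curve per
# species, using the groups argument.
p <- densityplot(~ Sepal.Length, data = iris, groups = Species,
                 plot.points = FALSE,
                 auto.key = list(columns = 3),
                 xlab = "Sepal length (cm)")

print(p)  # lattice plots are objects; print() draws them
```

A single call gives the overlaid curves and a legend, with no need to
choose bins.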

When teaching the principles of hypothesis testing I describe a
p-value as "the probability of seeing the data that we did or
something more unusual when the null hypothesis (usually meaning "no
change") is true".  The closer that probability is to "impossible",
the stronger the evidence against the null hypothesis in favor of the
alternative.  The point is that we should go directly to the p-value.
All the confusing material about picking a level and calculating the
rejection region is there because we couldn't calculate that
probability when I took an introductory course more than 40 years ago.
All we had then were slide rules, pencil and paper, and a few tables
in a book.  We can do better than that now.
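
A small sketch of going directly to the p-value, assuming a
one-sample t test on the built-in sleep data (my choice of example,
with the null hypothesis of "no change" taken as a mean of zero):

```r
# Extra hours of sleep for the 10 subjects given drug 1
x <- sleep$extra[sleep$group == 1]

# H0: mean = 0 ("no change"), two-sided alternative
tt <- t.test(x, mu = 0)
print(tt$p.value)

# The same number from the definition: the probability, under H0, of
# a t statistic at least as far from zero as the one observed.
tstat    <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))
p_direct <- 2 * pt(-abs(tstat), df = length(x) - 1)
print(p_direct)
```

No level is picked and no rejection region is looked up in a table;
the probability itself is reported.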

Do we need to describe computational formulas in a text book?  It
turns out that just about every formula in an introductory text,
beyond the calculation of the sample mean, is not really the way that
the calculation is done.  Most of us know that the "short cut" formula
for the sample variance has bad numerical properties and a few might
know that regression coefficients are not really evaluated by
inverting X'X.  Why teach a formula that is only good for a simplified
situation, like a simple linear regression model?  Why not say that we
minimize the residual sum of squares and leave it at that?  Pay more
attention to model building and examining residuals.
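
Both numerical points can be illustrated with a short sketch.  The
data in the first part are made up to have a large mean and a small
spread; the second part uses the built-in cars data and a QR
decomposition, which is the approach lm() takes internally.

```r
# Three observations whose true sample variance is 1
x <- c(1e9 + 1, 1e9 + 2, 1e9 + 3)
n <- length(x)

# Textbook "short cut" formula: subtracts two huge, nearly equal sums
shortcut <- (sum(x^2) - n * mean(x)^2) / (n - 1)

# var() works with deviations from the mean instead
builtin <- var(x)

print(c(shortcut = shortcut, builtin = builtin))

# Least squares coefficients from a QR decomposition of the model
# matrix, without forming or inverting X'X
X <- cbind(1, cars$speed)
y <- cars$dist
beta_qr <- qr.coef(qr(X), y)

# Compare with lm()
beta_lm <- coef(lm(dist ~ speed, data = cars))
print(rbind(qr = beta_qr, lm = unname(beta_lm)))
```

The short cut formula loses the answer to rounding error on data like
these, while the QR route reproduces what lm() reports.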

In teaching I think it is important to strive for simplicity and
consistency in the use of R.  Keep the R code as concise as possible.

I prefer to teach lattice graphics because I think the graphics are
informative and because all the lattice functions can be called with a
formula/data pair of arguments, just as t.test, aov, lm, glm, nls,
etc. can be called with formula/data.  I use Sweave and the beamer
LaTeX class to generate the slides for my classes so that I can
extract the R code and make that available on the course web site.
The slides and class presentations describe the graphics calls
succinctly, if at all, but the detailed code is available for
examination if the students want to delve deeper.
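
To illustrate the consistency of the formula/data interface, here is
a sketch in which the same formula and data arguments drive a plot, a
test and two model fits.  The built-in sleep data set is my choice of
example, not one taken from the course slides.

```r
library(lattice)  # recommended package shipped with standard R

# One formula, one data argument, four different functions
p  <- bwplot(extra ~ group, data = sleep)   # box-and-whisker plot
tt <- t.test(extra ~ group, data = sleep)   # two-sample t test
fm <- lm(extra ~ group, data = sleep)       # linear model
av <- aov(extra ~ group, data = sleep)      # analysis of variance

print(p)
print(tt$p.value)
print(coef(fm))
print(summary(av))
```

Students only need to learn the pattern "response ~ grouping, data =
..." once.  For the slides themselves, Stangle() in the utils package
is the companion to Sweave() that extracts the code chunks from a
Sweave source file.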

In short, the worst way to use R in an introductory course is to teach
the same-old-same-old material augmented with page after page of
confusing R code.  Try to use the power of the computer and the
software to aid insight into data and to simplify the ideas of
statistics.

I have over the years produced slides for classes based first on
Devore's books then on Peter's book and now on the Cohen and Cohen
book.  I am willing to make these available, including the source
code, so others can borrow code or presentation approaches if they
wish.  I am not familiar with open documentation licenses like
Creative Commons.  If it would help to stimulate discussion I will
make them available without copyright.  I would be particularly
interested in corresponding with potential text book authors on some
of the techniques that I think can be used to simplify presentation of
R code and graphics.  I don't have plans to embark on writing a text
myself.