[R] Regression lines for differently-sized groups on the same plot

Wed Jul 20 01:53:23 CEST 2005

Hi there,
  I've looked through the very helpful advice about adding fitted lines to 
plots in the r-help archive, and can't find a post where someone has offered 
a solution for my specific problem.  I need to plot logistic regression fits 
from three differently-sized data subsets on a plot of the entire dataset.  
A description and code are below:
  I have an unbalanced dataset consisting of three different species (hem, 
yb, and sm), with unequal numbers of wood pieces in each species group.  I 
am trying to generate a plot that will show the size of the wood piece on 
the X axis, the probability of it having tree seedlings growing on it on the 
Y axis (a binary yes-or-no variable), and three fitted curves showing how the 
probability of having tree seedlings changes with increasing wood piece size 
for each species.
  I have no problem generating fits using glm(), and no problem creating the 
plot.  However, if I try to add a fitted curve based only on the hem data 
subset to a plot that shows the entire dataset, I get an error message that 
the lengths of those data sets differ: "Error in xy.coords(x,y) : x and y 
lengths differ".  I could see R's point -- you can't plot a regression line 
of babies born as a function of stork abundance on a graph of cherries 
produced (Y) versus rainfall (X), which for all the program knows, I'm 
trying to do.  As a temporary fix, I added NAs to the end of the hem, yb, 
and sm subsets to make them the same length as the entire dataset.  I can 
now add my fitted curves to the plot, but the lines are not connected.  That 
is, if the hem group only contains wood pieces that are 1, 4, and 10 meters 
long, the plot has an X axis that ranges from 1 to 10, but line segments for 
the hem group regression line only appear above 1, 4, and 10.  How can I fix 
this?  An ideal solution would not require me to make the hem subset of my 
data the same length as the full dataset, either (although the summaries of 
regressions with the NAs (or zeroes) added and taken away are identical).  
I'd also settle for a work-around that would have R connect the pieces of 
the curve so that I get a solid line rather than small dots and dashes where 
actual data exist.  Thanks so much for your help!
  Laura Marx
  Michigan State University, Dept. of Forestry 

# Note: hemdata has all the rows that are not hemlock species replaced with
# NA.
# One logistic regression of seedling presence on log size per species subset.
hemhem <- glm(hempresence ~ logarea, family = binomial(logit), data = hemdata)
hemyb <- glm(hempresence ~ logarea, family = binomial(logit), data = birchdata)
hemsm <- glm(hempresence ~ logarea, family = binomial(logit), data = mapledata)

attach(logreg) #logreg is the full dataset
plot(logarea, hempresence, xlab = "Surface area of log (m2)",
     ylab = "Probability of hemlock seedling presence", type = "n",
     font.lab = 2, cex.lab = 1.5, axes = TRUE)
lines(logarea,fitted(hemhem), lty=1, lwd=2)
lines(logarea,fitted(hemyb), lty="dashed", lwd=2)
lines(logarea,fitted(hemsm), lty="dotted", lwd=2)
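
For comparison, here is a minimal sketch of one common way to get connected curves without padding the subsets: fit each model on its own subset, then draw predict() output over a sorted grid of x values instead of plotting fitted() against the full logarea column. The data below are simulated, and the species column name and the fits, grid, and probs objects are assumptions made for illustration; only hempresence and logarea come from the code above.

```r
# Simulated stand-in for the real data. The 'species' column name and all
# values are hypothetical, chosen only so the example runs on its own.
set.seed(1)
logreg <- data.frame(
  species = rep(c("hem", "yb", "sm"), times = c(40, 25, 15)),
  logarea = runif(80, min = 1, max = 10)
)
logreg$hempresence <- rbinom(80, size = 1,
                             prob = plogis(-2 + 0.4 * logreg$logarea))

# Fit one logistic regression per species on the true subsets (no NA padding).
fits <- lapply(split(logreg, logreg$species), function(d) {
  glm(hempresence ~ logarea, family = binomial(link = "logit"), data = d)
})

# A sorted, evenly spaced grid of x values spanning the whole dataset.
grid <- data.frame(logarea = seq(min(logreg$logarea), max(logreg$logarea),
                                 length.out = 200))

# Predicted probabilities on the grid: one column per species.
probs <- sapply(fits, predict, newdata = grid, type = "response")

# Empty plot of the full dataset, then one continuous curve per species.
plot(logreg$logarea, logreg$hempresence, type = "n",
     xlab = "Surface area of log (m2)",
     ylab = "Probability of hemlock seedling presence",
     font.lab = 2, cex.lab = 1.5)
lines(grid$logarea, probs[, "hem"], lty = 1, lwd = 2)
lines(grid$logarea, probs[, "yb"], lty = "dashed", lwd = 2)
lines(grid$logarea, probs[, "sm"], lty = "dotted", lwd = 2)
```

Because the grid has no NAs and is already in increasing order, lines() draws each curve as one unbroken line. Note that a grid spanning the full dataset extrapolates each curve beyond the sizes actually observed for that species; building a separate grid from each subset's own range of logarea would avoid that.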