[R] How to separate a data set by its factors

David Winsemius dwinsemius at comcast.net
Fri Dec 25 20:00:45 CET 2009


On Dec 25, 2009, at 9:38 AM, James Rome wrote:

> Thanks for the help.
>
> I tried making the pdf file as suggested. Acrobat said it was damaged
> and could not be opened. Is this an R bug?

Hard to say. Graphics devices vary from OS to OS and I am on a Mac  
using a 64 bit bit version of R 2.10.1. I get no error with opening  
the pdf file so created using either the native Mac graphic viewer,  
Preview, or the Adobe Acrobat Reader. To decide whether you have  
identified a bug you would need to provide full sessionInfo and code,  
and then someone using your OS could try to reproduce the problem.

> It did make a PostScript file that I was able to distill into PDF, but
> it was gray scales. How do I get the color back?
> And yes, I did do the layout I wanted so I could see how the days
> compared for each hour.
>
> On 12/24/09 4:56 PM, David Winsemius wrote:
>>
>>
>> pdf(test.pdf")
>> xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width |
>> Species, data = iris, scales = "free", layout = c(2, 1, 2),  
>> auto.key =
>> list(x = .6, y = .7, corner = c(0, 0)))
>> dev.off()
>> You may not be getting what you expect, but it may be that your plots
>> are all being created, but too quickly to be seen. Try printing to a
>> more durable "canvas".
>>
>>> And I would like to add a Poisson Distribution fit to each of these
>>> plots (see below), but am clueless as to how to go about it.
>>>
>>> I would like to fit a distribution to the count data for each
>>> combination of day and hour, and I am unable to see how to do this  
>>> in a
>>> vector manner.  For example, I tried
>>> density((Arrival.Val | DAY*Hour), na.rm=TRUE)
>>> which does not work.
>>
>> I should think the this would be informative:
>>
>> glm(Arrival.Val ~ DAY*Hour, family="poisson")
>>
>> Since DAY and Hour are factors you will get a large number of
>> estimates. You can use the typical regression functions, such as
>> predict() and summary() to get the fitted values.
>>
> I tried glm:
> ---------
>> glm(Arrival.Val ~ DAY*as.factor(Hour), family="poisson")
>
> Call:  glm(formula = Arrival.Val ~ DAY * as.factor(Hour), family =
> "poisson")
>

This output came across rather mangled.

> Coefficients:
>                           (Intercept)
> DAY[T.Monday]
>                               3.15396
> -0.61348
>                       DAY[T.Saturday]
> DAY[T.Sunday]
>                              -0.43853
> -0.93475
>

  snipped

>
> 0.39860
>  DAY[T.Tuesday]:as.factor(Hour)[T.23]
> DAY[T.Wednesday]:as.factor(Hour)[T.23]
>                               0.43209
> 0.49274
>
> Degrees of Freedom: 8124 Total (i.e. Null);  7963 Residual
>  (18 observations deleted due to missingness)
> Null Deviance:        40120
> Residual Deviance: 17030     AIC: 59170
> ----------------
> I am not sure what to make of this.

Those are estimated Poisson means (on a log-scale) for each of your  
factors, DAY and Hour.


> So how do I get the fit plotted on top of my histograms?

See if this helps understand how a model relates to a data situation:
 > testsim <- data.frame(Arrivals = rpois(24*3,  
lambda=c(rep(10,5),rep(40,4),rep(20,6), rep(40,4), rep(10,  
24-5-4-6-4))), Hour= factor(rep(1:24, 3)), DAY=Sys.Date()+1:3 )

 >testsim

 > testsim
    Arrivals Hour        DAY
1         9    1 2009-12-26
2         6    2 2009-12-27
3        10    3 2009-12-28
4        10    4 2009-12-26
5        13    5 2009-12-27
6        34    6 2009-12-28
7        34    7 2009-12-26
8        35    8 2009-12-27
# output elided (72 lines)

 > glm(Arrivals ~ Hour, data=testsim)

Call:  glm(formula = Arrivals ~ Hour, data = testsim)

Coefficients:
(Intercept)        Hour2        Hour3        Hour4        Hour5         
Hour6        Hour7
      8.3333      -1.3333       0.3333       0.6667       4.0000       
29.6667      28.0000
       Hour8        Hour9       Hour10       Hour11       Hour12        
Hour13       Hour14
     30.3333      32.3333      11.3333      11.3333      10.6667       
11.3333      12.6667
      Hour15       Hour16       Hour17       Hour18       Hour19        
Hour20       Hour21
      8.6667      31.0000      26.6667      33.0000      28.6667        
4.0000      -1.0000
      Hour22       Hour23       Hour24
      0.3333       4.6667       0.3333

Degrees of Freedom: 71 Total (i.e. Null);  48 Residual
Null Deviance:	    12400
Residual Deviance: 1004 	AIC: 444.1

The estimates are Intercept + factor_coefficient.

>
> Is there a way to save the bin data from the histogram command?

I don't know if the lattice function supports that, but the base  
graphics function hist lets you get the breaks and counts.

?hist

You might need to wrap it in a call to tapply, since the help page  
does not say that a formula method is available, and I do not see a  
formula method with methods(hist). (There might be easier ways to get  
counts, such as xtabs() which supports a formula method.

?xtabs

>>
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
> Again Thanks for the prompt holiday response.
> Jim Rome

David Winsemius, MD
Heritage Laboratories
West Hartford, CT




More information about the R-help mailing list