[R] How to separate a data set by its factors
David Winsemius
dwinsemius at comcast.net
Fri Dec 25 20:00:45 CET 2009
On Dec 25, 2009, at 9:38 AM, James Rome wrote:
> Thanks for the help.
>
> I tried making the pdf file as suggested. Acrobat said it was damaged
> and could not be opened. Is this an R bug?
Hard to say. Graphics devices vary from OS to OS and I am on a Mac
using a 64 bit bit version of R 2.10.1. I get no error with opening
the pdf file so created using either the native Mac graphic viewer,
Preview, or the Adobe Acrobat Reader. To decide whether you have
identified a bug you would need to provide full sessionInfo and code,
and then someone using your OS could try to reproduce the problem.
> It did make a PostScript file that I was able to distill into PDF, but
> it was gray scales. How do I get the color back?
> And yes, I did do the layout I wanted so I could see how the days
> compared for each hour.
>
> On 12/24/09 4:56 PM, David Winsemius wrote:
>>
>>
>> pdf(test.pdf")
>> xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width |
>> Species, data = iris, scales = "free", layout = c(2, 1, 2),
>> auto.key =
>> list(x = .6, y = .7, corner = c(0, 0)))
>> dev.off()
>> You may not be getting what you expect, but it may be that your plots
>> are all being created, but too quickly to be seen. Try printing to a
>> more durable "canvas".
>>
>>> And I would like to add a Poisson Distribution fit to each of these
>>> plots (see below), but am clueless as to how to go about it.
>>>
>>> I would like to fit a distribution to the count data for each
>>> combination of day and hour, and I am unable to see how to do this
>>> in a
>>> vector manner. For example, I tried
>>> density((Arrival.Val | DAY*Hour), na.rm=TRUE)
>>> which does not work.
>>
>> I should think the this would be informative:
>>
>> glm(Arrival.Val ~ DAY*Hour, family="poisson")
>>
>> Since DAY and Hour are factors you will get a large number of
>> estimates. You can use the typical regression functions, such as
>> predict() and summary() to get the fitted values.
>>
> I tried glm:
> ---------
>> glm(Arrival.Val ~ DAY*as.factor(Hour), family="poisson")
>
> Call: glm(formula = Arrival.Val ~ DAY * as.factor(Hour), family =
> "poisson")
>
This output came across rather mangled.
> Coefficients:
> (Intercept)
> DAY[T.Monday]
> 3.15396
> -0.61348
> DAY[T.Saturday]
> DAY[T.Sunday]
> -0.43853
> -0.93475
>
snipped
>
> 0.39860
> DAY[T.Tuesday]:as.factor(Hour)[T.23]
> DAY[T.Wednesday]:as.factor(Hour)[T.23]
> 0.43209
> 0.49274
>
> Degrees of Freedom: 8124 Total (i.e. Null); 7963 Residual
> (18 observations deleted due to missingness)
> Null Deviance: 40120
> Residual Deviance: 17030 AIC: 59170
> ----------------
> I am not sure what to make of this.
Those are estimated Poisson means (on a log-scale) for each of your
factors, DAY and Hour.
> So how do I get the fit plotted on top of my histograms?
See if this helps understand how a model relates to a data situation:
> testsim <- data.frame(Arrivals = rpois(24*3,
lambda=c(rep(10,5),rep(40,4),rep(20,6), rep(40,4), rep(10,
24-5-4-6-4))), Hour= factor(rep(1:24, 3)), DAY=Sys.Date()+1:3 )
>testsim
> testsim
Arrivals Hour DAY
1 9 1 2009-12-26
2 6 2 2009-12-27
3 10 3 2009-12-28
4 10 4 2009-12-26
5 13 5 2009-12-27
6 34 6 2009-12-28
7 34 7 2009-12-26
8 35 8 2009-12-27
# output elided (72 lines)
> glm(Arrivals ~ Hour, data=testsim)
Call: glm(formula = Arrivals ~ Hour, data = testsim)
Coefficients:
(Intercept) Hour2 Hour3 Hour4 Hour5
Hour6 Hour7
8.3333 -1.3333 0.3333 0.6667 4.0000
29.6667 28.0000
Hour8 Hour9 Hour10 Hour11 Hour12
Hour13 Hour14
30.3333 32.3333 11.3333 11.3333 10.6667
11.3333 12.6667
Hour15 Hour16 Hour17 Hour18 Hour19
Hour20 Hour21
8.6667 31.0000 26.6667 33.0000 28.6667
4.0000 -1.0000
Hour22 Hour23 Hour24
0.3333 4.6667 0.3333
Degrees of Freedom: 71 Total (i.e. Null); 48 Residual
Null Deviance: 12400
Residual Deviance: 1004 AIC: 444.1
The estimates are Intercept + factor_coefficient.
>
> Is there a way to save the bin data from the histogram command?
I don't know if the lattice function supports that, but the base
graphics function hist lets you get the breaks and counts.
?hist
You might need to wrap it in a call to tapply, since the help page
does not say that a formula method is available, and I do not see a
formula method with methods(hist). (There might be easier ways to get
counts, such as xtabs() which supports a formula method.
?xtabs
>>
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
> Again Thanks for the prompt holiday response.
> Jim Rome
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
More information about the R-help
mailing list