[R] question about using _apply and/or aggregate functions
Clifford Long
gnolffilc at gmail.com
Tue Jun 23 00:16:17 CEST 2009
Hi David,
I appreciate the advice. I had coerced 'list4' with as.list, but forgot
to wrap it in "list()" in the call to aggregate. I made the correction,
and now get the following:
> slope.mult = simarray[,1]
> adj.slope.value = simarray[,2]
> adj.slope.level = simarray[,2]
> qc.run.violation = simarray[,5]
> simarray.part = cbind(slope.mult, adj.slope.value, qc.run.violation, adj.slope.level)
> list4 = as.list(simarray.part[,4])
> agg.result = aggregate(simarray.part[,3], by=list(list4), FUN = mean)
Error in sort.list(unique.default(x), na.last = TRUE) :
'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
... I'm not sure what this tells me I've done wrong. I did check
'list4' using "is.list", and get "TRUE" back as an answer, so I suspect
that my mistake involves some other fundamental aspect of R that I'm
failing to grasp.
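
The two aggregate errors in this thread can be reproduced on a small scale. What follows is only a minimal sketch, assuming made-up data in place of simarray.part; the names slope.level and violation are hypothetical stand-ins for columns 4 and 3. It suggests that the trouble comes from as.list(), which turns a vector into one list element per value, whereas aggregate expects "by" to be a list whose elements are atomic vectors as long as the data.

```r
# Hypothetical stand-in for simarray.part: 3 slope levels, 4 runs each.
set.seed(1)
slope.level <- rep(c(0.1, 0.2, 0.3), each = 4)
violation   <- rbinom(length(slope.level), size = 1, prob = 0.5)

# as.list() on a vector gives one list element per value (12 here).
# Passed directly as 'by', that is 12 grouping variables of length 1;
# wrapped in list() again, it is one grouping variable that is itself
# a list rather than an atomic vector.
list4 <- as.list(slope.level)
length(list4)      # 12
is.atomic(list4)   # FALSE

# Wrapping the atomic vector itself in list() gives a single grouping
# variable of the right length.
agg.result <- aggregate(violation,
                        by = list(slope.level = slope.level),
                        FUN = mean)
agg.result
```

The result is a data frame with one row per slope level; the grouping column takes its name from the list, and the summarized column is called "x" because the input was a plain vector.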
To your note on 'tapply' ... I did try this as well (actually, tried
it first) with no initial success. On your recommendation, I gave
tapply another go, and get something recognizable:
> vtt = tapply(simarray.part[,3], simarray.part[,2], mean)
> dim(vtt)
[1] 50
> length(vtt)
[1] 50
> vtt[1:5]
0.003132 0.006264 0.009396 0.012528  0.01566
    0.29     0.24     0.23     0.16     0.22
> vtt[1]
0.003132
    0.29
> vtt[[1]][1]
[1] 0.29
I see that the output stored in "vtt" has one dimension with
length=50. But each place in vtt appears to hold two values. I'll
admit that I'm not sure how to access/reference the entirety of all 50
values = 0.29 0.24 0.23 0.16 0.22 (and so on). I don't appear to
be able to access/reference only what appears to be an embedded index
= 0.003132 0.006264 0.009396 etc. What am I missing? Is there
a reference that I need to re-read? I would like to be able to plot
one against the other.
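
For what it is worth, the two rows in the printout are probably not two values per slot. tapply returns a one-dimensional array whose top printed row is its names (the group labels, stored as character strings) and whose bottom row is the data. A minimal sketch, assuming made-up numbers in place of simarray.part (slope.value, violation, slope and p.hat are hypothetical names):

```r
# Hypothetical data standing in for simarray.part[,2] and simarray.part[,3].
slope.value <- rep(c(0.003132, 0.006264, 0.009396), each = 4)
violation   <- c(1, 0, 0, 0,  1, 1, 0, 0,  1, 1, 1, 0)

vtt <- tapply(violation, slope.value, mean)

# names() pulls out the group labels as character strings;
# as.numeric() converts them back to numbers.
slope <- as.numeric(names(vtt))

# as.vector() drops the names and dim, leaving just the group means.
p.hat <- as.vector(vtt)

# Plot one against the other.
plot(slope, p.hat, type = "b",
     xlab = "slope value", ylab = "mean of column 3")
```

Because the labels come back as character strings, the as.numeric() step matters; plotting names(vtt) directly would not give a numeric x-axis.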
Thanks again for taking the time outside of your "day job" for your
earlier reply!
Cliff
On Mon, Jun 22, 2009 at 11:28 AM, David Winsemius<dwinsemius at comcast.net> wrote:
>
> On Jun 22, 2009, at 12:04 PM, Clifford Long wrote:
>
>> Hi R-list,
>>
>> I'll apologize in advance for (1) the wordiness of my note (not sure
>> how to avoid it) and (2) any deficiencies on my part that lead to my
>> difficulties.
>>
>> I have an application with several stages that is meant to simulate
>> and explore different scenarios with respect to product sales (in
>> units sold per month). My session info is at the bottom of this note.
>>
>> The steps include (1) an initial linear regression model, (2) building
>> an ARIMA model, (3) creating 12 months of simulated 'real' data - for
>> various values of projected changes in slope from the linear
>> regression - from the ARIMA model using arima.sim with subsequent
>> undifferencing and appending to historical data, (4) one-step-ahead
>> forecasting for 12 months using the 'fixed' arima model and simulated
>> data, (5) taking the residuals from the forecasting (simulated minus
>> forecast for each of the 12 months) and applying QC charting, and (6)
>> counting the number (if any) of runs-rules violations detected.
>>
>> The simulation runs 100 iterations for each of 50 values of
>> slope.
>>
>> All of this seems to work fine (although I'm sure that the coding
>> could be improved through functions, vectorization, etc. ... ).
>> However, the problem I'm having is at the end where I had hoped that
>> things would be easier. I want to summarize and graph the probability
>> of detecting a runs-rule violation vs. the amount of the shift in
>> slope (of logunit sales).
>>
>> The output data array passed from the qcc section at the end includes:
>> - the adjustment made to the slope (a multiplier)
>> - the actual value of the slope
>> - the iteration number of the simulation loop (within each value of
>> slope)
>> - the count of QC charting limits violations
>> - the count of QC charting runs rules violations
>>
>>
>> My code is in the attached file ("generic_code.R") and my initial sales
>> data needed to "prime" the simulation is in the other attached file
>> ("generic_data.csv"). The relevant section of my code is at the
>> bottom of the .R file after the end of the outer loop. I've tried to
>> provide meaningful comments.
>>
>> I've read the help files for _apply, aggregate, arrays and data types,
>> and have also consulted with several texts (Maindonald and Braun;
>> Spector; Venables and Ripley for S-plus). Somehow I still don't get
>> it. My attempts usually result in a message like the following:
>>
>>> agg.result = aggregate(simarray.part[,3], by=list4, FUN = mean)
>>
>> Error in FUN(X[[1L]], ...) : arguments must have same length
>
> I cannot comment on the overall strategy, but wondered if this minor mod to
> the code might succeed:
>
>>> agg.result = aggregate(simarray.part[,3], by=list(list4), FUN = mean)
>
> My personal experience with aggregate() is not positive. I generally end up
> turning to tapply() (which is at the heart of aggregate() anyway) probably
> because I forget to wrap the second argument in a list. Slow learner, I
> guess.
>
>
>>
>> But when I check the length of the arguments, they appear to match. (??)
>>
>>> length(simarray.part[,3])
>>
>> [1] 5000
>>>
>>> length(simarray.part[,4])
>>
>> [1] 5000
>>>
>>> length(list4)
>>
>> [1] 5000
>>
>>
>> I would have rather passed along a subset of the simulation/loop
>> output dataset, but was unsure how to save it to preserve whatever
>> properties that I may have missed that are causing my difficulties.
>> If anyone still wants to help at this point, I believe that you'll
>> need to load the .csv data and run the simulation (maybe reducing the
>> number of iterations).
>>
>> Many thanks to anyone who can shed some light on my difficulties
>> (whether self-induced or otherwise).
>>
>> Cliff
>>
>>
>>
>> I'm using a pre-compiled binary version of R for Windows.
>>
>> Session info:
>>
>>> sessionInfo()
>>
>> R version 2.9.0 (2009-04-17)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>> States.1252;LC_MONETARY=English_United
>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] qcc_1.3 forecast_1.24 tseries_0.10-18 zoo_1.5-5
>> [5] quadprog_1.4-11
>>
>> loaded via a namespace (and not attached):
>> [1] grid_2.9.0 lattice_0.17-22
>>
>>
>>> Sys.getlocale()
>>
>> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>> States.1252;LC_MONETARY=English_United
>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
>