[R] Operating on count lists of non-equal lengths
Kari Manninen
kari at econadvisor.com
Mon Jan 10 06:01:06 CET 2011
Dear Dennis,
Thank you so very much for your quick reply. What an introduction to
R-help! I especially appreciated the time you took to explain the code
privately.
After a few hiccups I got it working on my data as well.
Regards,
- Kari
Quoting Dennis Murphy <djmuser at gmail.com>:
> Hi:
>
> This is an abridged version of the reply I sent privately to the OP.
>
> #### Generate an artificial data frame
> # function to randomly generate one of the Q* columns with length 1000
> mysamp <- function() sample(c(-1, 0, 1, NA), 1000, replace = TRUE,
>                             prob = c(0.35, 0.2, 0.4, 0.05))
>
> # use the above function to randomly generate 10 questions and assign them
> # names in the workspace
> for(i in 1:10) assign(paste('Q', i, sep = ''), mysamp())
> # create a data frame from the generated questions
> C <- data.frame(time = rep(1:4, each = 250),
>                 sector = sample(LETTERS[1:6], 1000, replace = TRUE),
>                 Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10)
> ####
>
> # A function to generate the scores from the combined questions
> # for an arbitrary input data frame d:
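> scorefun <- function(d) {
>     # tabulate every column, drop the time and sector tables, and arrange
>     # the question counts as a 3 x 10 matrix (table() omits NAs); this
>     # assumes each question shows all three answers in d, ordered -1, 0, 1
>     dm <- matrix(unlist(apply(d, 2, table)[-(1:2)]), nrow = 3)
>     # sum the counts within the five question groups
>     tsums <- cbind(rowSums(dm[, 1:3]), dm[, 4],
>                    rowSums(dm[, 5:6]), rowSums(dm[, 7:8]),
>                    rowSums(dm[, 9:10]))
>     # balance: (positive - negative) / non-missing total
>     dprop <- function(x) (x[3] - x[1])/sum(x)
>     100 * (1 + apply(tsums, 2, dprop))
> }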
>
> library(plyr)
> # Apply scorefun() to each sub-data frame corresponding to time-sector
> # combinations
> ddply(C, .(time, sector), scorefun)
>
> Dennis
>
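The approach above relies on every question showing all three answers within each time-sector cell. The following is a minimal base-R sketch of the same score that avoids that requirement by pooling the answers of each question group directly. The names `d`, `groups`, and `cell_scores` and the fixed seed are assumptions made for illustration and do not come from the thread.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
n <- 1000
ans <- function() sample(c(-1, 0, 1, NA), n, replace = TRUE,
                         prob = c(0.35, 0.2, 0.4, 0.05))

# Artificial data in the same shape as the example above
d <- data.frame(time = rep(1:4, each = n / 4),
                sector = sample(LETTERS[1:6], n, replace = TRUE))
for (i in 1:10) d[[paste0("Q", i)]] <- ans()

# Question groups as described in the original post
groups <- list(G1 = c("Q1", "Q2", "Q3"), G2 = "Q4", G3 = c("Q5", "Q6"),
               G4 = c("Q7", "Q8"), G5 = c("Q9", "Q10"))

# Scores for one time-sector cell: pool the answers of each group,
# drop NAs, then 100 + 100 * (share positive - share negative)
cell_scores <- function(cell) {
  sapply(groups, function(q) {
    x <- unlist(cell[q], use.names = FALSE)
    x <- x[!is.na(x)]
    100 + 100 * (mean(x == 1) - mean(x == -1))
  })
}

# One row per time-sector combination, one column per question group
cells <- split(d, list(d$time, d$sector), drop = TRUE)
scores <- t(sapply(cells, cell_scores))
head(scores)
```

Because the shares are computed with `mean()` on the pooled answers, a cell in which some answer never occurs simply contributes a share of zero instead of breaking the matrix layout.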
> On Sat, Jan 8, 2011 at 10:19 PM, Kari Manninen <kari at econadvisor.com> wrote:
>
>> This is my first post to R-help and I look forward to receiving some advice
>> for a novice like me...
>>
>> I’ve got simple repeated (4 periods so far) 10-question survey data that
>> is very easy to work with in Excel. However, I’d like to move the compilation
>> to R but I’m having some trouble operating on count list data in a neat way.
>>
>> The data C
>>
>> > str(C)
>> 'data.frame': 551 obs. of 13 variables:
>> $ TIME : int 1 1 1 1 1 1 1 1 1 1 ...
>> $ Sector : Factor w/ 6 levels "D","F","G","H",..: 1 1 1 1 1 1 1 1 1 1 ...
>> $ COMP : Factor w/ 196 levels " (_____ __ _____) ",..: 73 133 128 109 153 147 56 26 142 34 ...
>> $ Q1 : int 0 0 1 1 0 -1 -1 1 1 -1 ...
>> $ Q2 : int 0 0 0 -1 0 -1 0 0 1 -1 ...
>> $ Q3 : int 0 0 0 1 0 -1 -1 1 1 -1 ...
>> $ Q4 : int -1 0 0 0 0 -1 0 -1 0 -1 ...
>> $ Q5 : int 0 0 0 -1 0 -1 0 -1 0 0 ...
>> $ Q6 : int 0 0 0 1 0 -1 0 -1 0 0 ...
>> $ Q7 : int 0 1 1 0 0 0 1 0 1 1 ...
>> $ Q8 : int 0 0 0 0 0 -1 0 0 1 0 ...
>> $ Q9 : int 0 1 0 0 0 -1 0 -1 1 -1 ...
>> $ Q10 : int 0 0 0 0 -1 -1 0 -1 0 0 ...
>>
>> > summary(C)
>>       TIME        Sector        COMP           Q1               Q2
>>  Min.   :1.000   D:130   A      :  4   Min.   :-1.000   Min.   :-1.0000
>>  1st Qu.:2.000   F:126   B      :  4   1st Qu.: 0.000   1st Qu.: 0.0000
>>  Median :3.000   G:158   C      :  4   Median : 1.000   Median : 0.0000
>>  Mean   :2.684   H: 26   D      :  4   Mean   : 0.446   Mean   : 0.2178
>>  3rd Qu.:4.000   I: 20   E      :  4   3rd Qu.: 1.000   3rd Qu.: 1.0000
>>  Max.   :4.000   J: 91   F      :  4   Max.   : 1.000   Max.   : 1.0000
>>                          (Other):527   NA's   :60.000   NA's   :69.0000
>>
>>
>> The aim is to produce balance scores between the shares of positive and
>> negative answers in the data. First, counts of -1, 0 and 1 (negative, neutral,
>> positive) and missing NA (it would be so much simpler without the missing
>> values) for each question Q1-Q10 for each period (TIME) in 6 Sectors:
>>
>> b<-apply(C[,4:13], 2, function (x) tapply(x,C[,1:2], count))
>>
>> I know that b is a list of data.frames dim(4x6) for each question, where
>> each ‘cell’ is a count list.
>>
>> For example, for Question 1, Time period 2, Sector 1:
>>
>> > str(b$Q1[2,1])
>> List of 1
>>  $ :'data.frame': 4 obs. of 2 variables:
>>   ..$ x   : int [1:4] -1 0 1 NA
>>   ..$ freq: int [1:4] 3 9 12 2
>>
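For reference, a count list like the one shown above can also be produced in base R with `table()`. This is a small sketch that rebuilds the example cell from its counts; the vector `x` is reconstructed for illustration and is not the poster's actual data.

```r
# Rebuild one cell's answers from the counts shown above
# (3 negative, 9 neutral, 12 positive, 2 missing).
x <- c(rep(-1, 3), rep(0, 9), rep(1, 12), rep(NA, 2))

# Fixing the levels keeps all three answer categories even if one never
# occurs in a cell; useNA = "ifany" adds a separate count for NA.
counts <- table(factor(x, levels = c(-1, 0, 1)), useNA = "ifany")
print(counts)

# Without useNA, table() silently drops the missing values.
valid <- table(factor(x, levels = c(-1, 0, 1)))
print(valid)
```

Fixing the levels means every cell yields counts in the same order and of the same length, which makes cells directly comparable and summable.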
>> Now I would like to group questions (C[, 4:6], C[, 7], C[, 8:9], C[, 10:11]
>> and C[, 12:13]) and sum counts (-1, 0, 1) for these groups and present
>> them in percentage terms. I don’t know how to do this efficiently for the
>> whole data. I would not like to go through each cell separately.
>>
>> Then I’d give each group a balance score based on something like:
>>
>> Score = 100 + 100*[ pos% - neg%] for each group by TIME, Sector, while
>> excluding the missing observations.
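As a worked illustration of this score, here is a minimal R sketch applied to the example cell above (3 negative, 9 neutral, 12 positive, 2 missing). The helper name `balance_score` is hypothetical and not part of the thread.

```r
# Balance score for a vector of answers coded -1, 0, 1 (NA = missing).
# balance_score is a hypothetical helper name used only for illustration.
balance_score <- function(x) {
  x <- x[!is.na(x)]        # exclude missing observations
  pos <- mean(x == 1)      # share of positive answers
  neg <- mean(x == -1)     # share of negative answers
  100 + 100 * (pos - neg)
}

x <- c(rep(-1, 3), rep(0, 9), rep(1, 12), rep(NA, 2))
balance_score(x)  # 100 + 100 * (12/24 - 3/24) = 137.5
```

A score of 100 means positive and negative answers balance; the possible range is 0 (all negative) to 200 (all positive).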
>>
>> ### This is not working
>> Score <- 100 + 100*[sum(count( =="1")/sum(count(list( "-1", "0","1") -
>> sum(count( =="-1")/sum(count(list( "-1", "0","1")] for each 5 groups
>> defined above and by TIME, Sector
>>
>> I would greatly appreciate your help on this.
>>
>> Regards,
>> - Kari Manninen
>>
>