[R] Casting lists to data.frames, analog to SAS
Matthew Pettis
matthew.pettis at gmail.com
Wed Jan 14 22:46:30 CET 2009
Thank you very much -- this was very helpful for differentiating among
the aggregating methods!
Matt
On Wed, Jan 14, 2009 at 3:42 PM, Marc Schwartz
<marc_schwartz at comcast.net> wrote:
> on 01/14/2009 02:51 PM Matthew Pettis wrote:
>> I have a specific question and a general question.
>>
>> Specific Question: I want to do an analysis on a data frame by 2 or more
>> class variables (i.e., use 2 or more columns in a dataframe to do
>> statistical classing). Coming from SAS, I'm used to being able to take a
>> data set and have the output of the analysis in a dataset for further
>> manipulation. I have a data set with vote totals, with one column being the
>> office name being voted on, and the other being the party of the candidate.
>> My votes are in the column "vc.n". I did the analysis I want with:
>>
>> work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
>>
>> the str() output of work looks like:
>>
>>> str(work)
>> 'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
>> - attr(*, "dimnames")=List of 2
>> ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR"
>> "SECRETARY OF STATE" "STATE AUDITOR" ...
>> ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
>> - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES =
>> sd62[, c("office.nm", "party.abbr")], FUN = sum)
>>
>>
>>
>>
>> work is now a list. I'd really like to have work be a data frame with 3
>> columns: The rows of the first two columns show the office and party levels
>> being considered, and the third being the sum of the votes for that level
>> combination. How do I cast this list/output into a data frame? Using
>> 'as.data.frame' doesn't work.
>>
>> General Question: I assume the answer to the specific question is dependent
>> on my understanding list objects and accessing their attributes. Can anyone
>> point me to a good, thorough treatment of these R topics? Specifically how
>> to read and interpret the output of the str() and attributes() functions,
>> how to extract the values of the 'by' output object into a data frame, etc.?
>>
>> Thanks,
>> Matt
>
> Matt,
>
> Welcome to R.
>
> The help pages for each function, while they can be intentionally terse,
> are a good first place to look. Many will include links/references to
> related sources.
>
> "An Introduction to R" is a good general place to start. A more thorough
> treatment is in the "R Language Definition" manual. There are also a
> plethora of contributed documents:
>
> http://cran.r-project.org/other-docs.html
>
> and books on R and using R within specific domains:
>
> http://www.r-project.org/doc/bib/R-books.html
>
>
> There are (at least) three ways to generate summary statistics based
> upon multi-level groupings. These include by(), tapply() and aggregate().
>
> The key difference between the three is the class/structure of the
> results object and the print (output) method. In the specific case of
> aggregate(), the summary function must also return a scalar. Thus, for
> example, unlike with by() and tapply(), you cannot use summary(), which
> returns multiple values.
>
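To make the point about multi-valued summaries concrete, here is a minimal sketch using tapply() with range() on the built-in warpbreaks data. The variable names idx and rng are only for illustration. When the function returns more than one value per group, tapply() is documented to return an array of mode list rather than a numeric matrix:

```r
# Grouping factors, named so the result carries labelled dimensions
idx <- list(wool = warpbreaks$wool, tension = warpbreaks$tension)

# range() returns two values per group, so the result cannot be
# simplified to a numeric matrix; each cell holds a vector instead
rng <- tapply(warpbreaks$breaks, idx, range)

is.list(rng)      # the cells are list elements, not single numbers
dim(rng)          # still laid out as wool x tension
rng[["A", "L"]]   # min and max of breaks for wool A, tension L
```

The layout by wool and tension is kept, but each cell has to be pulled out with [[ ]] before it can be used as a plain numeric vector.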
> Thus the choice of approach depends, to an extent, on what you may
> subsequently do with the data.
>
> As an example, using the same set of data (warpbreaks):
>
>> str(warpbreaks)
> 'data.frame': 54 obs. of 3 variables:
> $ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
> $ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
> $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
>
>
> # Use by()
>
>> by(warpbreaks[, 1],
> list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
> wool: A
> tension: L
> [1] 401
> ------------------------------------------------------
> wool: B
> tension: L
> [1] 254
> ------------------------------------------------------
> wool: A
> tension: M
> [1] 216
> ------------------------------------------------------
> wool: B
> tension: M
> [1] 259
> ------------------------------------------------------
> wool: A
> tension: H
> [1] 221
> ------------------------------------------------------
> wool: B
> tension: H
> [1] 169
>
>
>
> Note that because the result of using by() is, at its core, a
> matrix/table, you can also do the following, explicitly using the print
> method for a table:
>
>> print.table(by(warpbreaks[, 1],
> list(wool = warpbreaks$wool,
> tension = warpbreaks$tension), sum))
> tension
> wool L M H
> A 401 216 221
> B 254 259 169
>
>
> which gives you printed output in the same format as tapply() below,
> without altering the structure of the result itself.
>
>
> # tapply() directly gives you a tabular output
>
>> tapply(warpbreaks[, 1],
> list(wool = warpbreaks$wool, tension = warpbreaks$tension),
> sum)
> tension
> wool L M H
> A 401 216 221
> B 254 259 169
>
>
>
> Note that the structure of the result from by() and the result from
> tapply() are quite similar:
>
>> str(by(warpbreaks[, 1],
> list(wool = warpbreaks$wool, tension = warpbreaks$tension),
> sum))
> by [1:2, 1:3] 401 254 216 259 221 169
> - attr(*, "dimnames")=List of 2
> ..$ wool : chr [1:2] "A" "B"
> ..$ tension: chr [1:3] "L" "M" "H"
> - attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES
> = list(wool = warpbreaks$wool, tension = warpbreaks$tension), FUN =
> sum)
>
>
>> str(tapply(warpbreaks[, 1],
> list(wool = warpbreaks$wool, tension = warpbreaks$tension),
> sum))
> num [1:2, 1:3] 401 254 216 259 221 169
> - attr(*, "dimnames")=List of 2
> ..$ wool : chr [1:2] "A" "B"
> ..$ tension: chr [1:3] "L" "M" "H"
>
>
> Both are, at their core, 2 x 3 matrices.
>
> The key difference is in the 'class' of the result, which affects
> subsequent operations, such as the print method used.
>
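As a rough illustration of how the class and attributes shown by str() can be inspected directly, the sketch below compares the two results on warpbreaks. The names idx, by_res and tap_res are made up for this example, and the exact class vector printed for the tapply() result may vary with the R version:

```r
idx <- list(wool = warpbreaks$wool, tension = warpbreaks$tension)

by_res  <- by(warpbreaks[, 1], idx, sum)
tap_res <- tapply(warpbreaks[, 1], idx, sum)

class(by_res)               # "by", which is what selects the by print method
class(tap_res)              # a plain matrix class, no "by"
names(attributes(by_res))   # the pieces str() lists after "- attr(*, ...)"
dimnames(by_res)$tension    # labels of the second dimension

# Once attributes are stripped, the underlying numbers agree
all.equal(as.vector(by_res), as.vector(tap_res))
```

Each "- attr(*, ...)" line in the str() output corresponds to one entry of attributes(), and the class attribute is what drives the different printing.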
>
>
> # aggregate() gives you a data frame, with the summary statistic as the
> # 'x' column
>
>> aggregate(warpbreaks[, 1],
> list(wool = warpbreaks$wool, tension = warpbreaks$tension),
> sum)
> wool tension x
> 1 A L 401
> 2 B L 254
> 3 A M 216
> 4 B M 259
> 5 A H 221
> 6 B H 169
>
>
>> str(aggregate(warpbreaks[, 1],
> list(wool = warpbreaks$wool, tension = warpbreaks$tension),
> sum))
> 'data.frame': 6 obs. of 3 variables:
> $ wool : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
> $ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
> $ x : num 401 254 216 259 221 169
>
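In R versions that provide the formula method for aggregate(), the same summary can be written so that the result column keeps the variable name instead of 'x'. This is only a sketch, and the commented sd62 line is an untested guess at how it would map onto the vote data from the original question, which is not available here:

```r
# Formula interface: response ~ grouping variables
agg <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = sum)

names(agg)   # grouping columns first, then the summarised variable
agg

# Hypothetical equivalent for the vote data (untested, sd62 not available):
# aggregate(vc.n ~ office.nm + party.abbr, data = sd62, FUN = sum)
```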
>
> Thus, bottom line, given your intended application, I would suggest
> using aggregate() rather than by().
>
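If a by() or tapply() result already exists, one way it can usually be turned into a long data frame is to go through as.table() and the table method of as.data.frame(). The sketch below assumes the warpbreaks example. The column name "breaks.sum" and the object names are arbitrary choices, and unclass() is used on the by() result so that it is treated as a plain matrix:

```r
idx <- list(wool = warpbreaks$wool, tension = warpbreaks$tension)

# From a tapply() matrix: one row per wool/tension combination
tab  <- tapply(warpbreaks[, 1], idx, sum)
long <- as.data.frame(as.table(tab), responseName = "breaks.sum")
long

# From a by() object: drop the "by" class first, then convert the same way
by_res  <- by(warpbreaks[, 1], idx, sum)
long_by <- as.data.frame(as.table(unclass(by_res)),
                         responseName = "breaks.sum")

# Combinations with no data come through as NA; drop them if unwanted
long_by[!is.na(long_by$breaks.sum), ]
```

Unlike aggregate(), this route keeps a row for every combination of levels, including empty ones, which may or may not be what is wanted.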
> HTH,
>
> Marc Schwartz
>
--
One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors.
-- Plato
It is from the wellspring of our despair and the places that we are
broken that we come to repair the world.
-- Murray Waas