[R] Newbie struggling with "factors"
Frank E Harrell Jr
fharrell at virginia.edu
Sat Mar 30 02:27:46 CET 2002
The Hmisc library has a multiple choice class that uses the matrix storage approach. So far I use this class only in the summary.formula function, for table making. The library is ready for beta testing by Linux/Unix users and soon will be for Windows users. I will make a full announcement when the Windows port is ready.
You may obtain the Linux/Unix package from http://hesweb1.med.virginia.edu/biostat/s/Hmisc.html
For examples of using summary.formula for handling multiple choice data see
http://hesweb1.med.virginia.edu/biostat/s/help/Hmisc/html/summary.formula.html
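For readers who have not met the matrix storage approach, here is a minimal base R sketch of the idea: one logical column per checkbox, kept together as a single matrix inside the data frame. It does not use Hmisc and does not reproduce its class; the data, the column names and the `group` variable are invented for illustration.

```r
# Hypothetical survey: three respondents, three checkboxes, one grouping variable.
checked <- matrix(
  c(TRUE,  FALSE, FALSE,
    TRUE,  FALSE, TRUE,
    TRUE,  TRUE,  TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("Windows", "Macintosh", "Unix"))
)

# I() keeps the matrix as one column instead of splitting it into three.
survey <- data.frame(group = c("a", "b", "b"), OSUSE = I(checked))

# Counts and proportions per checkbox over all respondents.
overall <- rbind(n = colSums(survey$OSUSE), prop = colMeans(survey$OSUSE))

# Counts per checkbox within each group; rowsum() adds up rows sharing a group.
by_group <- rowsum(unclass(survey$OSUSE) * 1L, group = survey$group)

print(overall)
print(by_group)
```

Keeping the answers as one matrix column means a single variable name still refers to the whole question, which is convenient when building tables.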
Frank Harrell
On Fri, 29 Mar 2002 14:17:57 -0800 (PST)
Tom Arnold <thomas_l_arnold at yahoo.com> wrote:
> To all who have offered suggestions:
> THANKS! Wow, this list has generated a lot of good
> ideas for me in a very short time, and I appreciate
> it.
>
> For now, I've got some solutions to my problem. Greg's
> suggestion about creating a subclass to handle the
> "multi-checkbox" type of question is probably the most
> flexible, in the long run. However, I've not chosen it
> in the short run because my programming experience is
> deeper in the procedural vein than in OOP. I'm only
> just starting to see how the OO qualities of R can be
> used, and I'm not yet comfortable in coding that way.
>
> Following the suggestions of several people on the
> list, I have created a few functions that proceed this
> way for my multi-choice questions:
> - create a matrix with as many rows as there are
> responses, and as many columns as there are
> "checkboxes" in the original question
>
> - use strsplit to break up the factors based on the
> separator inside the field
>
> - for each column in the matrix that I created, fill
> it with T/F (1/0) by using the is.element function to
> determine which responses had each checkbox checked
>
> - use the resulting matrix to create whatever sums,
> averages and plots I want
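Tom's own code was not posted. The steps listed above could be sketched in base R roughly as follows; the function name `checkbox_matrix`, its arguments and the sample data are invented for illustration.

```r
# Build a logical matrix: one row per response, one column per checkbox.
# `x` holds strings such as "1,3"; `codes` are the checkbox codes used in the
# export; `boxnames` are labels for the columns. All names are hypothetical.
checkbox_matrix <- function(x, codes, boxnames, sep = ",") {
  picked <- strsplit(as.character(x), sep, fixed = TRUE)
  m <- matrix(FALSE, nrow = length(picked), ncol = length(codes),
              dimnames = list(NULL, boxnames))
  for (j in seq_along(codes)) {
    # TRUE where a response had checkbox j checked
    m[, j] <- vapply(picked, function(p) is.element(codes[j], p), logical(1))
  }
  m
}

osuse <- factor(c("1", "1,3", "1,2,3"))
m <- checkbox_matrix(osuse, codes = c("1", "2", "3"),
                     boxnames = c("Windows", "Macintosh", "Unix"))

colSums(m)    # how many responses checked each box
colMeans(m)   # proportion of responses checking each box
rowSums(m)    # number of boxes checked per response
```

The explicit loop over columns mirrors the procedural description; the same matrix could be built with a single sapply() call once that idiom feels comfortable.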
>
> The code I wrote is not pretty, but is working for me
> at the moment. I'm an old assembly and C programmer
> mainly, so I'm still getting used to the capabilities
> and idioms of R. I think my code does great violence
> to both and probably makes the interpreter thrash
> pitifully, but for now it seems to produce the correct
> result and I can understand it! I'll look for elegance
> as I go along.
> --- "Warnes, Gregory R"
> <gregory_r_warnes at groton.pfizer.com> wrote:
> >
> > Hint #1, to do any useful transformations on your variables you
> > will probably need to convert them temporarily into character
> > variables (aka strings). Do that with
> >
> > as.character(n$OSUSE)
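A short illustration of why the conversion matters, using made-up data of the same shape as the OSUSE field: a factor stores integer codes plus character labels, so a numeric conversion returns the codes rather than the text that was read in.

```r
# Made-up data shaped like the OSUSE field in the original question.
f <- factor(c("1", "1,3", "1,2,3"))

codes  <- as.integer(f)    # internal codes (positions in levels(f)), not the data
labels <- as.character(f)  # the original strings, safe to pass to strsplit()

# strsplit() needs a character vector; a factor is rejected with an error.
split_ok  <- strsplit(labels, ",")
split_bad <- tryCatch(strsplit(f, ","), error = function(e) conditionMessage(e))

print(codes)
print(labels)
print(split_bad)
```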
> >
> > You will probably want to convert each of the variables that are in
> > this format into a set of indicator (TRUE/FALSE) variables. Something
> > like this:
> >
> > n <- data.frame(OSUSE = c("1","1,3","1,2,3"))
> > # as.character() is needed because OSUSE may be stored as a factor
> > osuse <- strsplit(as.character(n$OSUSE), ",")
> > n$OSUSE.Windows   <- sapply(osuse, function(X) "1" %in% X)
> > n$OSUSE.Macintosh <- sapply(osuse, function(X) "2" %in% X)
> > n$OSUSE.Unix      <- sapply(osuse, function(X) "3" %in% X)
> >
> > Alternatively, if you often have variables like this, you might
> > consider creating a new object type that extends factor and that
> > includes the operations that you need.
> >
> > Something like:
> >
> > ### Start Sample Code ###
> >
> > checklist <- function(X, boxnames)
> > {
> >   X <- as.factor(X)   # class "factor" can only be added to a factor
> >   attr(X, "boxnames") <- boxnames
> >   class(X) <- c("checklist","factor")
> >   return(X)
> > }
> >
> > contains <- function(X, name)
> > {
> >   # translate a box name into its numeric code (position in boxnames)
> >   if(is.character(name) )
> >     name <- pmatch( name, attr(X,"boxnames" ) )
> >
> >   # strsplit() needs character input, so convert the factor first
> >   retval <- sapply( strsplit(as.character(X), ",") ,
> >                     function(Y) ( name %in% Y ) )
> >   return(retval)
> > }
> >
> > numchecked <- function(X)
> > {
> >   retval <- sapply( strsplit(as.character(X), ","), length )
> >   return(retval)
> > }
> >
> > summary.checklist <- function(object, ...)
> > {
> >   m <- as.matrix(object)
> >   return( rbind(sum = colSums(m), mean = colMeans(m)) )
> > }
> >
> > as.matrix.checklist <- function(x, ...)
> > {
> >   sapply( attr(x, "boxnames"), function(YY) contains(x, YY) )
> > }
> >
> > ### End Sample Code ##
> >
> > Here are some examples of using these functions:
> >
> > > n <- data.frame(OSUSE = c("1","1,3","1,2,3"))
> > >
> > > n$OSUSE <- checklist(n$OSUSE, c("Windows","Macintosh","Unix"))
> > #
> > # Check if OSUSE includes a specific OS
> > #
> > > contains( n$OSUSE, "Windows")
> > [1] TRUE TRUE TRUE
> > > contains( n$OSUSE, "Macintosh")
> > [1] FALSE FALSE TRUE
> > > contains( n$OSUSE, "Unix")
> > [1] FALSE TRUE TRUE
> > >
> > #
> > # Compute the average number of checked items
> > #
> > > numchecked(n$OSUSE)
> > [1] 1 2 3
> > > mean(numchecked(n$OSUSE))
> > [1] 2
> > >
> > #
> > # Create a matrix showing whether each box was checked or not
> > #
> > > as.matrix(n$OSUSE)
> >      Windows Macintosh  Unix
> > [1,]    TRUE     FALSE FALSE
> > [2,]    TRUE     FALSE  TRUE
> > [3,]    TRUE      TRUE  TRUE
> > >
> > #
> > # Show some summary info
> > #
> > > summary(n$OSUSE)
> >      Windows Macintosh      Unix
> > sum        3 1.0000000 2.0000000
> > mean       1 0.3333333 0.6666667
> >
> >
> > Of course, you'll want to modify these classes to suit your needs.
> > A little time up front can help a lot.
> >
> > If you like, I'll include these classes and any enhancements that
> > you make in my 'gregmisc' library.
> >
> >
> > -Greg
> >
> >
> > > -----Original Message-----
> > > From: Tom Arnold [mailto:thomas_l_arnold at yahoo.com]
> > > Sent: Friday, March 29, 2002 8:59 AM
> > > To: R
> > > Subject: [R] Newbie struggling with "factors"
> > >
> > >
> > > I am processing some survey results, and my data are
> > > being read in as "factors". I don't know how to
> > > process these things in any way.
> > >
> > > To start with, several of the survey questions are
> > > multi-choice check boxes on the original (web-based)
> > > survey, as in "check all that apply".
> > >
> > > These are encoded as numbers. For example, if the
> > > survey has a question:
> > > Which operating systems have you used? (Check all that
> > > apply)
> > > [ ]Windows
> > > [ ]Macintosh
> > > [ ]Unix
> > >
> > > ...then the data exported for three different
> > > responses might look like
> > > ;1;
> > > ;1,3;
> > > ;1,2,3;
> > >
> > > ...where ";" is the field delimiter.
> > > I use read.table to get the data in. I read all the
> > > survey data into a table "n" and the field above is
> > > called "OSUSE". When I query R about the field, it
> > > tells me it is class "factor"
> > >
> > > > class(n$OSUSE)
> > > [1] "factor"
> > > > mode(n$OSUSE)
> > > [1] "numeric"
> > >
> > > I'd like to be able to do some simple things like:
> > > what is the most common item checked (1, 2, or 3?)
> > > What is the average number of boxes checked?
> > >
> > > But I can't find any way to manipulate this "factor"
> > > field. What's the secret?
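Both questions can also be answered without building a matrix, by tabulating the individual codes directly. A minimal sketch with made-up data in the same format as the export described above (the variable names are illustrative only):

```r
osuse <- factor(c("1", "1,3", "1,2,3"))   # stands in for n$OSUSE

# One character vector of checked codes per response.
picked <- strsplit(as.character(osuse), ",", fixed = TRUE)

# How often each code was checked across all responses.
counts <- table(unlist(picked))

# Code(s) checked most often; holds several codes if there is a tie.
most_common <- names(counts)[as.vector(counts) == max(counts)]

# Average number of boxes checked per response.
avg_checked <- mean(lengths(picked))

print(counts)
print(most_common)
print(avg_checked)
```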
> > >
> > > Thanks.
> > >
> > > =====
> > > Tom Arnold
> > > Summit Media Partners
> > > Visit our web site at
> > http://www.summitmediapartners.com
> > >
> === message truncated ===
>
>
> =====
> Tom Arnold
> Summit Media Partners
> Visit our web site at http://www.summitmediapartners.com
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
--
Frank E Harrell Jr Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat