[R] Newbie struggling with "factors"

Frank E Harrell Jr fharrell at virginia.edu
Sat Mar 30 02:27:46 CET 2002


The Hmisc library has a multiple choice class that uses the matrix storage approach.  So far the class is used only by the summary.formula function, for table making.  The library is ready for beta testing by Linux/Unix users and will soon be ready for Windows users.  I will make a full announcement when the Windows port is ready.

You may obtain the Linux/Unix package from http://hesweb1.med.virginia.edu/biostat/s/Hmisc.html

For examples of using summary.formula for handling multiple choice data see
http://hesweb1.med.virginia.edu/biostat/s/help/Hmisc/html/summary.formula.html

Frank Harrell

On Fri, 29 Mar 2002 14:17:57 -0800 (PST)
Tom Arnold <thomas_l_arnold at yahoo.com> wrote:

> To all who have offered suggestions: 
> THANKS! Wow, this list has generated a lot of good
> ideas for me in a very short time, and I appreciate
> it.
> 
> For now, I've got some solutions to my problem. Greg's
> suggestion about creating a subclass to handle the
> "multi-checkbox" type of question is probably the most
> flexible, in the long run. However, I've not chosen it
> in the short run because my programming experience is
> deeper in the procedural vein than in OOP. I'm only
> just starting to see how the OO qualities of R can be
> used, and I'm not yet comfortable in coding that way.
> 
> Following the suggestions of several people on the
> list, I have created a few functions that proceed this
> way for my multi-choice questions:
> - create a matrix with as many rows as there are
> responses, and as many columns as there are
> "checkboxes" in the original question
> 
> - use strsplit to break up the factors based on the
> separator inside the field
> 
> - for each column in the matrix that I created, fill
> it with T/F (1/0) by using the is.element function to
> determine which responses had each checkbox checked
> 
> - use the resulting matrix to create whatever sums,
> averages and plots I want
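
The four steps listed above could be sketched roughly as follows. This is only a minimal illustration under assumptions: the helper name checkbox_matrix is hypothetical, the separator is assumed to be a comma, and the codes are assumed to run 1..k in the order of the box labels.

```r
# Hypothetical helper: turn a multi-checkbox field into a logical matrix.
# 'x' is the raw field (factor or character); 'boxnames' labels codes 1..k.
checkbox_matrix <- function(x, boxnames, sep = ",") {
  # strsplit() needs character input, so convert factors first
  picked <- strsplit(as.character(x), sep, fixed = TRUE)
  codes  <- as.character(seq_along(boxnames))
  # one row per response, one column per checkbox
  m <- matrix(FALSE, nrow = length(picked), ncol = length(codes),
              dimnames = list(NULL, boxnames))
  for (j in seq_along(codes)) {
    # TRUE where response i had checkbox j checked
    m[, j] <- sapply(picked, function(p) is.element(codes[j], p))
  }
  m
}

osuse <- factor(c("1", "1,3", "1,2,3"))
m <- checkbox_matrix(osuse, c("Windows", "Macintosh", "Unix"))
colSums(m)   # number of responses that checked each box
rowSums(m)   # number of boxes checked in each response
```

Keeping the result as a plain logical matrix means the usual tools (colSums, colMeans, barplot) apply directly, without any new class.
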
> 
> The code I wrote is not pretty, but is working for me
> at the moment. I'm an old assembly and C programmer
> mainly, so I'm still getting used to the capabilities
> and idioms of R. I think my code does great violence
> to both and probably makes the interpreter thrash
> pitifully, but for now it seems to produce the correct
> result and I can understand it! I'll look for elegance
> as I go along.
> --- "Warnes, Gregory R"
> <gregory_r_warnes at groton.pfizer.com> wrote:
> > 
> > Hint #1,  to do any useful transformations on your variables you will
> > probably need to convert them temporarily into character variables (aka
> > strings).  Do that with 
> > 
> > 	as.character(n$OSUSE)
> > 
> > You will probably want to convert each of the variables that are in this
> > format into a set of logical (TRUE/FALSE) indicator variables.  Something
> > like this:
> > 
> > 	n <- data.frame(OSUSE = c("1","1,3","1,2,3"))
> > 	# as.character() because strsplit() needs character, not factor, input
> > 	n$OSUSE.Windows   <- sapply( strsplit(as.character(n$OSUSE), ","), function(X) ( "1" %in% X ) )
> > 	n$OSUSE.Macintosh <- sapply( strsplit(as.character(n$OSUSE), ","), function(X) ( "2" %in% X ) )
> > 	n$OSUSE.Unix      <- sapply( strsplit(as.character(n$OSUSE), ","), function(X) ( "3" %in% X ) )
> > 
> > Alternatively, if you often have variables like this, you might consider
> > creating a new object type that extends factor and that includes the
> > operations that you need.  
> > 
> > Something like:
> > 
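> > ### Start Sample Code ###
> > 
> > # Constructor: tag a comma-separated checkbox field with its box labels.
> > checklist <- function(X, boxnames)
> >   {
> >     X <- as.factor(X)   # class "factor" can only be set on a real factor
> >     attr(X, "boxnames") <- boxnames
> >     class(X) <- c("checklist","factor")
> >     return(X)
> >   }
> > 
> > # For each response, was the box 'name' (a label or a numeric code) checked?
> > contains <- function(X, name)
> >   {
> >     if(is.character(name) )
> >       name <- pmatch( name, attr(X,"boxnames" ) )
> > 
> >     # strsplit() needs character input, so convert the factor first
> >     retval <- sapply( strsplit(as.character(X), ","), function(X) ( name %in% X ) )
> >     return(retval)
> >   }
> > 
> > # Number of boxes checked in each response
> > numchecked <- function(X)
> >   {
> >     retval <- sapply( strsplit(as.character(X), ","), length )
> >     return(retval)
> >   }
> > 
> > # Per-box count and proportion of responses with the box checked
> > summary.checklist <- function(x, ...)
> >   {
> >     sum <- apply( as.matrix(x), 2, sum )
> >     mean <- apply( as.matrix(x), 2, mean )
> >     return( rbind(sum,mean))
> >   }
> > 
> > # Logical matrix: one row per response, one column per box
> > as.matrix.checklist <- function(x, ...)
> >   {
> >     sapply( attr(x, "boxnames"), function(YY) contains(x, YY) )
> >   }
> > 
> > ### End Sample Code ###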
> > 
> > Here are some examples of using these functions:
> > 
> > > n <- data.frame(OSUSE = c("1","1,3","1,2,3"))
> > > 
> > > n$OSUSE <- checklist(n$OSUSE, c("Windows","Macintosh","Unix"))
> > #
> > # Check if OSUSE includes a specific OS
> > #
> > > contains( n$OSUSE, "Windows")
> > [1] TRUE TRUE TRUE
> > > contains( n$OSUSE, "Macintosh")
> > [1] FALSE FALSE  TRUE
> > > contains( n$OSUSE, "Unix")
> > [1] FALSE  TRUE  TRUE
> > >
> > #
> > # Compute the average number of checked items
> > # 
> > > numchecked(n$OSUSE)
> > [1] 1 2 3
> > > mean(numchecked(n$OSUSE))
> > [1] 2
> > > 
> > #
> > # Create a matrix showing whether each box was checked or not
> > #
> > > as.matrix(n$OSUSE)
> >      Windows Macintosh  Unix
> > [1,]    TRUE     FALSE FALSE
> > [2,]    TRUE     FALSE  TRUE
> > [3,]    TRUE      TRUE  TRUE
> > > 
> > #
> > # Show some summary info
> > #
> > > summary(n$OSUSE)
> >      Windows Macintosh      Unix
> > sum        3 1.0000000 2.0000000
> > mean       1 0.3333333 0.6666667		
> > 
> > 
> > Of course, you'll want to modify these classes to suit your needs.  A little
> > time up front can help a lot.
> > 
> > If you like, I'll include these classes and any enhancements that you make
> > in my 'gregmisc' library.
> > 
> > 
> > -Greg
> > 
> > 
> > > -----Original Message-----
> > > From: Tom Arnold [mailto:thomas_l_arnold at yahoo.com]
> > > Sent: Friday, March 29, 2002 8:59 AM
> > > To: R
> > > Subject: [R] Newbie struggling with "factors"
> > > 
> > > 
> > > I am processing some survey results, and my data are
> > > being read in as "factors". I don't know how to
> > > process these things in any way.
> > > 
> > > To start with, several of the survey questions are
> > > multi-choice check boxes on the original (web-based)
> > > survey, as in "check all that apply".
> > > 
> > > These are encoded as numbers. For example, if the
> > > survey has a question:
> > > Which operating systems have you used? (Check all that apply)
> > > [ ]Windows
> > > [ ]Macintosh
> > > [ ]Unix
> > > 
> > > ...then the data exported for three different
> > > responses might look like
> > > ;1;
> > > ;1,3;
> > > ;1,2,3;
> > > 
> > > ...where ";" is the field delimiter. 
> > > I use read.table to get the data in. I read all the
> > > survey data into a table "n" and the field above is
> > > called "OSUSE". When I query R about the field, it
> > > tells me it is class "factor"
> > > 
> > > > class(n$OSUSE)
> > > [1] "factor"
> > > > mode(n$OSUSE)
> > > [1] "numeric"
> > > 
> > > I'd like to be able to do some simple things like:
> > > what is the most common item checked (1, 2, or 3?)
> > > What is the average number of boxes checked?
> > > 
> > > But I can't find any way to manipulate this "factor"
> > > field. What's the secret?
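
For the two questions asked above (most common item, average number of boxes checked), one possible sketch follows. It assumes the field is a factor of comma-separated codes as described; the variable names are illustrative only.

```r
# Example data in the same shape as the OSUSE field described above
osuse <- factor(c("1", "1,3", "1,2,3"))

# Split each response into its individual codes (factor -> character first)
picked <- strsplit(as.character(osuse), ",", fixed = TRUE)

# Most common item: tabulate every code across all responses
counts <- table(unlist(picked))
most_common <- names(counts)[which.max(counts)]

# Average number of boxes checked per response
avg_checked <- mean(sapply(picked, length))
```

With the three example responses, code "1" is the most frequent and the average number of boxes checked is 2.
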
> > > 
> > > Thanks.
> > > 
> > > =====
> > > Tom Arnold
> > > Summit Media Partners
> > > Visit our web site at http://www.summitmediapartners.com
> > > 
> 
> 
> =====
> Tom Arnold
> Summit Media Partners
> Visit our web site at http://www.summitmediapartners.com
> 


-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._


