[R] Newbie struggling with "factors"

Tom Arnold thomas_l_arnold at yahoo.com
Fri Mar 29 23:17:57 CET 2002


To all who have offered suggestions: 
THANKS! Wow, this list has generated a lot of good
ideas for me in a very short time, and I appreciate
it.

For now, I've got some solutions to my problem. Greg's
suggestion about creating a subclass to handle the
"multi-checkbox" type of question is probably the most
flexible, in the long run. However, I've not chosen it
in the short run because my programming experience is
deeper in the procedural vein than in OOP. I'm only
just starting to see how the OO qualities of R can be
used, and I'm not yet comfortable in coding that way.

Following the suggestions of several people on the
list, I have created a few functions that proceed this
way for my multi-choice questions:
- create a matrix with as many rows as there are
responses, and as many columns as there are
"checkboxes" in the original question

- use strsplit to break up the factors based on the
separator inside the field

- for each column in the matrix that I created, fill
it with T/F (1/0) by using the is.element function to
determine which responses had each checkbox checked

- use the resulting matrix to create whatever sums,
averages and plots I want
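
A minimal sketch of the steps above, not the code actually written for this post. It assumes the sample responses from the original question, and the variable names (osuse, boxes, picked, checked) are made up for illustration:

```r
# Hypothetical sample data: three responses to the "check all that apply"
# question, stored as a factor (the class read.table reported in this thread).
osuse <- factor(c("1", "1,3", "1,2,3"))

# Assumed coding of the checkboxes: 1 = Windows, 2 = Macintosh, 3 = Unix.
boxes <- c(Windows = "1", Macintosh = "2", Unix = "3")

# Split each response on the separator; strsplit() needs character input,
# so the factor is converted first.
picked <- strsplit(as.character(osuse), ",")

# One row per response, one column per checkbox, initially all FALSE.
checked <- matrix(FALSE, nrow = length(picked), ncol = length(boxes),
                  dimnames = list(NULL, names(boxes)))

# Fill each column: TRUE where that checkbox's code is among the codes picked.
for (j in seq_along(boxes)) {
  checked[, j] <- sapply(picked, function(codes) is.element(boxes[[j]], codes))
}

box_counts   <- colSums(checked)               # how often each box was checked
most_common  <- names(which.max(box_counts))   # the most frequently checked box
mean_checked <- mean(rowSums(checked))         # average number of boxes per response
```

With a logical matrix like this, colSums() and rowSums() answer the two questions from the original post: which item is checked most often, and how many boxes are checked on average.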

The code I wrote is not pretty, but is working for me
at the moment. I'm an old assembly and C programmer
mainly, so I'm still getting used to the capabilities
and idioms of R. I think my code does great violence
to both and probably makes the interpreter thrash
pitifully, but for now it seems to produce the correct
result and I can understand it! I'll look for elegance
as I go along.
--- "Warnes, Gregory R"
<gregory_r_warnes at groton.pfizer.com> wrote:
> 
> Hint #1,  to do any useful transformations on your
> variables you will
> probably need to convert them temporarily into
> character variables (aka
> strings).  Do that with 
> 
> 	as.character(n$OSUSE)
> 
> Probably you will want to convert each of the
> variables that are in this
> format into a set of numeric variables.  Something
> like this:
> 
> 	n <- data.frame(OSUSE = c("1","1,3","1,2,3"))
> 	n$OSUSE.Windows   <- sapply( strsplit(as.character(n$OSUSE), ","), function(X) ( "1" %in% X ) )
> 	n$OSUSE.Macintosh <- sapply( strsplit(as.character(n$OSUSE), ","), function(X) ( "2" %in% X ) )
> 	n$OSUSE.Unix      <- sapply( strsplit(as.character(n$OSUSE), ","), function(X) ( "3" %in% X ) )
> 
> Alternatively, if you often have variables like
> this, you might consider
> creating a new object type that extends factor and
> that includes the
> operations that you need.  
> 
> Something like:
> 
> ### Start Sample Code ###
> 
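> checklist <- function(X, boxnames)
>   {
>     X <- as.factor(X)   # accept character input as well as factors
>     attr(X, "boxnames") <- boxnames
>     class(X) <- c("checklist","factor")
>     return(X)
>   }
> 
> contains <- function(X, name)
>   {
>     # translate a box name into its numeric code (its position in boxnames)
>     if(is.character(name) )
>       name <- pmatch( name, attr(X,"boxnames" ) )
> 
>     # strsplit() needs character input, so convert the factor first
>     retval <- sapply( strsplit(as.character(X), ","), function(X) ( name %in% X ) )
>     return(retval)
>   }
> 
> numchecked <- function(X)
>   {
>     # number of boxes checked in each response
>     retval <- sapply( strsplit(as.character(X), ","), length )
>     return(retval)
>   }
> 
> summary.checklist <- function(x, ...)
>   {
>     # per-box count and proportion of responses with that box checked
>     sum <- apply( as.matrix(x), 2, sum )
>     mean <- apply( as.matrix(x), 2, mean )
>     return( rbind(sum,mean))
>   }
> 
> as.matrix.checklist <- function(x, ...)
>   {
>     # one logical column per box name
>     sapply( attr(x, "boxnames"), function(YY) contains(x, YY) )
>   }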
> 
> ### End Sample Code ##
> 
> Here's some examples of using these functions:
> 
> > n <- data.frame(OSUSE = c("1","1,3","1,2,3"))
> > 
> > n$OSUSE <- checklist(n$OSUSE, c("Windows","Macintosh","Unix"))
> #
> # Check if OSUSE includes a specific OS
> #
> > contains( n$OSUSE, "Windows")
> [1] TRUE TRUE TRUE
> > contains( n$OSUSE, "Macintosh")
> [1] FALSE FALSE  TRUE
> > contains( n$OSUSE, "Unix")
> [1] FALSE  TRUE  TRUE
> >
> #
> # Compute the average number of checked items
> # 
> > numchecked(n$OSUSE)
> [1] 1 2 3
> > mean(numchecked(n$OSUSE))
> [1] 2
> > 
> #
> # Create a matrix showing whether each box was checked or not
> #
> > as.matrix(n$OSUSE)
>      Windows Macintosh  Unix
> [1,]    TRUE     FALSE FALSE
> [2,]    TRUE     FALSE  TRUE
> [3,]    TRUE      TRUE  TRUE
> > 
> #
> # Show some summary info
> #
> > summary(n$OSUSE)
>      Windows Macintosh      Unix
> sum        3 1.0000000 2.0000000
> mean       1 0.3333333 0.6666667		
> 
> 
> Of course, you'll want to modify these classes to
> suit your needs.  A little
> time up front can help a lot.
> 
> If you like, I'll include these classes and any
> enhancements that you make
> in my 'gregmisc' library.
> 
> 
> -Greg
> 
> 
> > -----Original Message-----
> > From: Tom Arnold [mailto:thomas_l_arnold at yahoo.com]
> > Sent: Friday, March 29, 2002 8:59 AM
> > To: R
> > Subject: [R] Newbie struggling with "factors"
> > 
> > 
> > I am processing some survey results, and my data are
> > being read in as "factors". I don't know how to
> > process these things in any way.
> > 
> > To start with, several of the survey questions are
> > multi-choice check boxes on the original (web-based)
> > survey, as in "check all that apply".
> > 
> > These are encoded as numbers. For example, if the
> > survey has a question:
> > Which operating systems have you used? (Check all that apply)
> > [ ]Windows
> > [ ]Macintosh
> > [ ]Unix
> > 
> > ...then the data exported for three different
> > responses might look like
> > ;1;
> > ;1,3;
> > ;1,2,3;
> > 
> > ...where ";" is the field delimiter. 
> > I use read.table to get the data in. I read all the
> > survey data into a table "n" and the field above is
> > called "OSUSE". When I query R about the field, it
> > tells me it is class "factor"
> > 
> > > class(n$OSUSE)
> > [1] "factor"
> > > mode(n$OSUSE)
> > [1] "numeric"
> > 
> > I'd like to be able to do some simple things like:
> > what is the most common item checked (1, 2, or 3?)
> > What is the average number of boxes checked?
> > 
> > But I can't find any way to manipulate this "factor"
> > field. What's the secret?
> > 
> > Thanks.
> > 
> > =====
> > Tom Arnold
> > Summit Media Partners
> > Visit our web site at http://www.summitmediapartners.com
> > 
=== message truncated ===


=====
Tom Arnold
Summit Media Partners
Visit our web site at http://www.summitmediapartners.com

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._


