[R] Having trouble converting a dataframe of character vectors to factors
Lopez, Dan
lopez235 at llnl.gov
Thu Feb 21 23:50:35 CET 2013
Hi Bill,
Great info.
The problem is what was originally given to me looks like DPUT1 below (random sample of 25).
This is the only format they can give me this in and the data already looks molten. So I applied reshape2::dcast which resulted in a dataframe made of character vectors; except for the first column which is an integer vector.
So after dropping columns full of "" (blanks) and reordering columns I figured I needed factors to accomplish my goal (refer below) and converted everything to factors with:
> x2[,-1]<-as.data.frame(lapply(x[,-1],as.factor))
and ended up with DPUT2 below (random sample of 25)
Now after reading your last email I figure I've done well, since no attributes or levels got dropped (I just need to add some levels that couldn't be derived from the original dataframe) and the column names seem fine.
Now I have a new problem: how to reorder levels across a dataframe and possibly add some unused ones. After viewing the contents with Hmisc::contents, I figured the next logical step is to handle the columns like vectors, a chunk at a time.
For example subsetting to grepl("Q1_",names(scs.c2)) gives these vectors which all have identical levels except for one:
$Q1_1 thru $Q1_7 except $Q1_3
[1] "" "important" "not important" "somewhat important" "very important"
$Q1_3
[1] "important" "not important" "somewhat important" "very important"
#So I tried this, which had no effect
keepcols<- grepl("Q1_",names(scs.c2))
levels(scs.c2[,keepcols])<-list(NoResp="",NotImportant="not important",SomewhatImpt="somewhat important",Important="important",VeryImpt="very important")
#then this which also failed. It coerced a bunch of NA's and turned the vectors back to character vectors
scs.c2[,keepcols]<-sapply(scs.c2[,keepcols],function(x) factor(x,levels(x)[c(NoResp="",NotImportant="not important",SomewhatImpt="somewhat important",Important="important",VeryImpt="very important")]))
Mind you, I can easily do this in MS Excel, and that is probably what I will break down and do fairly soon. But I wanted to give this a good solid shot in R because I want to learn to handle these situations in R. I've been using R for almost a year.
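A minimal sketch of one way to do the releveling described above, offered as an illustration rather than as the answer given in the thread. It rebuilds each Q1_ column with factor(), passing an explicit levels= order and matching labels=, and assigns the lapply() result back into the selected columns so they stay factors. The data frame df is a made-up stand-in for scs.c2, and the level order is assumed from the list() in the first attempt.

```r
# Toy stand-in for scs.c2 (hypothetical data, not the real survey)
df <- data.frame(
  svaID = 1:4,
  Q1_1 = factor(c("", "important", "very important", "not important")),
  Q1_3 = factor(c("important", "very important", "important", "not important"))
)

# Desired level order and the new names for those levels (assumed from the post)
lev <- c("", "not important", "somewhat important", "important", "very important")
lab <- c("NoResp", "NotImportant", "SomewhatImpt", "Important", "VeryImpt")

keepcols <- grepl("^Q1_", names(df))

# lapply() returns a list of factors, so assigning back keeps each column a factor.
# Levels absent from a column (e.g. "" in Q1_3) are added as unused levels.
df[keepcols] <- lapply(df[keepcols], factor, levels = lev, labels = lab)

str(df)
```

The first attempt appears to have no effect because levels<- is applied to the extracted data frame rather than to each column, and the second loses the factors because sapply() simplifies its result to a matrix.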
__________________________
ADDITIONAL BACKGROUND
MY GOAL
I ultimately want to get started with some basic correlation analysis for some of the columns. Taking your example (slightly modified), I hope to be able to do this:
xx <- data.frame(stringsAsFactors=FALSE, check.names=FALSE,"No/Yes" = factor(c("Yes","No","No","No"), levels=c("No","Yes")),
"Size" = ordered(c("Small","Large","Medium","Medium"), levels=c("Small","Medium","Large")),"Name" = c("Adam","Bill","Chuck","Larry"))
> cor(sapply(xx[,1:2],as.numeric))
           No/Yes       Size
No/Yes  1.0000000 -0.8164966
Size   -0.8164966  1.0000000
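One thing to watch when applying this to the survey columns: as.numeric() on a factor returns the integer level codes, so a blank "" level would be scored as if it were a real answer. The sketch below, using made-up answer vectors a and b and an assumed level order, leaves "" out of levels= so that blanks become NA, and then asks cor() to use pairwise-complete rows.

```r
# Assumed ordering of the real answers; "" is deliberately left out
lev <- c("not important", "somewhat important", "important", "very important")

# Hypothetical answers for two questions, including blanks
a <- c("", "important", "very important", "not important", "somewhat important")
b <- c("important", "important", "very important", "not important", "")

# Values not listed in levels= (here "") become NA; codes follow the order of lev
to_codes <- function(x) as.numeric(factor(x, levels = lev, ordered = TRUE))
num <- sapply(list(a = a, b = b), to_codes)

# Correlation on the level codes, ignoring rows where either answer is NA
r <- cor(num, use = "pairwise.complete.obs")
r
```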
DPUT1
structure(list(svaID = c(771L, 771L, 775L, 775L, 774L, 776L,
774L, 771L, 771L, 771L, 771L, 774L, 774L, 775L, 765L, 775L, 765L,
775L, 771L, 777L, 775L, 771L, 774L, 776L, 776L), question = structure(c(19L,
12L, 23L, 3L, 10L, 36L, 25L, 1L, 30L, 7L, 21L, 13L, 16L, 32L,
6L, 5L, 18L, 19L, 14L, 2L, 2L, 9L, 37L, 28L, 24L), .Label = c("Q1",
"Q1_1", "Q1_2", "Q1_3", "Q1_4", "Q1_5", "Q1_6", "Q1_7", "Q10",
"Q11", "Q12", "Q13", "Q14", "Q15", "Q16", "Q17", "Q17_1", "Q17_2",
"Q17_3", "Q17_4", "Q17_5", "Q18", "Q19", "Q2", "Q20", "Q3", "Q4",
"Q5", "Q6", "Q6_A_1", "Q6_A_2", "Q6_A_3", "Q6_A_4", "Q6_A_5",
"Q7", "Q8", "Q9"), class = "factor"), answer = structure(c(11L,
29L, 29L, 26L, 29L, 29L, 1L, 1L, 1L, 13L, 11L, 1L, 1L, 1L, 26L,
26L, 11L, 11L, 29L, 13L, 13L, 29L, 29L, 29L, 27L), .Label = c("",
"1", "2", "3", "4", "5", "Change of College/University", "Change of Field of Study",
"Confirmed Field of Study", "did not meet expectations", "exceeded expectations",
"Family/Friend", "important", "Live Locally", "LLNL Contact",
"LLNL Housing page", "Local Newspaper", "met expectations", "no",
"None", "Not at All", "not important", "Pursue an Advanced Degree",
"Somewhat", "somewhat important", "very important", "Very Much",
"Web", "yes"), class = "factor")), .Names = c("svaID", "question",
"answer"), row.names = c(68L, 62L, 147L, 113L, 97L, 168L, 111L,
45L, 51L, 43L, 70L, 100L, 108L, 127L, 5L, 115L, 30L, 142L, 64L,
186L, 112L, 59L, 95L, 160L, 157L), class = "data.frame")
DPUT2
structure(list(svaID = c(765L, 771L, 774L, 775L, 776L, 777L,
778L, 779L, 782L, 783L, 786L, 788L, 789L, 790L, 791L, 793L, 794L,
795L, 797L, 801L, 803L, 804L, 805L, 807L, 808L), Q1_1 = structure(c(5L,
5L, 5L, 2L, 5L, 2L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 5L,
5L, 5L, 5L, 5L, 5L, 2L, 2L, 2L), .Label = c("", "important",
"not important", "somewhat important", "very important"), class = "factor"),
Q1_2 = structure(c(2L, 5L, 2L, 5L, 2L, 4L, 3L, 5L, 4L, 2L,
2L, 5L, 2L, 3L, 5L, 2L, 2L, 5L, 5L, 5L, 5L, 2L, 1L, 2L, 3L
), .Label = c("", "important", "not important", "somewhat important",
"very important"), class = "factor"), Q1_3 = structure(c(4L,
4L, 4L, 4L, 4L, 1L, 1L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 1L,
4L, 4L, 1L, 4L, 4L, 4L, 4L, 1L, 4L), .Label = c("important",
"not important", "somewhat important", "very important"), class = "factor"),
Q1_4 = structure(c(5L, 5L, 5L, 5L, 5L, 2L, 2L, 5L, 2L, 2L,
5L, 5L, 5L, 5L, 5L, 2L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L
), .Label = c("", "important", "not important", "somewhat important",
"very important"), class = "factor"), Q1_5 = structure(c(5L,
3L, 5L, 5L, 3L, 2L, 2L, 3L, 2L, 3L, 5L, 5L, 5L, 4L, 4L, 5L,
5L, 5L, 3L, 3L, 3L, 5L, 2L, 2L, 4L), .Label = c("", "important",
"not important", "somewhat important", "very important"), class = "factor"),
Q1_6 = structure(c(5L, 2L, 2L, 2L, 5L, 2L, 4L, 5L, 4L, 5L,
5L, 5L, 5L, 5L, 5L, 2L, 5L, 2L, 4L, 2L, 4L, 5L, 2L, 4L, 4L
), .Label = c("", "important", "not important", "somewhat important",
"very important"), class = "factor"), Q1_7 = structure(c(3L,
2L, 5L, 2L, 2L, 5L, 2L, 5L, 5L, 5L, 2L, 5L, 5L, 2L, 4L, 2L,
5L, 2L, 3L, 5L, 4L, 5L, 2L, 2L, 4L), .Label = c("", "important",
"not important", "somewhat important", "very important"), class = "factor"),
Q2 = structure(c(4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L,
4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 4L
), .Label = c("", "Not at All", "Somewhat", "Very Much"), class = "factor"),
Q3 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("", "yes"), class = "factor"), Q4 = structure(c(4L,
5L, 6L, 4L, 5L, 5L, 5L, 4L, 5L, 4L, 4L, 4L, 4L, 6L, 3L, 5L,
4L, 4L, 5L, 5L, 4L, 4L, 5L, 5L, 4L), .Label = c("", "Change of College/University",
"Change of Field of Study", "Confirmed Field of Study", "None",
"Pursue an Advanced Degree"), class = "factor"), Q5 = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "no",
"yes"), class = "factor"), Q6 = structure(c(3L, 5L, 2L, 2L,
7L, 5L, 7L, 5L, 4L, 5L, 3L, 4L, 4L, 5L, 7L, 5L, 5L, 3L, 4L,
5L, 2L, 3L, 5L, 5L, 4L), .Label = c("", "Family/Friend",
"Live Locally", "LLNL Contact", "LLNL Housing page", "Local Newspaper",
"Web"), class = "factor"), Q6_A_1 = structure(c(1L, 1L, 1L,
6L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2", "3",
"4", "5"), class = "factor"), Q6_A_2 = structure(c(1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "4", "5"), class = "factor"),
Q6_A_3 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "5"), class = "factor"), Q6_A_4 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "5"), class = "factor"),
Q6_A_5 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
3L), .Label = c("", "2", "3", "4", "5"), class = "factor"),
Q8 = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
), .Label = c("", "no", "yes"), class = "factor"), Q9 = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "no",
"yes"), class = "factor"), Q10 = structure(c(3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "no", "yes"), class = "factor"),
Q11 = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L
), .Label = c("", "no", "yes"), class = "factor"), Q12 = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "no",
"yes"), class = "factor"), Q13 = structure(c(3L, 3L, 3L,
3L, 3L, 3L, 1L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("", "no", "yes"), class = "factor"),
Q14 = structure(c(3L, 1L, 1L, 3L, 2L, 3L, 1L, 3L, 3L, 1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 2L
), .Label = c("", "no", "yes"), class = "factor"), Q15 = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "yes"
), class = "factor"), Q16 = structure(c(4L, 4L, 4L, 3L, 3L,
3L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 3L, 4L, 3L,
3L, 3L, 4L, 3L, 4L), .Label = c("", "did not meet expectations",
"exceeded expectations", "met expectations"), class = "factor"),
Q17_1 = structure(c(3L, 4L, 4L, 3L, 3L, 3L, 4L, 4L, 4L, 3L,
3L, 3L, 3L, 4L, 2L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L
), .Label = c("", "did not meet expectations", "exceeded expectations",
"met expectations"), class = "factor"), Q17_2 = structure(c(3L,
4L, 4L, 3L, 3L, 3L, 4L, 4L, 4L, 3L, 4L, 3L, 3L, 3L, 4L, 4L,
4L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 4L), .Label = c("", "did not meet expectations",
"exceeded expectations", "met expectations"), class = "factor"),
Q17_3 = structure(c(3L, 3L, 4L, 3L, 3L, 4L, 4L, 3L, 4L, 4L,
4L, 4L, 3L, 4L, 4L, 4L, 4L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 4L
), .Label = c("", "did not meet expectations", "exceeded expectations",
"met expectations"), class = "factor"), Q17_4 = structure(c(4L,
4L, 4L, 3L, 2L, 3L, 4L, 3L, 3L, 3L, 4L, 4L, 3L, 4L, 3L, 4L,
4L, 3L, 2L, 4L, 3L, 3L, 3L, 3L, 4L), .Label = c("", "did not meet expectations",
"exceeded expectations", "met expectations"), class = "factor"),
Q17_5 = structure(c(3L, 3L, 4L, 3L, 4L, 4L, 4L, 3L, 4L, 4L,
4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 3L, 4L, 4L, 3L
), .Label = c("", "did not meet expectations", "exceeded expectations",
"met expectations"), class = "factor"), Q18 = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L), .Label = c("", "no",
"yes"), class = "factor"), Q19 = structure(c(3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "no", "yes"), class = "factor")), .Names = c("svaID",
"Q1_1", "Q1_2", "Q1_3", "Q1_4", "Q1_5", "Q1_6", "Q1_7", "Q2",
"Q3", "Q4", "Q5", "Q6", "Q6_A_1", "Q6_A_2", "Q6_A_3", "Q6_A_4",
"Q6_A_5", "Q8", "Q9", "Q10", "Q11", "Q12", "Q13", "Q14", "Q15",
"Q16", "Q17_1", "Q17_2", "Q17_3", "Q17_4", "Q17_5", "Q18", "Q19"
), row.names = c(NA, 25L), class = "data.frame")
Thanks.
Dan
-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com]
Sent: Thursday, February 21, 2013 8:33 AM
To: Mark Lamias; Lopez, Dan; R help (r-help at r-project.org)
Subject: RE: [R] Having trouble converting a dataframe of character vectors to factors
> scs2<-data.frame(lapply(scs2, factor))
Calling data.frame() on the output of lapply() can result in changing column names and will drop attributes that the input data.frame may have had. I prefer to modify the original data.frame instead of making a new one from scratch to avoid these problems.
Also, calling factor() on a factor will drop any unused levels, which you may not want to do. Calling as.factor will not.
Compare the following three methods
f1 <- function (dataFrame) {
dataFrame[] <- lapply(dataFrame, factor)
dataFrame
}
f2 <- function (dataFrame) {
dataFrame[] <- lapply(dataFrame, as.factor)
dataFrame
}
f3 <- function (dataFrame) {
data.frame(lapply(dataFrame, factor))
}
on the following data.frame
x <- data.frame(stringsAsFactors=FALSE, check.names=FALSE,
"No/Yes" = factor(c("Yes","Yes","Yes"), levels=c("No","Yes")),
"Size" = ordered(c("Small","Large","Medium"), levels=c("Small","Medium","Large")),
"Name" = c("Adam","Bill","Chuck"))
attr(x, "Date") <- as.POSIXlt("2013-02-21")
> str(x)
'data.frame': 3 obs. of 3 variables:
$ No/Yes: Factor w/ 2 levels "No","Yes": 2 2 2
$ Size : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
$ Name : chr "Adam" "Bill" "Chuck"
- attr(*, "Date")= POSIXlt, format: "2013-02-21"
> str(f1(x)) # drops unused levels
'data.frame': 3 obs. of 3 variables:
$ No/Yes: Factor w/ 1 level "Yes": 1 1 1
$ Size : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
$ Name : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
- attr(*, "Date")= POSIXlt, format: "2013-02-21"
> str(f2(x))
'data.frame': 3 obs. of 3 variables:
$ No/Yes: Factor w/ 2 levels "No","Yes": 2 2 2
$ Size : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
$ Name : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
- attr(*, "Date")= POSIXlt, format: "2013-02-21"
> str(f3(x)) # mangles column names, drops unused levels, drops Date attribute
'data.frame': 3 obs. of 3 variables:
$ No.Yes: Factor w/ 1 level "Yes": 1 1 1
$ Size : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
$ Name : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
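For the original question quoted further down (why sapply() did not produce factors), a small sketch with a made-up data frame d: sapply() simplifies equal-length results into a matrix, which cannot hold factors, while lapply() returns a list whose elements can be assigned back column by column.

```r
# Hypothetical two-column character data frame
d <- data.frame(a = c("x", "y", "x"), b = c("u", "u", "v"),
                stringsAsFactors = FALSE)

# sapply() simplifies equal-length results to a matrix, losing the factor class
m <- sapply(d, as.factor)
is.matrix(m)          # TRUE

# lapply() keeps a list of factors; d[] <- preserves the data.frame and its names
d[] <- lapply(d, as.factor)
sapply(d, is.factor)  # TRUE for both columns
```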
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Mark Lamias
> Sent: Wednesday, February 20, 2013 6:51 PM
> To: Daniel Lopez; R help (r-help at r-project.org)
> Subject: Re: [R] Having trouble converting a dataframe of character
> vectors to factors
>
> How about this?
>
> scs2<-data.frame(lapply(scs2, factor))
>
>
>
>
> ________________________________
> From: "Lopez, Dan" <lopez235 at llnl.gov>
> To: "R help (r-help at r-project.org)" <r-help at r-project.org>
> Sent: Wednesday, February 20, 2013 7:09 PM
> Subject: [R] Having trouble converting a dataframe of character
> vectors to factors
>
> R Experts,
>
> I have a dataframe made up of character vectors--these are results
> from survey questions. I need to convert them to factors.
>
> I tried the following which did not work:
> scs2<-sapply(scs2,as.factor)
> also this didn't work:
> scs2<-sapply(scs2,function(x) as.factor(x))
>
> After doing either of above I end up with
> >str(scs2)
> chr [1:10, 1:10] "very important" "very important" "very important" "very important" ...
> - attr(*, "dimnames")=List of 2
> ..$ : NULL
> ..$ : chr [1:10] "Q1_1" "Q1_2" "Q1_3" "Q1_4" ...
>
> >class(scs2)
> "matrix"
>
> But when I do it one at a time it works:
> scs2$Q1_1<-as.factor(scs2$Q1_1)
> scs2$Q1_2<- as.factor(scs2$Q1_2)
>
> What am I doing wrong? How do I accomplish this with sapply or similar function?
>
> Data for reproducibility:
>
>
> scs2<-structure(list(Q1_1 = c("very important", "very important", "very important",
> "very important", "very important", "very important", "very important",
> "somewhat important", "important", "very important"), Q1_2 = c("important",
> "somewhat important", "very important", "important", "important",
> "very important", "somewhat important", "somewhat important",
> "very important", "very important"), Q1_3 = c("very important",
> "important", "very important", "very important", "important",
> "very important", "very important", "somewhat important", "not important",
> "important"), Q1_4 = c("very important", "important", "very important",
> "very important", "important", "important", "important", "very important",
> "somewhat important", "important"), Q1_5 = c("very important",
> "not important", "important", "very important", "not important",
> "important", "somewhat important", "important", "somewhat important",
> "not important"), Q1_6 = c("very important", "not important",
> "important", "very important", "somewhat important", "very important",
> "very important", "very important", "important", "important"),
> Q1_7 = c("very important", "somewhat important", "important",
> "somewhat important", "important", "important", "very important",
> "very important", "somewhat important", "not important"),
> Q2 = c("Somewhat", "Very Much", "Somewhat", "Very Much",
> "Very Much", "Very Much", "Very Much", "Very Much", "Very Much",
> "Very Much"), Q3 = c("yes", "yes", "yes", "yes", "yes", "yes",
> "yes", "yes", "yes", "yes"), Q4 = c("None", "None", "None",
> "None", "Confirmed Field of Study", "Confirmed Field of Study",
> "Confirmed Field of Study", "None", "None", "None")), .Names = c("Q1_1",
> "Q1_2", "Q1_3", "Q1_4", "Q1_5", "Q1_6", "Q1_7", "Q2", "Q3", "Q4"
> ), row.names = c(78L, 46L, 80L, 196L, 188L, 197L, 39L, 195L,
> 172L, 110L), class = "data.frame")
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.