[R] using ifelse to remove NA's from specific columns of a data frame containing strings and numbers
William Dunlap
wdunlap at tibco.com
Thu Nov 15 17:35:52 CET 2012
Replace your NA's column by column, not all at once.
In your first example, of the form
ifelse(condition, numbers, data.frame)
the second and third arguments are replicated to the length
of the first. A data.frame's length is the number of columns
it has, so ifelse recycles whole columns rather than individual
values, which is not what you want.
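A small sketch of that point, using a made-up two-column data frame (the names x and y are only for illustration, they are not from your data):

```r
# Hypothetical example data, not the poster's.
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA),
                 stringsAsFactors = FALSE)

length(df)          # 2: the number of columns
length(is.na(df))   # 6: is.na() on a data.frame gives a 3 x 2 logical matrix

# The 'no' argument (df) is recycled as a list of columns to length 6,
# so positions of the result get filled with whole columns.
res <- ifelse(is.na(df), 0, df)
is.list(res)        # the result is a list, not a cleaned-up data.frame
```

Inspecting str(res) should show entries that are entire columns rather than single cells.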
Also, the 2nd and 3rd arguments to ifelse should be of the same
type, since the output will be a vector that accepts some values
from each. If they don't have the same type the output will be
of some type that can accept values from both types. That type
is often character or list, which is not what you want.
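To illustrate the type issue on a plain vector (x here is just an invented example), mixing a string and numbers in the 2nd and 3rd arguments gives a character result:

```r
x <- c(1, NA, 3)

mixed <- ifelse(is.na(x), "missing", x)
mixed            # "1" "missing" "3"
class(mixed)     # "character": the numbers were converted to strings

same <- ifelse(is.na(x), 0, x)
same             # 1 0 3
class(same)      # "numeric": both arguments were numeric
```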
Your second example used unlist(data.frame). A data.frame can
contain columns of various classes, but unlist(data.frame) creates
a vector with a single class, chosen to retain the information,
if not the format, of the columns in the data.frame. That is
generally not useful unless all columns have the same class.
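For instance, with a small invented data frame holding one numeric and one character column, unlist() falls back to character, while an all-numeric data frame stays numeric:

```r
# Hypothetical example data.
mixed_df <- data.frame(n = c(1, 2), s = c("a", "b"),
                       stringsAsFactors = FALSE)
u <- unlist(mixed_df)
u            # named character vector: "1" "2" "a" "b"
class(u)     # "character"

num_df <- data.frame(a = c(1, 2), b = c(3, 4))
class(unlist(num_df))   # "numeric"
```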
You showed some code but no data, so I'll make up something like
what you described:
df <- data.frame(stringsAsFactors=FALSE,
    Number1 = c(1, 2, 3, NA, 5, 6),
    Number2 = c(11, 12, 13, 14, 14, NA),
    String = c("one","two",NA,"four","five","six"),
    Factor = factor(c("Group A", NA, "Group A", "Group B", "Group B", "Group B")))
Look at its structure with
> str(df)
'data.frame': 6 obs. of 4 variables:
$ Number1: num 1 2 3 NA 5 6
$ Number2: num 11 12 13 14 14 NA
$ String : chr "one" "two" NA "four" ...
$ Factor : Factor w/ 2 levels "Group A","Group B": 1 NA 1 2 2 2
To do the sort of conversion you want, try something like
f <- function(d) {
    for(i in seq_along(d)) {
        di <- d[[i]]
        di[is.na(di)] <- if (is.numeric(di)) { # could use switch instead of if-then-else
            if (i==2) { 0 } else { 1 }
        } else if (is.factor(di)) {
            levels(di)[1] # I don't know what you want here
        } else if (is.character(di)) {
            "Unknown"
        }
        d[[i]] <- di
    }
    d
}
That would give you
> str(f(df))
'data.frame': 6 obs. of 4 variables:
$ Number1: num 1 2 3 1 5 6
$ Number2: num 11 12 13 14 14 0
$ String : chr "one" "two" "Unknown" "four" ...
$ Factor : Factor w/ 2 levels "Group A","Group B": 1 1 1 2 2 2
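If you would rather not depend on column positions, a variant of the same column-by-column idea is to look up the replacement by column name. This is only a sketch: fill_na and its replacements argument are names I made up, not anything in base R.

```r
# Hypothetical helper: replace NA's in the named columns only.
# 'replacements' is a named list, one replacement value per column.
fill_na <- function(d, replacements) {
    for (nm in names(replacements)) {
        col <- d[[nm]]
        col[is.na(col)] <- replacements[[nm]]
        d[[nm]] <- col
    }
    d
}

df <- data.frame(stringsAsFactors = FALSE,
    Number1 = c(1, 2, 3, NA, 5, 6),
    Number2 = c(11, 12, 13, 14, 14, NA),
    String = c("one", "two", NA, "four", "five", "six"))

clean <- fill_na(df, list(Number1 = 1, Number2 = 0, String = "Unknown"))
str(clean)
```

Columns not mentioned in the list are left alone. Give each column a replacement of its own type, so that no column changes class.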
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of David Romano
> Sent: Thursday, November 15, 2012 7:58 AM
> To: Bert Gunter
> Cc: r-help at r-project.org
> Subject: Re: [R] using ifelse to remove NA's from specific columns of a data frame
> containing strings and numbers
>
> Thanks for the suggestion, Bert; I just re-read the introduction with
> particular attention to the sections you mentioned, but I don't see how any
> of it bears on my question. Namely -- to rephrase: What constraints are
> there on the form of the "yes" and "no" values required by ifelse? The
> introduction doesn't really speak to this, and the help documentation seems
> to suggest that as long the shapes of the test, "yes" values, and "no"
> values agree, that would be sufficient -- I don't see anything that
> specifies that any of these should be of a particular data type. My
> example, however, seems to indicate that the "yes" and "no" values can't be
> a mixture of characters and numbers, and I'm trying to figure out what the
> underlying constraints are on ifelse.
>
> Thanks again,
> David
>
> On Thu, Nov 15, 2012 at 6:46 AM, Bert Gunter <gunter.berton at gene.com> wrote:
>
> > David:
> >
> > You seem to be getting lost in basic R tasks. Have you read the Intro
> > to R tutorial? If not, do so, as this should tell you how to do what
> > you need. If so, re-read the sections on indexing ("["), replacement,
> > and NA's. Also read about character vectors and factors.
> >
> > -- Bert
> >
> > On Thu, Nov 15, 2012 at 3:19 AM, David Romano <dromano at stanford.edu>
> > wrote:
> > > Hi everyone,
> > >
> > > I have a data frame one of whose columns is a character vector and the rest
> > > are numeric, and in debugging a script, I noticed that an ifelse call seems
> > > to be coercing the character column to a numeric column, and producing
> > > unintended values as a result. Roughly, here's what I tried to do:
> > >
> > > df: a data frame with, say, the first column as a character column and the
> > > second and third columns numeric.
> > >
> > > also: NA's occur only in the numeric columns, and if they occur in one,
> > > they occur in the other as well.
> > >
> > > I wanted to replace the NA's in column 2 with 0's and the ones in column 3
> > > with 1's, so first I did this:
> > >
> > >> na.replacements <-ifelse(col(df)==2,0,1).
> > >
> > > Then I used a second ifelse call to try to remove the NA's as I wanted,
> > > first by doing this:
> > >
> > >> clean.df <- ifelse(is.na(df), na.replacements, df),
> > >
> > > which produced a list of lists vaguely resembling df, with the NA's mostly
> > > intact, and so then I tried this:
> > >
> > >> clean.df <- ifelse(is.na(df), na.replacements, unlist(df)),
> > >
> > > which seems to work if all the columns are numeric, but otherwise changes
> > > strings to numbers.
> > >
> > > I can't make sense of the help documentation enough to clear this up, but
> > > my guess is that the "yes" and "no" values passed to ifelse need to be
> > > vectors, in which case it seems I'll have to use another approach entirely,
> > > but even if is not the case and lists are acceptable, I'm not sure how to
> > > convert a mixed-mode data frame into a vector-like list of elements (which
> > > I would hope would work).
> > >
> > > I'd be grateful for any suggestions!
> > >
> > > Thanks,
> > > David Romano
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> >
> >
> > --
> >
> > Bert Gunter
> > Genentech Nonclinical Biostatistics
> >
> > Internal Contact Info:
> > Phone: 467-7374
> > Website:
> >
> > http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
> >