[R] Subscripting problem with is.na()

G.Maubach at gmx.de G.Maubach at gmx.de
Fri Jun 24 09:14:35 CEST 2016


Hi Bert,

many thanks for all your help and your comments. I learn a lot this way.

My question was about is.na() at first sight, but the actual task looks like this:

I have two variables in my customer data that signal whether the customer account was closed by master data management or by sales. Say these variables are closed_mdm and closed_sls. They contain NA if the customer account is still open, or a closing code from "01" to "08" giving the reason if the account was closed.

For my analysis I need a variable that combines closed_mdm and closed_sls, so that I can easily filter on the accounts that are closed, no matter what the reason was or who closed the account.

As I always encounter problems when dealing with ifelse statements and NA, I decided to merge these two variables into one variable containing 0 = not closed and 1 = closed. In my context this seems to be, at least to me, a reasonable approach.

Replacement of missing values and merging the variables is the easiest way for me.

-- cut --

cust_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
closed_mdm <- c("01", NA, NA, NA, "08", "07", NA, NA, "05", NA, NA, NA, "04", NA, NA, NA, NA, NA, NA, NA)
closed_sls <- c(NA, "08", NA, NA, "08", "07", NA, NA, NA, NA, "03", NA, NA, NA, "05", NA, NA, NA, NA, NA)

# 1st try
ds_temp1 <- data.frame(cust_id, closed_mdm, closed_sls)
ds_temp1

ds_temp1$closed <- closed_mdm | closed_sls  # WRONG

# 2nd try
closed_mdm_fac1 <- as.factor(closed_mdm)
closed_sls_fac1 <- as.factor(closed_sls)

ds_temp2 <- data.frame(cust_id, closed_mdm_fac1, closed_sls_fac1)
ds_temp2

ds_temp2$closed <- ds_temp2$closed_mdm_fac1 | ds_temp2$closed_sls_fac1  # WRONG

# 3rd try
closed_mdm_num1 <- as.numeric(closed_mdm)  # OK
closed_sls_num1 <- as.numeric(closed_sls)  # OK

ds_temp3 <- data.frame(cust_id, closed_mdm_num1, closed_sls_num1)
ds_temp3

ds_temp3$closed <- ds_temp3$closed_mdm_num1 | ds_temp3$closed_sls_num1  # WRONG

# 4th try
ds_temp4 <- ds_temp3
ds_temp4

# Does not run due to not allowed NA in subscripts
ds_temp4[is.na(ds_temp4$closed_mdm_num1), ds_temp4$closed_mdm_num1] <- 0
ds_temp4[is.na(ds_temp4$closed_sls_num1), ds_temp4$closed_sls_num1] <- 0

# 5th try
# Recode to 1 = closed (a closing code is present), 0 = not closed (NA)
ds_temp4$closed_mdm_num1 <- ifelse(is.na(ds_temp4$closed_mdm_num1), 0, 1)
ds_temp4$closed_sls_num1 <- ifelse(is.na(ds_temp4$closed_sls_num1), 0, 1)
ds_temp4

ds_temp4$closed <- ifelse(ds_temp4$closed_mdm_num1 == 1 | ds_temp4$closed_sls_num1 == 1, 1, 0)
ds_temp4

-- cut --
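
A possible shortcut, sketched under the assumption that only the open/closed status matters and not the closing code itself: is.na() always returns TRUE or FALSE, so the flag can be built directly from the two original columns without recoding them first. The name ds and the use of stringsAsFactors = FALSE are choices made for this sketch only.

```r
# Customer data as in the example above (closing codes are character strings).
closed_mdm <- c("01", NA, NA, NA, "08", "07", NA, NA, "05", NA,
                NA, NA, "04", NA, NA, NA, NA, NA, NA, NA)
closed_sls <- c(NA, "08", NA, NA, "08", "07", NA, NA, NA, NA,
                "03", NA, NA, NA, "05", NA, NA, NA, NA, NA)
ds <- data.frame(cust_id = 1:20, closed_mdm, closed_sls,
                 stringsAsFactors = FALSE)

# An account counts as closed if either column holds a closing code.
# is.na() never returns NA itself, so `|` cannot propagate missing values here.
ds$closed <- as.integer(!is.na(ds$closed_mdm) | !is.na(ds$closed_sls))

# Filter on the new flag; the original codes are left untouched.
closed_accounts <- ds[ds$closed == 1, ]
closed_accounts
```

Because closed_mdm and closed_sls keep their NAs, the closing reason stays available for later analysis.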

Is there a better way to do it?

Kind regards

Georg


> Sent: Thursday, 23 June 2016 at 23:55
> From: "Bert Gunter" <bgunter.4567 at gmail.com>
> To: "David L Carlson" <dcarlson at tamu.edu>
> Cc: "R Help" <r-help at r-project.org>
> Subject: Re: [R] Subscripting problem with is.na()
>
> ... actually, FWIW, I would say that this little discussion mostly
> demonstrates why the OP's request is probably not a good idea in the
> first place. Usually, NA's should be left as NA's to be dealt with
> properly by R and packages. In biological measurements, for example,
> NA's often mean "below the ability to reliably measure." Biologists
> with whom I've worked over many years often want to convert these to 0
> or omit the cases, both of which lead to biased estimates and/or
> underestimates of variability and excess claims of "statistical
> significance" (for those who belong to this religious persuasion). One
> should never say never, but I suspect that there are relatively few
> circumstances where the conversion the OP requested is actually wise.
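
A small numeric illustration of how the two treatments of NA diverge, using made-up values (the vector x is hypothetical and not taken from the thread). It only shows that recoding NA to 0 changes the summary; which treatment is appropriate depends on what NA means in the data.

```r
# Hypothetical measurements; NA stands for a value that could not be measured.
x <- c(2, 4, NA, 6)

mean(x)                 # NA: R refuses to guess
mean(x, na.rm = TRUE)   # 4: the missing case is omitted

# Recode NA to 0 and summarise again.
x0 <- replace(x, is.na(x), 0)
mean(x0)                # 3: the artificial zero pulls the mean down
```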
> 
> Feel free to ignore/reject such extraneous comments of course.
> 
> Cheers,
> Bert
> 
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> 
> On Thu, Jun 23, 2016 at 12:14 PM, David L Carlson <dcarlson at tamu.edu> wrote:
> > Good point. I did not think about factors. Also your example raises another issue since column c is logical, but gets silently converted to numeric. This would seem to get the job done assuming the conversion is intended for numeric columns only:
> >
> >> test <- data.frame(a=c(1,NA,2), b = c("A","b",NA), c= rep(NA,3))
> >> sapply(test, class)
> >         a         b         c
> > "numeric"  "factor" "logical"
> >> num <- sapply(test, is.numeric)
> >> test[, num][is.na(test[, num])] <- 0
> >> test
> >   a    b  c
> > 1 1    A NA
> > 2 0    b NA
> > 3 2 <NA> NA
> >
> > David C
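
An alternative sketch of the same numeric-only replacement, written column by column. It assumes the same test data frame; using lapply() with replace() is a choice made for this sketch and not the code posted above. Indexing with test[num] keeps a data frame even when only one column is numeric.

```r
test <- data.frame(a = c(1, NA, 2), b = c("A", "b", NA), c = rep(NA, 3))

# One logical flag per column: TRUE only for numeric columns.
num <- vapply(test, is.numeric, logical(1))

# Replace NA within each numeric column; other columns are not touched.
test[num] <- lapply(test[num], function(x) replace(x, is.na(x), 0))
test
```

Column b keeps its NA and column c stays logical, so no silent type conversion takes place.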
> >
> > -----Original Message-----
> > From: Bert Gunter [mailto:bgunter.4567 at gmail.com]
> > Sent: Thursday, June 23, 2016 1:48 PM
> > To: David L Carlson
> > Cc: Ivan Calandra; R Help
> > Subject: Re: [R] Subscripting problem with is.na()
> >
> > Not in general, David:
> >
> > e.g.
> >
> >> test <- data.frame(a=c(1,NA,2), b = c("A","b",NA), c= rep(NA,3))
> >
> >> is.na(test)
> >          a     b    c
> > [1,] FALSE FALSE TRUE
> > [2,]  TRUE FALSE TRUE
> > [3,] FALSE  TRUE TRUE
> >
> >> test[is.na(test)]
> > [1] NA NA NA NA NA
> >
> >> test[is.na(test)] <- 0
> > Warning message:
> > In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
> >   invalid factor level, NA generated
> >
> >> test
> >   a    b c
> > 1 1    A 0
> > 2 0    b 0
> > 3 2 <NA> 0
> >
> >
> > The problem is the default conversion to factors and the replacement
> > operation for factors. So:
> >
> >> test <- data.frame(a=c(1,NA,2), b = I(c("A","b",NA_character_)), c= rep(NA,3))
> >> class(test$b)
> > [1] "AsIs"  ## so NOT a factor
> >
> >> test[is.na(test)] <- 0 # now works as you describe
> >> test
> >   a b c
> > 1 1 A 0
> > 2 0 b 0
> > 3 2 0 0
> >
> > Of course the OP (and you) probably had a data frame of all numerics
> > in mind, so the problem doesn't arise. But I think one needs to make
> > the distinction and issue clear.
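
Another way to avoid the factor conversion, sketched here as an alternative to I(): pass stringsAsFactors = FALSE to data.frame(). This argument was needed in R versions before 4.0.0, where TRUE was the default; from 4.0.0 on it is the default and can be omitted.

```r
# Keep column b as a plain character vector instead of a factor.
test <- data.frame(a = c(1, NA, 2), b = c("A", "b", NA), c = rep(NA, 3),
                   stringsAsFactors = FALSE)

# Logical-matrix replacement now works for every column, without a warning.
test[is.na(test)] <- 0

# Note the resulting types: b holds the string "0", c has become numeric.
sapply(test, class)
```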
> >
> > Cheers,
> > Bert
> >
> >
> > On Thu, Jun 23, 2016 at 8:46 AM, David L Carlson <dcarlson at tamu.edu> wrote:
> >> The function is.na() returns a matrix when applied to a data.frame so you can easily convert all the NAs to 0's:
> >>
> >>> ds_test
> >>    var1 var2
> >> 1     1    1
> >> 2     2    2
> >> 3     3    3
> >> 4    NA   NA
> >> 5     5    5
> >> 6     6    6
> >> 7     7    7
> >> 8    NA   NA
> >> 9     9    9
> >> 10   10   10
> >>> is.na(ds_test)
> >>        var1  var2
> >>  [1,] FALSE FALSE
> >>  [2,] FALSE FALSE
> >>  [3,] FALSE FALSE
> >>  [4,]  TRUE  TRUE
> >>  [5,] FALSE FALSE
> >>  [6,] FALSE FALSE
> >>  [7,] FALSE FALSE
> >>  [8,]  TRUE  TRUE
> >>  [9,] FALSE FALSE
> >> [10,] FALSE FALSE
> >>> ds_test[is.na(ds_test)] <- 0
> >>> ds_test
> >>    var1 var2
> >> 1     1    1
> >> 2     2    2
> >> 3     3    3
> >> 4     0    0
> >> 5     5    5
> >> 6     6    6
> >> 7     7    7
> >> 8     0    0
> >> 9     9    9
> >> 10   10   10
> >>
> >> -------------------------------------
> >> David L Carlson
> >> Department of Anthropology
> >> Texas A&M University
> >> College Station, TX 77840-4352
> >>
> >> -----Original Message-----
> >> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Ivan Calandra
> >> Sent: Thursday, June 23, 2016 10:14 AM
> >> To: R Help
> >> Subject: Re: [R] Subscripting problem with is.na()
> >>
> >> Thank you Bert for this clarification. It is indeed an important point.
> >>
> >> Ivan
> >>
> >> --
> >> Ivan Calandra, PhD
> >> Scientific Mediator
> >> University of Reims Champagne-Ardenne
> >> GEGENAA - EA 3795
> >> CREA - 2 esplanade Roland Garros
> >> 51100 Reims, France
> >> +33(0)3 26 77 36 89
> >> ivan.calandra at univ-reims.fr
> >> --
> >> https://www.researchgate.net/profile/Ivan_Calandra
> >> https://publons.com/author/705639/
> >>
> >> On 23/06/2016 at 17:06, Bert Gunter wrote:
> >>> Sorry, Ivan, your statement is incorrect:
> >>>
> >>> "When you use a single bracket on a list with only one argument in
> >>> between, then R extracts "elements", i.e. columns in the case of a
> >>> data.frame. This explains your errors. "
> >>>
> >>> e.g.
> >>>
> >>>> ex <- data.frame(a = 1:3, b = letters[1:3])
> >>>> a <- 1:3
> >>>> identical(ex[1], a)
> >>> [1] FALSE
> >>>
> >>>> class(ex[1])
> >>> [1] "data.frame"
> >>>> class(a)
> >>> [1] "integer"
> >>>
> >>> Compare:
> >>>
> >>>> identical(ex[[1]], a)
> >>> [1] TRUE
> >>>
> >>> Why? Single bracket extraction on a list results in a list; double
> >>> bracket extraction results in the element of the list ( a "column" in
> >>> the case of a data frame, which is a specific kind of list). The
> >>> relevant sections of ?Extract are:
> >>>
> >>> "Indexing by [ is similar to atomic vectors and selects a **list** of
> >>> the specified element(s).
> >>>
> >>> Both [[ and $ select a **single element of the list**. "
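
The same distinction can be checked on a plain list as well as on a data frame. This is a minimal sketch; the list l is a hypothetical example that does not appear in the thread.

```r
ex <- data.frame(a = 1:3, b = letters[1:3])

class(ex[1])     # "data.frame": a one-column data frame, which is still a list
class(ex[[1]])   # "integer": the column itself
class(ex$a)      # "integer": same element as ex[["a"]]

# The same rule on a plain list:
l <- list(x = 1:3, y = "text")
class(l["x"])    # "list": a list of length one
class(l[["x"]])  # "integer": the element itself
```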
> >>>
> >>>
> >>> Hope this clarifies this often-confused issue.
> >>>
> >>>
> >>> Cheers,
> >>> Bert
> >>>
> >>> On Thu, Jun 23, 2016 at 7:34 AM, Ivan Calandra
> >>> <ivan.calandra at univ-reims.fr> wrote:
> >>>> My statement "Using a single bracket '[' on a data.frame does the same as
> >>>> for matrices: you need to specify rows and columns" was not correct.
> >>>>
> >>>>
> >>>> When you use a single bracket on a list with only one argument in between,
> >>>> then R extracts "elements", i.e. columns in the case of a data.frame. This
> >>>> explains your errors.
> >>>>
> >>>> But it is possible to use a single bracket on a data.frame with 2 arguments
> >>>> (rows, columns) separated by a comma, as with matrices. This is the solution
> >>>> you received.
> >>>>
> >>>> Ivan
> >>>>
> >>>>
> >>>> On 23/06/2016 at 16:27, Ivan Calandra wrote:
> >>>>> Dear Georg,
> >>>>>
> >>>>> You need to learn a bit more about the subsetting methods, depending on
> >>>>> the object structure you're trying to subset.
> >>>>>
> >>>>> More specifically, when you run this: ds_test[is.na(ds_test$var1)]
> >>>>> you get this error: "Error in `[.data.frame`(ds_test, is.na(ds_test$var1))
> >>>>> : undefined columns selected"
> >>>>>
> >>>>> This means that R does not understand which column you're trying to
> >>>>> select. But you're actually trying to select rows.
> >>>>>
> >>>>> Using a single bracket '[' on a data.frame does the same as for matrices:
> >>>>> you need to specify rows and columns, like this:
> >>>>> ds_test[is.na(ds_test$var1), ] ## notice the last comma
> >>>>> ds_test[is.na(ds_test$var1), ] <- 0 ## works on all columns because you
> >>>>> didn't specify any after the comma
> >>>>>
> >>>>> If you want it only for "var1", then you need to specify the column:
> >>>>> ds_test[is.na(ds_test$var1), "var1"] <- 0
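
Put together as a self-contained sketch, using the ds_test data from the original question. The second replacement, which works on the column vector directly, is an equivalent form added here for comparison.

```r
var1 <- c(1:3, NA, 5:7, NA, 9:10)
var2 <- c(1:3, NA, 5:7, NA, 9:10)
ds_test <- data.frame(var1, var2)

# Rows where var1 is missing, column "var1" only.
ds_test[is.na(ds_test$var1), "var1"] <- 0

# Equivalent form: extract the column with $ and index it like any vector.
ds_test$var2[is.na(ds_test$var2)] <- 0

ds_test
```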
> >>>>>
> >>>>> It's the same problem with your 2nd and 4th tries (4th one has other
> >>>>> problems). Your 3rd try does not change ds_test at all.
> >>>>>
> >>>>> HTH,
> >>>>> Ivan
> >>>>>
> >>>>> On 23/06/2016 at 15:57, G.Maubach at weinwolf.de wrote:
> >>>>>> Hi All,
> >>>>>>
> >>>>>> I would like to recode my NAs to 0. Using a single vector everything is
> >>>>>> fine.
> >>>>>>
> >>>>>> But if I use a data.frame things go wrong:
> >>>>>>
> >>>>>> -- cut --
> >>>>>>
> >>>>>> var1 <- c(1:3, NA, 5:7, NA, 9:10)
> >>>>>> var2 <- c(1:3, NA, 5:7, NA, 9:10)
> >>>>>> ds_test <-
> >>>>>>     data.frame(var1, var2)
> >>>>>>
> >>>>>> test <- var1
> >>>>>> test[is.na(test)] <- 0
> >>>>>> test  # NA recoded OK
> >>>>>>
> >>>>>> # First try
> >>>>>> ds_test[is.na(ds_test$var1)] <- 0  # duplicate subscripts WRONG
> >>>>>>
> >>>>>> # Second try
> >>>>>> ds_test[is.na("var1")] <- 0
> >>>>>> ds_test$var1  # not recoded WRONG
> >>>>>>
> >>>>>> # Third try: to me the most intuitive approach
> >>>>>> is.na(ds_test["var1"]) <- 0  # attempt to select less than one element in integerOneIndex WRONG
> >>>>>>
> >>>>>> # Fourth try
> >>>>>> ds_test[is.na(var1)] <- 0  # duplicate subscripts for columns WRONG
> >>>>>>
> >>>>>> -- cut --
> >>>>>> How can I do it correctly?
> >>>>>>
> >>>>>> Where could I have found something about it?
> >>>>>>
> >>>>>> Kind regards
> >>>>>>
> >>>>>> Georg
> >>>>>>
> >>>>>> ______________________________________________
> >>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>> PLEASE do read the posting guide
> >>>>>> http://www.R-project.org/posting-guide.html
> >>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>>


