[R] Subscripting problem with is.na()
MacQueen, Don
macqueen1 at llnl.gov
Fri Jun 24 17:19:35 CEST 2016
See insert below.
--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
On 6/24/16, 12:14 AM, "R-help on behalf of G.Maubach at gmx.de"
<r-help-bounces at r-project.org on behalf of G.Maubach at gmx.de> wrote:
>Hi Bert,
>
>many thanks for all your help and your comments. I learn a lot this way.
>
>My question was about is.na() at first sight, but the actual task
>looks like this:
>
>I have two variables in my customer data that signal whether the customer
>account was closed by master data management or by sales. Say these
>variables are closed_mdm and closed_sls. They contain NA if the customer
>account is still open, or a closing code from "01" to "08" giving the
>reason if the account was closed.
>
>For my analysis I need a variable that combines the two variables
>closed_mdm and closed_sls so that I can easily filter on the accounts that
>are closed, no matter what the reason was or who closed the account.
Given that description, this would seem to do the job:
cust.id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
             13, 14, 15, 16, 17, 18, 19, 20)
closed.mdm <- c("01", NA, NA, NA, "08", "07", NA, NA, "05",
                NA, NA, NA, "04", NA, NA, NA, NA, NA, NA, NA)
closed.sls <- c(NA, "08", NA, NA, "08", "07", NA, NA, NA, NA,
                "03", NA, NA, NA, "05", NA, NA, NA, NA, NA)
df <- data.frame(cust.id, closed.mdm, closed.sls,
                 stringsAsFactors=FALSE)
df$opcl <- ifelse( is.na(closed.mdm) & is.na(closed.sls) ,
                   'open','closed')
Then use the opcl column to filter, e.g.,
subset(df, opcl=='open')
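If a plain TRUE/FALSE flag or the 0/1 coding mentioned in the quoted
message is preferred over the 'open'/'closed' labels, something along the
following lines might also work. This is only a sketch: the column names
closed and closed01 are made up for illustration, and cust.id is built
with 1:20 as shorthand for the values above.

```r
# Same example data as above (values copied from this post)
closed.mdm <- c("01", NA, NA, NA, "08", "07", NA, NA, "05",
                NA, NA, NA, "04", NA, NA, NA, NA, NA, NA, NA)
closed.sls <- c(NA, "08", NA, NA, "08", "07", NA, NA, NA, NA,
                "03", NA, NA, NA, "05", NA, NA, NA, NA, NA)
df <- data.frame(cust.id = 1:20, closed.mdm, closed.sls,
                 stringsAsFactors = FALSE)

# is.na() never returns NA itself, so this flag is always TRUE or FALSE.
# 'closed' is a hypothetical column name.
df$closed <- !is.na(df$closed.mdm) | !is.na(df$closed.sls)

# 0/1 coding (1 = closed); 'closed01' is also a hypothetical name
df$closed01 <- as.integer(df$closed)

# a logical column can be used directly as a row filter
df[df$closed, ]
table(df$closed)
```

Because the flag is derived from is.na() alone, no comparison against NA
ever takes place, which sidesteps the ifelse/NA problems described in the
quoted message.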
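If you want to operate directly on one of the 'closed' columns, perhaps
these examples will help:
## does not work as intended: rows where closed.sls is NA
## come back as all-NA rows
df[ df$closed.sls == '08',]
## works: subset() drops rows where the condition is NA
subset(df, closed.sls=='08')
## works: the is.na() test turns the NA comparisons into FALSE
df[ !is.na(df$closed.sls) & df$closed.sls == '08',]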
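To see why the NAs get in the way, it may help to look at the logical
index itself. The sketch below rebuilds only the closed.sls column (with
cust.id as 1:20 for brevity) and also shows two other base R idioms,
which() and %in%, that ignore NA in the comparison. They are offered as
alternatives, not as part of the original suggestion.

```r
closed.sls <- c(NA, "08", NA, NA, "08", "07", NA, NA, NA, NA,
                "03", NA, NA, NA, "05", NA, NA, NA, NA, NA)
df <- data.frame(cust.id = 1:20, closed.sls, stringsAsFactors = FALSE)

# comparing NA with anything gives NA, not FALSE
idx <- df$closed.sls == "08"
sum(is.na(idx))      # how many entries of the index are NA

# every NA in a logical row index produces a row filled with NA
nrow(df[idx, ])      # matching rows plus one all-NA row per NA

# which() returns only the positions that are TRUE, dropping the NAs
df[which(idx), ]

# %in% returns FALSE rather than NA when the left-hand value is NA
df[df$closed.sls %in% "08", ]
```

Both which() and %in% give the same rows as the explicit !is.na() test,
so the choice among them is mostly a matter of taste.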
>
>As I always encounter problems when dealing with ifelse statements and NA,
>I decided to merge these two variables into one variable containing 0 = not
>closed and 1 = closed. In my context this seems to be - at least to me -
>a reasonable approach.
>
>Replacement of missing values and merging the variables is the easiest
>way for me.
>
>-- cut --
>
>cust_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
>18, 19, 20)
>closed_mdm <- c("01", NA, NA, NA, "08", "07", NA, NA, "05", NA, NA, NA,
>"04", NA, NA, NA, NA, NA, NA, NA)
>closed_sls <- c(NA, "08", NA, NA, "08", "07", NA, NA, NA, NA, "03", NA,
>NA, NA, "05", NA, NA, NA, NA, NA)
>
># 1st try
>ds_temp1 <- data.frame(cust_id, closed_mdm, closed_sls)
>ds_temp1
>
>ds_temp1$closed <- closed_mdm | closed_sls # WRONG
>
># 2nd try
>closed_mdm_fac1 <- as.factor(closed_mdm)
>closed_sls_fac1 <- as.factor(closed_sls)
>
>ds_temp2 <- data.frame(cust_id, closed_mdm_fac1, closed_sls_fac1)
>ds_temp2
>
>ds_temp2$closed <- ds_temp$closed_mdm_fac1 | ds_temp$closed_sls_fac1 # WRONG
>
># 3rd try
>closed_mdm_num1 <- as.numeric(closed_mdm) # OK
>closed_sls_num1 <- as.numeric(closed_sls) # OK
>
>ds_temp3 <- data.frame(cust_id, closed_mdm_num1, closed_sls_num1)
>ds_temp3
>
>ds_temp3$closed <- ds_temp$closed_mdm_num1 | ds_temp$closed_sls_num1 # WRONG
>
># 4th try
>ds_temp4 <- ds_temp3
>ds_temp4
>
># Does not run due to not allowed NA in subscripts
>ds_temp4[is.na(ds_temp4$closed_mdm_num1), ds_temp4$closed_mdm_num1] <- 0
>ds_temp4[is.na(ds_temp4$closed_sls_num1), ds_temp4$closed_sls_num1] <- 0
>
># 5th try
>ds_temp4$closed_mdm_num1 <- ifelse(is.na(ds_temp4$closed_mdm_num1), 1, 0)
>ds_temp4$closed_sls_num1 <- ifelse(is.na(ds_temp4$closed_sls_num1), 1, 0)
>ds_temp4
>
>ds_temp4$closed <- ifelse(ds_temp4$closed_mdm_num1 == 1 |
>ds_temp4$closed_sls_num1 == 1, 1, 0)
>ds_temp4
>
>-- cut --
>
>Is there a better way to do it?
>
>Kind regards
>
>Georg
>
>
>> Sent: Thursday, 23 June 2016 at 23:55
>> From: "Bert Gunter" <bgunter.4567 at gmail.com>
>> To: "David L Carlson" <dcarlson at tamu.edu>
>> Cc: "R Help" <r-help at r-project.org>
>> Subject: Re: [R] Subscripting problem with is.na()
>>
>> ... actually, FWIW, I would say that this little discussion mostly
>> demonstrates why the OP's request is probably not a good idea in the
>> first place. Usually, NA's should be left as NA's to be dealt with
>> properly by R and packages. In biological measurements, for example,
>> NA's often mean "below the ability to reliably measure." Biologists
>> with whom I've worked over many years often want to convert these to 0
>> or omit the cases, both of which lead to biased estimates and/or
>> underestimates of variability and excess claims of "statistical
>> significance" (for those who belong to this religious persuasion). One
>> should never say never, but I suspect that there are relatively few
>> circumstances where the conversion the OP requested is actually wise.
>>
>> Feel free to ignore/reject such extraneous comments of course.
>>
>> Cheers,
>> Bert
>>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Thu, Jun 23, 2016 at 12:14 PM, David L Carlson <dcarlson at tamu.edu> wrote:
>> > Good point. I did not think about factors. Also your example raises
>> > another issue since column c is logical, but gets silently converted to
>> > numeric. This would seem to get the job done assuming the conversion is
>> > intended for numeric columns only:
>> >
>> >> test <- data.frame(a=c(1,NA,2), b = c("A","b",NA), c= rep(NA,3))
>> >> sapply(test, class)
>> >         a         b         c
>> > "numeric"  "factor" "logical"
>> >> num <- sapply(test, is.numeric)
>> >> test[, num][is.na(test[, num])] <- 0
>> >> test
>> >   a    b  c
>> > 1 1    A NA
>> > 2 0    b NA
>> > 3 2 <NA> NA
>> >
>> > David C
>> >
>> > -----Original Message-----
>> > From: Bert Gunter [mailto:bgunter.4567 at gmail.com]
>> > Sent: Thursday, June 23, 2016 1:48 PM
>> > To: David L Carlson
>> > Cc: Ivan Calandra; R Help
>> > Subject: Re: [R] Subscripting problem with is.na()
>> >
>> > Not in general, David:
>> >
>> > e.g.
>> >
>> >> test <- data.frame(a=c(1,NA,2), b = c("A","b",NA), c= rep(NA,3))
>> >
>> >> is.na(test)
>> >          a     b    c
>> > [1,] FALSE FALSE TRUE
>> > [2,]  TRUE FALSE TRUE
>> > [3,] FALSE  TRUE TRUE
>> >
>> >> test[is.na(test)]
>> > [1] NA NA NA NA NA
>> >
>> >> test[is.na(test)] <- 0
>> > Warning message:
>> > In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
>> >   invalid factor level, NA generated
>> >
>> >> test
>> >   a    b c
>> > 1 1    A 0
>> > 2 0    b 0
>> > 3 2 <NA> 0
>> >
>> >
>> > The problem is the default conversion to factors and the replacement
>> > operation for factors. So:
>> >
>> >> test <- data.frame(a=c(1,NA,2), b = I(c("A","b",NA_character_)),
>> >>                    c = rep(NA,3))
>> >> class(test$b)
>> > [1] "AsIs" ## so NOT a factor
>> >
>> >> test[is.na(test)] <- 0 # now works as you describe
>> >> test
>> > a b c
>> > 1 1 A 0
>> > 2 0 b 0
>> > 3 2 0 0
>> >
>> > Of course the OP (and you) probably had a data frame of all numerics
>> > in mind, so the problem doesn't arise. But I think one needs to make
>> > the distinction and issue clear.
>> >
>> > Cheers,
>> > Bert
>> >
>> >
>> >
>> > On Thu, Jun 23, 2016 at 8:46 AM, David L Carlson <dcarlson at tamu.edu> wrote:
>> >> The function is.na() returns a matrix when applied to a data.frame,
>> >> so you can easily convert all the NAs to 0's:
>> >>
>> >>> ds_test
>> >>    var1 var2
>> >> 1     1    1
>> >> 2     2    2
>> >> 3     3    3
>> >> 4    NA   NA
>> >> 5     5    5
>> >> 6     6    6
>> >> 7     7    7
>> >> 8    NA   NA
>> >> 9     9    9
>> >> 10   10   10
>> >>> is.na(ds_test)
>> >>        var1  var2
>> >>  [1,] FALSE FALSE
>> >>  [2,] FALSE FALSE
>> >>  [3,] FALSE FALSE
>> >>  [4,]  TRUE  TRUE
>> >>  [5,] FALSE FALSE
>> >>  [6,] FALSE FALSE
>> >>  [7,] FALSE FALSE
>> >>  [8,]  TRUE  TRUE
>> >>  [9,] FALSE FALSE
>> >> [10,] FALSE FALSE
>> >>> ds_test[is.na(ds_test)] <- 0
>> >>> ds_test
>> >>    var1 var2
>> >> 1     1    1
>> >> 2     2    2
>> >> 3     3    3
>> >> 4     0    0
>> >> 5     5    5
>> >> 6     6    6
>> >> 7     7    7
>> >> 8     0    0
>> >> 9     9    9
>> >> 10   10   10
>> >>
>> >> -------------------------------------
>> >> David L Carlson
>> >> Department of Anthropology
>> >> Texas A&M University
>> >> College Station, TX 77840-4352
>> >>
>> >> -----Original Message-----
>> >> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Ivan Calandra
>> >> Sent: Thursday, June 23, 2016 10:14 AM
>> >> To: R Help
>> >> Subject: Re: [R] Subscripting problem with is.na()
>> >>
>> >> Thank you Bert for this clarification. It is indeed an important point.
>> >>
>> >> Ivan
>> >>
>> >> --
>> >> Ivan Calandra, PhD
>> >> Scientific Mediator
>> >> University of Reims Champagne-Ardenne
>> >> GEGENAA - EA 3795
>> >> CREA - 2 esplanade Roland Garros
>> >> 51100 Reims, France
>> >> +33(0)3 26 77 36 89
>> >> ivan.calandra at univ-reims.fr
>> >> --
>> >> https://www.researchgate.net/profile/Ivan_Calandra
>> >> https://publons.com/author/705639/
>> >>
>> >> On 23/06/2016 at 17:06, Bert Gunter wrote:
>> >>> Sorry, Ivan, your statement is incorrect:
>> >>>
>> >>> "When you use a single bracket on a list with only one argument in
>> >>> between, then R extracts "elements", i.e. columns in the case of a
>> >>> data.frame. This explains your errors. "
>> >>>
>> >>> e.g.
>> >>>
>> >>>> ex <- data.frame(a = 1:3, b = letters[1:3])
>> >>>> a <- 1:3
>> >>>> identical(ex[1], a)
>> >>> [1] FALSE
>> >>>
>> >>>> class(ex[1])
>> >>> [1] "data.frame"
>> >>>> class(a)
>> >>> [1] "integer"
>> >>>
>> >>> Compare:
>> >>>
>> >>>> identical(ex[[1]], a)
>> >>> [1] TRUE
>> >>>
>> >>> Why? Single bracket extraction on a list results in a list; double
>> >>> bracket extraction results in the element of the list (a "column" in
>> >>> the case of a data frame, which is a specific kind of list). The
>> >>> relevant sections of ?Extract are:
>> >>>
>> >>> "Indexing by [ is similar to atomic vectors and selects a **list** of
>> >>> the specified element(s).
>> >>>
>> >>> Both [[ and $ select a **single element of the list**. "
>> >>>
>> >>>
>> >>> Hope this clarifies this often-confused issue.
>> >>>
>> >>>
>> >>> Cheers,
>> >>> Bert
>> >>>
>> >>>
>> >>> On Thu, Jun 23, 2016 at 7:34 AM, Ivan Calandra
>> >>> <ivan.calandra at univ-reims.fr> wrote:
>> >>>> My statement "Using a single bracket '[' on a data.frame does the same as
>> >>>> for matrices: you need to specify rows and columns" was not correct.
>> >>>>
>> >>>>
>> >>>> When you use a single bracket on a list with only one argument in between,
>> >>>> then R extracts "elements", i.e. columns in the case of a data.frame. This
>> >>>> explains your errors.
>> >>>>
>> >>>> But it is possible to use a single bracket on a data.frame with 2 arguments
>> >>>> (rows, columns) separated by a comma, as with matrices. This is the solution
>> >>>> you received.
>> >>>>
>> >>>> Ivan
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 23/06/2016 at 16:27, Ivan Calandra wrote:
>> >>>>> Dear Georg,
>> >>>>>
>> >>>>> You need to learn a bit more about the subsetting methods, depending on
>> >>>>> the object structure you're trying to subset.
>> >>>>>
>> >>>>> More specifically, when you run this: ds_test[is.na(ds_test$var1)]
>> >>>>> you get this error: "Error in `[.data.frame`(ds_test, is.na(ds_test$var1))
>> >>>>> : undefined columns selected"
>> >>>>>
>> >>>>> This means that R does not understand which column you're trying to
>> >>>>> select. But you're actually trying to select rows.
>> >>>>>
>> >>>>> Using a single bracket '[' on a data.frame does the same as for matrices:
>> >>>>> you need to specify rows and columns, like this:
>> >>>>> ds_test[is.na(ds_test$var1), ] ## notice the last comma
>> >>>>> ds_test[is.na(ds_test$var1), ] <- 0 ## works on all columns because you
>> >>>>> ## didn't specify any after the comma
>> >>>>>
>> >>>>> If you want it only for "var1", then you need to specify the column:
>> >>>>> ds_test[is.na(ds_test$var1), "var1"] <- 0
>> >>>>>
>> >>>>> It's the same problem with your 2nd and 4th tries (4th one has other
>> >>>>> problems). Your 3rd try does not change ds_test at all.
>> >>>>>
>> >>>>> HTH,
>> >>>>> Ivan
>> >>>>>
>> >>>>>
>> >>>>> On 23/06/2016 at 15:57, G.Maubach at weinwolf.de wrote:
>> >>>>>> Hi All,
>> >>>>>>
>> >>>>>> I would like to recode my NAs to 0. Using a single vector everything is
>> >>>>>> fine.
>> >>>>>>
>> >>>>>> But if I use a data.frame things go wrong:
>> >>>>>>
>> >>>>>> -- cut --
>> >>>>>>
>> >>>>>> var1 <- c(1:3, NA, 5:7, NA, 9:10)
>> >>>>>> var2 <- c(1:3, NA, 5:7, NA, 9:10)
>> >>>>>> ds_test <- data.frame(var1, var2)
>> >>>>>>
>> >>>>>> test <- var1
>> >>>>>> test[is.na(test)] <- 0
>> >>>>>> test # NA recoded OK
>> >>>>>>
>> >>>>>> # First try
>> >>>>>> ds_test[is.na(ds_test$var1)] <- 0 # duplicate subscripts WRONG
>> >>>>>>
>> >>>>>> # Second try
>> >>>>>> ds_test[is.na("var1")] <- 0
>> >>>>>> ds_test$var1 # not recoded WRONG
>> >>>>>>
>> >>>>>> # Third try: to me the most intuitive approach
>> >>>>>> is.na(ds_test["var1"]) <- 0 # attempt to select less than one element in
>> >>>>>> # integerOneIndex WRONG
>> >>>>>>
>> >>>>>> # Fourth try
>> >>>>>> ds_test[is.na(var1)] <- 0 # duplicate subscripts for columns WRONG
>> >>>>>>
>> >>>>>> -- cut --
>> >>>>>> How can I do it correctly?
>> >>>>>>
>> >>>>>> Where could I have found something about it?
>> >>>>>>
>> >>>>>> Kind regards
>> >>>>>>
>> >>>>>> Georg
>> >>>>>>
>> >>>>>> ______________________________________________
>> >>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>>> PLEASE do read the posting guide
>> >>>>>> http://www.R-project.org/posting-guide.html
>> >>>>>> and provide commented, minimal, self-contained, reproducible code.
>> >>>>>>