[R] Subscripting problem with is.na()

David L Carlson dcarlson at tamu.edu
Fri Jun 24 16:04:12 CEST 2016


Yes, measurements below detection should be treated differently. I thought about the missing data issue, but there is another context: spreadsheets of count data in which 0 entries are deliberately left blank for readability or economy. In that case it is easier to import the data and use R to replace the missing 0s than to fill in the empty cells in the spreadsheet before importing it.

David C
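
The blank-cell case described above can be sketched as follows. This is only an illustration: the table, its column names (site, flakes, cores) and the counts are hypothetical, and it assumes that every blank cell in a numeric column really does mean a count of zero.

```r
# Hypothetical count table; blank cells stand for zero counts (assumption).
csv_text <- "site,flakes,cores
A,3,
B,,2
C,5,1"
counts <- read.csv(text = csv_text)

# Blank numeric cells are read as NA. Replace them in numeric columns only,
# so character or factor columns are left untouched.
num <- vapply(counts, is.numeric, logical(1))
counts[num] <- lapply(counts[num], function(x) replace(x, is.na(x), 0))
counts
```

Working column by column keeps each column's class under control, which matters once the data frame is not all numeric.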

-----Original Message-----
From: Bert Gunter [mailto:bgunter.4567 at gmail.com] 
Sent: Thursday, June 23, 2016 4:56 PM
To: David L Carlson
Cc: Ivan Calandra; R Help
Subject: Re: [R] Subscripting problem with is.na()

... actually, FWIW, I would say that this little discussion mostly
demonstrates why the OP's request is probably not a good idea in the
first place. Usually, NA's should be left as NA's to be dealt with
properly by R and packages. In biological measurements, for example,
NA's often mean "below the ability to reliably measure." Biologists
with whom I've worked over many years often want to convert these to 0
or omit the cases, both of which lead to biased estimates and/or
underestimates of variability and excess claims of "statistical
significance" (for those who belong to this religious persuasion). One
should never say never, but I suspect that there are relatively few
circumstances where the conversion the OP requested is actually wise.

Feel free to ignore/reject such extraneous comments of course.

Cheers,
Bert
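
A toy illustration of the bias described above. The numbers and the detection limit are invented for the example, and values below the limit are assumed to be recorded as NA; it is a sketch of the argument, not an analysis of real data.

```r
# Invented "true" values and an assumed detection limit.
true_values <- c(0.2, 0.4, 0.8, 1.2, 1.0, 1.5)
detection_limit <- 0.5

# Anything below the limit is recorded as NA.
observed <- ifelse(true_values < detection_limit, NA, true_values)

mean_true <- mean(true_values)
mean_omit <- mean(observed, na.rm = TRUE)              # drop the NA cases
zero_filled <- replace(observed, is.na(observed), 0)   # recode NA as 0
mean_zero <- mean(zero_filled)

print(c(true = mean_true, omit = mean_omit, zero = mean_zero))
```

In this toy data, omitting the NA cases overstates the mean and recoding them as 0 understates it; neither recovers the mean of the true values.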


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Thu, Jun 23, 2016 at 12:14 PM, David L Carlson <dcarlson at tamu.edu> wrote:
> Good point. I did not think about factors. Also, your example raises another issue: column c is logical but gets silently converted to numeric. The following seems to get the job done, assuming the conversion is intended for numeric columns only:
>
>> test <- data.frame(a=c(1,NA,2), b = c("A","b",NA), c= rep(NA,3))
>> sapply(test, class)
>         a         b         c
> "numeric"  "factor" "logical"
>> num <- sapply(test, is.numeric)
>> test[, num][is.na(test[, num])] <- 0
>> test
>   a    b  c
> 1 1    A NA
> 2 0    b NA
> 3 2 <NA> NA
>
> David C
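
A variant of the numeric-columns-only replacement quoted above, as a sketch. Two things here are assumptions rather than part of the original post: stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later no longer convert strings to factors by default, and list-style indexing (test[num]) is used so that the selection stays a data frame however many numeric columns there are.

```r
# Reproduce the 2016 behaviour explicitly: column b becomes a factor.
test <- data.frame(a = c(1, NA, 2), b = c("A", "b", NA), c = rep(NA, 3),
                   stringsAsFactors = TRUE)

# Flag numeric columns; vapply() guarantees a logical vector.
num <- vapply(test, is.numeric, logical(1))

# test[num] is always a data frame, so is.na() returns a matrix of the
# same shape and the replacement touches numeric columns only.
test[num][is.na(test[num])] <- 0
test
```

The factor column b and the logical column c keep their NAs and their classes.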
>
> -----Original Message-----
> From: Bert Gunter [mailto:bgunter.4567 at gmail.com]
> Sent: Thursday, June 23, 2016 1:48 PM
> To: David L Carlson
> Cc: Ivan Calandra; R Help
> Subject: Re: [R] Subscripting problem with is.na()
>
> Not in general, David:
>
> e.g.
>
>> test <- data.frame(a=c(1,NA,2), b = c("A","b",NA), c= rep(NA,3))
>
>> is.na(test)
>          a     b    c
> [1,] FALSE FALSE TRUE
> [2,]  TRUE FALSE TRUE
> [3,] FALSE  TRUE TRUE
>
>> test[is.na(test)]
> [1] NA NA NA NA NA
>
>> test[is.na(test)] <- 0
> Warning message:
> In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
>   invalid factor level, NA generated
>
>> test
>   a    b c
> 1 1    A 0
> 2 0    b 0
> 3 2 <NA> 0
>
>
> The problem is the default conversion to factors and the replacement
> operation for factors. So:
>
>> test <- data.frame(a=c(1,NA,2), b = I(c("A","b",NA_character_)), c= rep(NA,3))
>> class(test$b)
> [1] "AsIs"  ## so NOT a factor
>
>> test[is.na(test)] <- 0 # now works as you describe
>> test
>   a b c
> 1 1 A 0
> 2 0 b 0
> 3 2 0 0
>
> Of course the OP (and you) probably had a data frame of all numerics
> in mind, so the problem doesn't arise. But I think one needs to make
> the distinction and issue clear.
>
> Cheers,
> Bert
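
If a factor column really should get a "0" label in place of NA, one possible way around the "invalid factor level" warning quoted above is to add the level before assigning. This is a sketch under that assumption; whether such a recoding is sensible depends on the data.

```r
f <- factor(c("A", "b", NA))

# Assigning a value that is not an existing level produces NA plus a warning,
# so extend the levels first and then assign the new label.
levels(f) <- c(levels(f), "0")
f[is.na(f)] <- "0"
f
```

The replacement value is the character string "0", not the number 0, because factor levels are labels.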
>
> On Thu, Jun 23, 2016 at 8:46 AM, David L Carlson <dcarlson at tamu.edu> wrote:
>> The function is.na() returns a matrix when applied to a data.frame so you can easily convert all the NAs to 0's:
>>
>>> ds_test
>>    var1 var2
>> 1     1    1
>> 2     2    2
>> 3     3    3
>> 4    NA   NA
>> 5     5    5
>> 6     6    6
>> 7     7    7
>> 8    NA   NA
>> 9     9    9
>> 10   10   10
>>> is.na(ds_test)
>>        var1  var2
>>  [1,] FALSE FALSE
>>  [2,] FALSE FALSE
>>  [3,] FALSE FALSE
>>  [4,]  TRUE  TRUE
>>  [5,] FALSE FALSE
>>  [6,] FALSE FALSE
>>  [7,] FALSE FALSE
>>  [8,]  TRUE  TRUE
>>  [9,] FALSE FALSE
>> [10,] FALSE FALSE
>>> ds_test[is.na(ds_test)] <- 0
>>> ds_test
>>    var1 var2
>> 1     1    1
>> 2     2    2
>> 3     3    3
>> 4     0    0
>> 5     5    5
>> 6     6    6
>> 7     7    7
>> 8     0    0
>> 9     9    9
>> 10   10   10
>>
>> -------------------------------------
>> David L Carlson
>> Department of Anthropology
>> Texas A&M University
>> College Station, TX 77840-4352
>>
>> -----Original Message-----
>> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Ivan Calandra
>> Sent: Thursday, June 23, 2016 10:14 AM
>> To: R Help
>> Subject: Re: [R] Subscripting problem with is.na()
>>
>> Thank you Bert for this clarification. It is indeed an important point.
>>
>> Ivan
>>
>> --
>> Ivan Calandra, PhD
>> Scientific Mediator
>> University of Reims Champagne-Ardenne
>> GEGENAA - EA 3795
>> CREA - 2 esplanade Roland Garros
>> 51100 Reims, France
>> +33(0)3 26 77 36 89
>> ivan.calandra at univ-reims.fr
>> --
>> https://www.researchgate.net/profile/Ivan_Calandra
>> https://publons.com/author/705639/
>>
>> On 23/06/2016 at 17:06, Bert Gunter wrote:
>>> Sorry, Ivan, your statement is incorrect:
>>>
>>> "When you use a single bracket on a list with only one argument in
>>> between, then R extracts "elements", i.e. columns in the case of a
>>> data.frame. This explains your errors. "
>>>
>>> e.g.
>>>
>>>> ex <- data.frame(a = 1:3, b = letters[1:3])
>>>> a <- 1:3
>>>> identical(ex[1], a)
>>> [1] FALSE
>>>
>>>> class(ex[1])
>>> [1] "data.frame"
>>>> class(a)
>>> [1] "integer"
>>>
>>> Compare:
>>>
>>>> identical(ex[[1]], a)
>>> [1] TRUE
>>>
>>> Why? Single bracket extraction on a list results in a list; double
>>> bracket extraction results in the element of the list ( a "column" in
>>> the case of a data frame, which is a specific kind of list). The
>>> relevant sections of ?Extract are:
>>>
>>> "Indexing by [ is similar to atomic vectors and selects a **list** of
>>> the specified element(s).
>>>
>>> Both [[ and $ select a **single element of the list**. "
>>>
>>>
>>> Hope this clarifies this often-confused issue.
>>>
>>>
>>> Cheers,
>>> Bert
>>>
>>> On Thu, Jun 23, 2016 at 7:34 AM, Ivan Calandra
>>> <ivan.calandra at univ-reims.fr> wrote:
>>>> My statement "Using a single bracket '[' on a data.frame does the same as
>>>> for matrices: you need to specify rows and columns" was not correct.
>>>>
>>>>
>>>> When you use a single bracket on a list with only one argument in between,
>>>> then R extracts "elements", i.e. columns in the case of a data.frame. This
>>>> explains your errors.
>>>>
>>>> But it is possible to use a single bracket on a data.frame with 2 arguments
>>>> (rows, columns) separated by a comma, as with matrices. This is the solution
>>>> you received.
>>>>
>>>> Ivan
>>>>
>>>> On 23/06/2016 at 16:27, Ivan Calandra wrote:
>>>>> Dear Georg,
>>>>>
>>>>> You need to learn a bit more about the subsetting methods, depending on
>>>>> the object structure you're trying to subset.
>>>>>
>>>>> More specifically, when you run this: ds_test[is.na(ds_test$var1)]
>>>>> you get this error: "Error in `[.data.frame`(ds_test, is.na(ds_test$var1))
>>>>> : undefined columns selected"
>>>>>
>>>>> This means that R does not understand which column you're trying to
>>>>> select. But you're actually trying to select rows.
>>>>>
>>>>> Using a single bracket '[' on a data.frame does the same as for matrices:
>>>>> you need to specify rows and columns, like this:
>>>>> ds_test[is.na(ds_test$var1), ] ## notice the last comma
>>>>> ds_test[is.na(ds_test$var1), ] <- 0 ## works on all columns because you
>>>>> ## didn't specify any after the comma
>>>>>
>>>>> If you want it only for "var1", then you need to specify the column:
>>>>> ds_test[is.na(ds_test$var1), "var1"] <- 0
>>>>>
>>>>> It's the same problem with your 2nd and 4th tries (4th one has other
>>>>> problems). Your 3rd try does not change ds_test at all.
>>>>>
>>>>> HTH,
>>>>> Ivan
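
The working forms discussed above can be put side by side using the data from the original question. The object names by_column, by_cell and x are introduced only for this sketch.

```r
var1 <- c(1:3, NA, 5:7, NA, 9:10)
var2 <- c(1:3, NA, 5:7, NA, 9:10)
ds_test <- data.frame(var1, var2)

# One column: index the column vector itself.
by_column <- ds_test
by_column$var1[is.na(by_column$var1)] <- 0

# Same result with [row, column] indexing on the data frame.
by_cell <- ds_test
by_cell[is.na(by_cell$var1), "var1"] <- 0

# `is.na<-` goes the other way: it sets the selected elements to NA,
# so it cannot be used to recode NA as 0.
x <- c(10, 20, 30)
is.na(x) <- 2
x
```

A bare ds_test[is.na(ds_test$var1)] fails because a single index on a data frame selects columns, and a logical vector of length 10 does not match two columns.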
>>>>>
>>>>> On 23/06/2016 at 15:57, G.Maubach at weinwolf.de wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I would like to recode my NAs to 0. Using a single vector everything is
>>>>>> fine.
>>>>>>
>>>>>> But if I use a data.frame things go wrong:
>>>>>>
>>>>>> -- cut --
>>>>>>
>>>>>> var1 <- c(1:3, NA, 5:7, NA, 9:10)
>>>>>> var2 <- c(1:3, NA, 5:7, NA, 9:10)
>>>>>> ds_test <-
>>>>>>     data.frame(var1, var2)
>>>>>>
>>>>>> test <- var1
>>>>>> test[is.na(test)] <- 0
>>>>>> test  # NA recoded OK
>>>>>>
>>>>>> # First try
>>>>>> ds_test[is.na(ds_test$var1)] <- 0  # duplicate subscripts WRONG
>>>>>>
>>>>>> # Second try
>>>>>> ds_test[is.na("var1")] <- 0
>>>>>> ds_test$var1  # not recoded WRONG
>>>>>>
>>>>>> # Third try: to me the most intuitive approach
>>>>>> is.na(ds_test["var1"]) <- 0  # attempt to select less than one element in
>>>>>> # integerOneIndex WRONG
>>>>>>
>>>>>> # Fourth try
>>>>>> ds_test[is.na(var1)] <- 0  # duplicate subscripts for columns WRONG
>>>>>>
>>>>>> -- cut --
>>>>>> How can I do it correctly?
>>>>>>
>>>>>> Where could I have found something about it?
>>>>>>
>>>>>> Kind regards
>>>>>>
>>>>>> Georg
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>

