[Rd] sd(NA)

Tue Dec 4 02:39:34 CET 2007

I also prefer the old behavior.  The old behavior of sd (return NA 
rather than stop with an error) is nicer when one is working with any 
kind of resampling technique.  If there are some NA's in the data, then 
one can happily "debug" with a small or medium number of samples, and 
only when running a full resample will one get a sample containing all 
NA's, which triggers the error and aborts the whole computation.  (Of 
course, the times this cost several hours of computation by happening 
close to the end are the ones most easily remembered.)
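
To make the resampling scenario concrete, here is a minimal sketch.  The 
data vector and the helper name guarded_sd are made up for illustration; 
the guard checks the number of non-missing values itself, so it does not 
depend on which way sd() behaves in a given R version.

```r
# Hypothetical data: mostly NA, so a resample can easily come out all-NA.
set.seed(1)
x <- c(1.2, NA, NA, NA, 3.4, NA)

# Guarded statistic (hypothetical helper): return NA when fewer than two
# non-missing values remain, otherwise the ordinary sd().
guarded_sd <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) < 2) NA_real_ else sd(v)
}

# Each replicate resamples x with replacement; the guard keeps a single
# degenerate resample from aborting the whole run.
boot_sd <- replicate(200, guarded_sd(sample(x, replace = TRUE)))

# Number of replicates with too little data to estimate a standard deviation.
sum(is.na(boot_sd))
```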

In R-devel (2.7.0), the following behavior occurs with various 
summary/statistics functions when given a vector of all NAs (with na.rm=TRUE):

sd, var: stop with error
mean, mad, median, IQR, quantile, fivenum: return NA
min, max, range: warn & return NA
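
A list like the one above can be regenerated on any R version with a 
small probe along the following lines.  This is only a sketch: the helper 
name probe is made up, and it records just the first thing that happens 
(an error, a warning, or a plain return), not the value returned.  The 
results will depend on the R version in use.

```r
# Hypothetical helper: call f on x with na.rm = TRUE and report whether the
# call signalled an error, signalled a warning, or returned quietly.
probe <- function(f, x) {
  tryCatch({
    f(x, na.rm = TRUE)
    "value"
  },
  warning = function(w) "warning",
  error   = function(e) "error")
}

# A vector consisting only of missing values.
x <- rep(NA_real_, 5)

# The functions discussed in this thread.
funs <- list(sd = sd, var = var, mean = mean, mad = mad, median = median,
             IQR = IQR, quantile = quantile, fivenum = fivenum,
             min = min, max = max, range = range)

# One outcome label per function, named by function.
outcome <- vapply(funs, function(f) probe(f, x), character(1))
print(outcome)
```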

Is the "stop with error" behavior really that useful with sd() & var() 
that these functions should differ in their behavior from mean(), mad(), 
etc?

Personally, I'd find it most convenient if all these functions just 
returned NA values (without warning) when unable to compute a value.
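
In the meantime, something close to that convention can be had with a 
wrapper.  The sketch below is an assumption about how one might do it, not 
anything in base R: na_on_failure is a made-up name, and it treats any 
warning as a failure too, which also hides warnings that one might want 
to see.  It always returns a single NA on failure, so it is only suited to 
functions that return one value.

```r
# Hypothetical helper: wrap a summary function so that any error or warning
# results in a silent NA instead of stopping or warning.
na_on_failure <- function(f) {
  function(x, ...) {
    tryCatch(
      f(x, ...),
      warning = function(w) NA_real_,
      error   = function(e) NA_real_
    )
  }
}

quiet_sd  <- na_on_failure(sd)
quiet_min <- na_on_failure(min)

# All-NA input: both wrapped functions give NA without stopping or warning.
quiet_sd(NA_real_, na.rm = TRUE)
quiet_min(NA_real_, na.rm = TRUE)

# Ordinary input is passed through unchanged.
quiet_min(c(3, 1, NA), na.rm = TRUE)
```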

-- Tony Plate

[resent because http://www.orbitrbl.com/ is blocking emails from one of 
my ISPs]

Patrick Burns wrote:
> I like the 2.6.x behaviour better.  Consider:
>
> x <- array(1:30, c(10,3))
> x[,1] <- NA
> x[-1,2] <- NA
> x[1,3] <- NA
>
> sd(x, na.rm=TRUE)
>
> # 2.7.0
> Error in var(x, na.rm = na.rm) : no complete element pairs
>
> # 2.6.x
> [1]       NA       NA 2.738613
>
> The reason to put 'na.rm=TRUE' into the call is to avoid
> getting an error due to missing values. (And, yes, in finance
> it is entirely possible to have a matrix with all NAs in a
> column.)
>
> I think the way out is to allow there to be a conceptual
> difference between computing a value with no data, and
> computing a value on all NAs after removing NAs.  The
> first is clearly impossible.  The second has some actual
> value, but we don't have enough information to have an
> estimate of the value.
>
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
> Prof Brian Ripley wrote:
>
>> On Sun, 2 Dec 2007, Wolfgang Huber wrote:
>>
>>> Dear Prof. Ripley
>>>
>>> I noted a change in the behaviour of "cov", which is very reasonable:
>>>
>>> ## R version 2.7.0 Under development (unstable) (2007-11-30 r43565)
>>>> cov(as.numeric(NA), as.numeric(NA), use="complete.obs")
>>> Error in cov(as.numeric(NA), as.numeric(NA), use = "complete.obs") :
>>>  no complete element pairs
>>>
>>> whereas earlier behavior was, for example:
>>> ## R version 2.6.0 Patched (2007-10-23 r43258)
>>>> cov(as.numeric(NA), as.numeric(NA), use="complete.obs")
>>> [1] NA
>>>
>>>
>>> I wanted to ask whether the effect this has on "sd" is desired:
>>>
>>> ## R version 2.7.0 Under development (unstable) (2007-11-30 r43565)
>>>> sd(NA, na.rm=TRUE)
>>> Error in var(x, na.rm = na.rm) : no complete element pairs
>>>
>>> ## R version 2.6.0 Patched (2007-10-23 r43258)
>>>> sd(NA, na.rm=TRUE)
>>> [1] NA
>>
>> That is a bug fix: see the NEWS entry.  The previous behaviour of
>>
>>> sd(numeric(0))
>> Error in var(x, na.rm = na.rm) : 'x' is empty
>>> sd(NA_real_, na.rm=TRUE)
>> [1] NA
>>
>> was not as documented:
>>
>>      This function computes the standard deviation of the values in
>>      'x'. If 'na.rm' is 'TRUE' then missing values are removed before
>>      computation proceeds.
>>
>> so somehow an empty vector had a sd() if computed one way, and not if 
>> computed another.
>>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>