[Rd] Suggestions to speed up median() and has.na()

Thomas Lumley tlumley at u.washington.edu
Tue Apr 11 02:08:02 CEST 2006


On Mon, 10 Apr 2006, Duncan Murdoch wrote:

> On 4/10/2006 7:22 PM, Thomas Lumley wrote:
>> On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
>> 
>>> Suggestion 2:
>>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
>>> as soon as an NA value is detected.  In the best case it returns after
>>> the first index with TRUE, in the worst case it returns after the last
>>> index N with FALSE.  The cost of is.na(x) is always O(N), while any()
>>> is in the best case O(1) and in the worst case O(N) (if any() is
>>> implemented as I hope).  A has.na() function would be very useful
>>> elsewhere too.
>> 
>> This sounds useful (though it has missed the deadline for 2.3.0).
>> 
>> It won't help if the typical case is no missing values, as you suggest, but 
>> it will be faster when there are missing values.
>
> I think it would help even in that case if the vector is large, because it 
> avoids allocating and disposing of the logical vector of the same length as 
> x.

That makes sense. I have just tried, and for vectors of length ten 
million it does make a measurable difference.


>>> BTW, without having checked the source code, it looks like is.na() is
>>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
>>> a vector without NAs.  On the other hand, is.na(sum(x)) becomes
>>> awfully slow if 'x' contains NAs.
>>> 
>> 
>> I don't think it is unnecessarily slow.  It has to dispatch methods and it 
>> has to make sure that matrix structure is preserved.  After that the code 
>> is just
>>
>>      case REALSXP:
>>          for (i = 0; i < n; i++)
>>              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
>>          break;
>> 
>> and it's hard to see how that can be improved. It does suggest that a 
>> faster anyNA() function would have to not be generic.
>
> If it's necessary to make it not generic to achieve the speedup, I don't 
> think it's worth doing.  If anyNA is written not to be generic I'd guess a 
> very common error will be to apply it to a dataframe and get a misleading 
> "FALSE" answer.  If we do that, I predict that the total amount of r-help 
> time wasted on it will exceed the CPU time saved by orders of magnitude.
>

I wasn't proposing that it should be stupid, just not generic.  It could 
support data frames (sum() does, for example).  If it didn't support data 
frames, it should certainly give an error rather than the wrong answer, but 
if we are seriously trying to avoid delays of around 0.1 seconds then going 
through the generic function mechanism may be a problem.


 	-thomas


