[Rd] Suggestions to speed up median() and has.na()
Duncan Murdoch
murdoch at stats.uwo.ca
Tue Apr 11 02:54:39 CEST 2006
On 4/10/2006 8:08 PM, Thomas Lumley wrote:
> On Mon, 10 Apr 2006, Duncan Murdoch wrote:
>
>> On 4/10/2006 7:22 PM, Thomas Lumley wrote:
>>> On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
>>>
>>>> Suggestion 2:
>>>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
>>>> as soon as a NA value is detected. In the best case it returns after
>>>> the first index with TRUE, in the worst case it returns after the last
>>>> index N with FALSE. The cost for is.na(x) is always O(N), and any()
>>>> in the best case O(1) and in the worst case O(N) (if any() is
>>>> implemented as I hope). An has.na() function would be very useful
>>>> elsewhere too.
>>> This sounds useful (though it has missed the deadline for 2.3.0).
>>>
>>> It won't help if the typical case is no missing values, as you suggest, but
>>> it will be faster when there are missing values.
>> I think it would help even in that case if the vector is large, because it
>> avoids allocating and disposing of the logical vector of the same length as
>> x.
>
> That makes sense. I have just tried, and for vectors of length ten
> million it does make a measurable difference.
>
>
>>>> BTW, without having checked the source code, it looks like is.na() is
>>>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
>>>> a vector without NAs. On the other hand, is.na(sum(x)) becomes
>>>> awfully slow if 'x' contains NAs.
>>>>
>>> I don't think it is unnecessarily slow. It has to dispatch methods and it
>>> has to make sure that matrix structure is preserved. After that the code
>>> is just
>>>
>>> case REALSXP:
>>> for (i = 0; i < n; i++)
>>> LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
>>> break;
>>>
>>> and it's hard to see how that can be improved. It does suggest that a
>>> faster anyNA() function would have to not be generic.
>> If it's necessary to make it not generic to achieve the speedup, I don't
>> think it's worth doing. If anyNA is written not to be generic I'd guess a
>> very common error will be to apply it to a dataframe and get a misleading
>> "FALSE" answer. If we do that, I predict that the total amount of r-help
>> time wasted on it will exceed the CPU time saved by orders of magnitude.
>>
>
> I wasn't proposing that it should be stupid, just not generic. It could
> support data frames (sum(), does, for example). If it didn't support data
> frames it should certainly give an error rather than the wrong answer, but
> if we are seriously trying to avoid delays around 0.1 seconds then going
> through the generic function mechanism may be a problem.
If it's not dataframes, it will be something else. I think it's highly
desirable that any(is.na(x)) == anyNA(x) within base packages, and we
should make it straightforward to maintain this identity in contributed
packages.
By the way, I think Bill's suggestion of calling it anyMissing makes a
lot of sense.
Duncan
More information about the R-devel
mailing list