[Rd] [External] Vector underflow [-1] in sort(method="radix", na.last=NA)
Benjamin Schwendinger
benj@m|n@chwe @end|ng |rom gm@||@com
Sat Dec 27 03:54:00 CET 2025
Hello! data.table diverged over 7 years ago from this code when Matt
rewrote forder.
> This code was originally contributed by data.table. I believe Michael
> Lawrence handled the integration at the time. There were a number of
> issues like this early on that were resolved on the R side and I
> believe contributed back to data.table. If you have the energy it
> might be good to compare the two now and see if there are things that
> should be ported from one to the other.
The original code had the following note which is also mentioned
multiple times but not explicitly part of radixsort.c
>> Note on challenges implementing `nalast=NA` so that things are under the same perspective when looked at later. To go to the point, there are four challenges in implementing `nalast=NA` with the current form of forder.c (which is excellent!):
>>
>> 1) Handling the i|d|csorted routines for nalast=NA:
>> The current implementation of nalast=NA in *sorted routine is that if all values are NAs, then it returns -2, a different value, so that all values of o/osub can be replacedw with 0. If any values are NA, then we assume that this vector is unsorted and return 0 to be taken care of by the corresponding sort routine. Now, this may very well be questioned/improved. For ex: we could detect if the vector is sorted and also if the first value is NA, and if so, get the group size of these NAs and replace that many indices with 0. This is not yet done, but could very well be implemented later. If there are no NAs in the vector, then things are the same.
>>
>> 2) Handling cases where n==1 and n==2:
>> The current sorting routines don't take care of n<=2, and rightly so, as they can be dealt with directly. But it's not the case for nalast=NA because, when n==2 and all are NAs, point 1 will handle it. But when n==1 and it is NA and the column is not the first column to order upon, with the current implementation, "thisgrpn" is checked for "== 1" and if so, just pushes that group size and continues.. But we've to replace it with 0 in our case. So, nalast==0 is checked for inside that if statement and if so, if the value is NA is checked for all types and if so, replaced with 0. For n==2 and "any" is NA, we've to introduce a special case for this in each sort routine to take care of and not result in an error (which tells this should be already taken care of in *sorted).
>>
>> 3) o[0] and newo[0] = -1:
>> Since nalast already populates NAs with 0 values in o, we can't use the old value of 0 to check if it's already populated or not. Therefore this has been changed to -1.
>>
>> 4) default cases are untouched as much as possible:
>> In order to not compromise in speed for default case (`setkey`), iinsert and dinsert are not touched at all, which means, we've to make sure that they are not run when n<200 and nalast==0, as they don't replace o[x=NA] with 0. And, 'icount', 'iradix', 'dradix' are modified to take care of nalast=NA, to my knowledge.
However, case 2) does not cater for Ivans identified corner case
example of order(NA_character_, "c", method = "radix", na.last = NA),
which is not tied to n==1 or n==2 but rather to thisgrpn==1 and
x=c(NA_character_, another_string).
Hence, the missing case is indeed to cater for o[i] = 0 with Ivans
1-liner or more explicitly via
--- src/main/radixsort.c
@@ -1766,7 +1766,12 @@
// this edge case had to be taken care of
// here.. (see the bottom of this file for
// more explanation)
- switch (TYPEOF(x)) {
+ if (o[i] == 0) {
+ isSorted = false;
+ i++;
+ push(1);
+ continue;
+ } switch (TYPEOF(x)) {
case INTSXP:
if (INTEGER(x)[o[i] - 1] == NA_INTEGER) {
isSorted = false;
Best, Ben
[[alternative HTML version deleted]]
More information about the R-devel
mailing list