[R] nmax parameter in factor function

peter dalgaard pdalgd at gmail.com
Sun Jun 4 09:33:36 CEST 2017


No anomaly; you just need to know what nmax is for before trying to use it.

Basically, duplicated() works by looking up entries in a hash table (for which there is a substantial literature, just google it). This will be somewhat more efficient if you know the number of unique values in advance (otherwise the table is sized according to the length of the input vector), so you have the option of setting nmax. If you set nmax too small, you get to keep both pieces.
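
To make the intended use concrete, here is a minimal sketch, assuming a long character vector with only three distinct values. The vector x and its contents are made up for illustration; only unique(), factor() and their nmax argument come from R itself.

```r
# A long vector with only a handful of distinct values (illustrative data).
set.seed(1)
x <- sample(c("low", "mid", "high"), 1e5, replace = TRUE)

# nmax is an upper bound on the number of unique values, so the hash
# table can be sized for 3 entries instead of for length(x).
u <- unique(x, nmax = 3)
f <- factor(x, nmax = 3)

# nmax is only a sizing hint: the result is the same as without it.
same <- identical(f, factor(x))
print(length(u))
print(same)
```

The point is that nmax changes how much memory the lookup uses, not what comes back, provided the bound is large enough.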

nmax is directly linked to a variable in C code, and I expect that 0-based indexing is the reason that nmax can be one less than the actual number of unique values.
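
If you want to see where the limit sits on your own installation, you can probe it rather than guess. A rough sketch follows; the helper nmax_ok() is hypothetical (it is not part of R) and simply reports whether unique() accepts a given nmax. nmax = 1 is left out because ?duplicated says it is silently ignored. I have not written down the output, since the exact boundary is an implementation detail and could differ between R versions.

```r
# Hypothetical helper: TRUE if unique() runs without error for this nmax.
nmax_ok <- function(x, nmax) {
  tryCatch({
    unique(x, nmax = nmax)
    TRUE
  }, error = function(e) FALSE)
}

x <- letters[1:10]   # 10 unique values, as in the quoted example

# Try a range of nmax values around the true number of unique values.
candidates <- 2:12
ok <- vapply(candidates, function(k) nmax_ok(x, k), logical(1))
res <- setNames(ok, candidates)
print(res)
```

In the session quoted below, nmax = 9 was accepted for these 10 values and nmax = 8 gave "hash table is full".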

-pd  

> On 4 Jun 2017, at 06:35 , Bert Gunter <bgunter.4567 at gmail.com> wrote:
> 
> I'll go just a bit "fer-er." It appears the anomaly -- I hesitate to
> call it a bug -- is in the C code for duplicated.default():
> 
>> duplicated(letters[1:10],nmax=10)
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> 
>> duplicated(letters[1:10],nmax=9)
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> 
>> duplicated(letters[1:10],nmax=8) ## for all nmax <9
> Error in duplicated.default(letters[1:10], nmax = 8) : hash table is full
> 
> Cleverer folks than I must now explain (and possibly correct me).
> 
> Cheers,
> Bert
> 
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> 
> On Sat, Jun 3, 2017 at 9:11 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>> Well, you won't like this, but it is kind of wimpily (is that a word?)
>> documented:
>> 
>> If you check the code of factor(), you will see that nmax appears as
>> an argument in a call to unique(). ?unique says for nmax, "... see
>> duplicated" . And ?duplicated says:
>> 
>> "If nmax is set too small there is liable to be an error: nmax = 1 is
>> silently ignored."
>> 
>> So sometimes you get an error when nmax is too small with the hash
>> table error message; and sometimes you just apparently get the nmax
>> argument ignored:
>> 
>>> identical(factor(letters,nmax = 25), factor(letters,nmax=26))
>> [1] TRUE
>> 
>> and that, to paraphrase what Rodgers and Hammerstein said about Kansas City,
>> is about "as fer as I can go."
>> 
>> (http://lyricsplayground.com/alpha/songs/e/everythingsuptodateinkansascity.shtml)
>> 
>> Cheers,
>> Bert
>> 
>> 
>> On Sat, Jun 3, 2017 at 6:14 PM, Ramnik Bansal <ramnik.bansal at gmail.com> wrote:
>>> I have been trying to understand how the argument 'nmax' works in
>>> 'factor' function. R-Documentation states - "Since factors typically
>>> have quite a small number of levels, for large vectors x it is helpful
>>> to supply nmax as an upper bound on the number of unique values."
>>> 
>>> In the code below what is the reason for error when value of nmax is
>>> 24. Why did the same error not occur with nmax = 25  and also how come
>>> there are 26 levels when nmax = 25 ?
>>> 
>>>> factor(x = letters, nmax = 26)
>>> [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
>>> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
>>> 
>>>> factor(x = letters, nmax = 25)
>>> [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
>>> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
>>> 
>>>> factor(x = letters, nmax = 24)
>>> Error in unique.default(x, nmax = nmax) : hash table is full
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


