[R] 'format' behaviour in a 'apply' call depending on 'options(digits = K)'

Thu Aug 1 20:08:57 CEST 2013

Hi Mathieu,

I don't have a full explanation, but here are some additional observations:

> options(digits = 4)
>
> ## Simplified example
> df2 <- data.frame(x = rnorm(21), y = rnorm(21), id = 99990:100010)
> apply(df2, 1, function(dfi) format(dfi["id"], scientific = FALSE))
 [1] "99990"  "99991"  "99992"  "99993"  "99994"  " 99995" " 99996" " 99997" " 99998" " 99999" "100000" "100001" "100002" "100003"
[15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
>
> ## Based on magnitude of id (99995 to 99999 get padded regardless of position)
> df2 <- data.frame(x = rnorm(21), y = rnorm(21), id = 100010:99990)
> apply(df2, 1, function(dfi) format(dfi["id"], scientific = FALSE))
 [1] "100010" "100009" "100008" "100007" "100006" "100005" "100004" "100003" "100002" "100001" "100000" " 99999" " 99998" " 99997"
[15] " 99996" " 99995" "99994"  "99993"  "99992"  "99991"  "99990"
>
> ## The issue is that formatting a double leads to the originally noted behavior.
> ## The apply version coerces df2 to a matrix of type double which is why this
> ## happens there as well.
>
> for(i in 1:nrow(df2)) print(format(df2[i, "id"], scientific=FALSE))
[1] "100010"
[1] "100009"
[1] "100008"
[1] "100007"
[1] "100006"
[1] "100005"
[1] "100004"
[1] "100003"
[1] "100002"
[1] "100001"
[1] "100000"
[1] "99999"
[1] "99998"
[1] "99997"
[1] "99996"
[1] "99995"
[1] "99994"
[1] "99993"
[1] "99992"
[1] "99991"
[1] "99990"
> for(i in 1:nrow(df2)) print(format(as.double(df2[i, "id"]), scientific=FALSE))
[1] "100010"
[1] "100009"
[1] "100008"
[1] "100007"
[1] "100006"
[1] "100005"
[1] "100004"
[1] "100003"
[1] "100002"
[1] "100001"
[1] "100000"
[1] " 99999"
[1] " 99998"
[1] " 99997"
[1] " 99996"
[1] " 99995"
[1] "99994"
[1] "99993"
[1] "99992"
[1] "99991"
[1] "99990"
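
For anyone who wants to check the coercion step on its own, here is a minimal sketch. The three-row data frame `df` is made up for illustration and is not from the original post; the only assumption is that the columns have the same types as in the examples above (two doubles and one integer).

```r
# Hypothetical three-row data frame; only the column types matter here
df <- data.frame(x = rnorm(3), y = rnorm(3), id = 99994:99996)

# The id column is stored as integer in the data frame
id_type_df <- typeof(df$id)

# apply() calls as.matrix() on a data frame; mixing integer and double
# columns yields a matrix of type double
m <- as.matrix(df)
id_type_matrix <- typeof(m[1, "id"])

print(c(data_frame = id_type_df, matrix = id_type_matrix))
```

So inside the function passed to apply(), dfi["id"] is a double rather than an integer, which appears to be where the 'digits' option comes into play.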

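If the goal is only a character version of an integer id, one way to sidestep the issue is to skip apply() and work on the column directly. This is a sketch, not a diagnosis: 'trim = TRUE' and 'as.character' both come up elsewhere in this thread, while the sprintf() variant is an extra alternative added here for comparison.

```r
# Same id range as in the examples above
ids <- 99990:100010

# Three ways to get unpadded strings without going through apply()
id_trim    <- format(ids, scientific = FALSE, trim = TRUE)
id_sprintf <- sprintf("%d", ids)
id_char    <- as.character(ids)

# All three agree element by element
stopifnot(all(id_trim == id_sprintf), all(id_sprintf == id_char))
```

Because the vector stays of type integer, none of these should depend on 'options(digits)'.
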
Best,
Ista

On Thu, Aug 1, 2013 at 11:31 AM, Mathieu Basille
<basille.web at ase-research.org> wrote:
> This problem does not seem to be widespread, but it affects at least two
> users (both on Linux, maybe a hint here?). To me, it looks like a bug (whether
> an R bug or an OS-related bug, I don't know). Should I forward it to R-devel,
> or some other place where R gurus may have a chance to look at it?
>
> Mathieu.
>
>
> On 07/30/2013 02:34 PM, arun wrote:
>
>> Hi Mathieu,
>> Yes, the original problem occurs on my system too. I am using R 3.0.1 on
>> Linux Mint 15. I guess the default is trim=FALSE, but it still
>> looks very strange, especially in apply(), as the padding starts from " 99995"
>> onwards.
>>
>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>   [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>>   [7] LC_PAPER=C                 LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] stringr_0.6.2  reshape2_1.2.2
>>
>> loaded via a namespace (and not attached):
>> [1] plyr_1.8    tools_3.0.1
>>
>>
>> ----- Original Message -----
>> From: Mathieu Basille <basille.web at ase-research.org>
>> To: arun <smartpink111 at yahoo.com>
>> Cc: R help <r-help at r-project.org>
>> Sent: Tuesday, July 30, 2013 2:29 PM
>> Subject: Re: [R] 'format' behaviour in a 'apply' call depending on
>> 'options(digits = K)'
>>
>> Thanks Arun for your answer. 'trim = TRUE' does indeed solve the symptoms
>> of the problem, and this is the solution I'm currently using. However, it
>> does not help to understand what the problem is or what causes it.
>>
>> Can you confirm that the original problem also occurs on your computer
>> (and what is your OS)? It would be interesting since David is not able to
>> reproduce the problem with Mac OS X.
>> Mathieu.
>>
>>
>> On 07/30/2013 02:15 PM, arun wrote:
>>>
>>> Hi,
>>> Try using trim=TRUE, in ?format()
>>> options(digits=4)
>>>
>>> df2 <- data.frame(x = rnorm(110000), y = rnorm(110000), id = 1:110000)
>>> df2$id2 <- apply(df2, 1, function(dfi) format(dfi["id"], trim = TRUE, scientific = FALSE))
>>> df2$id2[99990:100010]
>>> #  [1] "99990"  "99991"  "99992"  "99993"  "99994"  "99995"  "99996"  "99997"
>>> #  [9] "99998"  "99999"  "100000" "100001" "100002" "100003" "100004" "100005"
>>> # [17] "100006" "100007" "100008" "100009" "100010"
>>>
>>>
>>> id2 <- format(1:110000, scientific = FALSE, trim = TRUE)
>>> id2[99990:100010]
>>> #  [1] "99990"  "99991"  "99992"  "99993"  "99994"  "99995"  "99996"  "99997"
>>> #  [9] "99998"  "99999"  "100000" "100001" "100002" "100003" "100004" "100005"
>>> # [17] "100006" "100007" "100008" "100009" "100010"
>>> A.K.
>>>
>>>
>>> ----- Original Message -----
>>> From: Mathieu Basille <basille.web at ase-research.org>
>>> To: David Winsemius <dwinsemius at comcast.net>
>>> Cc: r-help at r-project.org
>>> Sent: Tuesday, July 30, 2013 2:07 PM
>>> Subject: Re: [R] 'format' behaviour in a 'apply' call depending on
>>> 'options(digits = K)'
>>>
>>> Thanks David for your interest. I have to admit that your answer puzzles
>>> me even more than before. It seems that the underlying problem is way beyond
>>> my R skills...
>>>
>>> The generation of id2 is indeed quite demanding, especially compared to a
>>> simple 'as.character' call. Anyway, since it seems to be system specific,
>>> here is the sessionInfo() that I forgot to attach to my first message:
>>>
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>
>>> locale:
>>>      [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C
>>>      [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8
>>>      [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8
>>>      [7] LC_PAPER=C                 LC_NAME=C
>>>      [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> In brief: the latest stable R available under Debian Testing... Hopefully this
>>> can help track down the problem.
>>> Mathieu.
>>>
>>>
>>> On 07/30/2013 01:58 PM, David Winsemius wrote:
>>>>
>>>>
>>>> On Jul 30, 2013, at 9:01 AM, Mathieu Basille wrote:
>>>>
>>>>> Dear list,
>>>>>
>>>>> Here is a simple example in which the behaviour of 'format' does not
>>>>> make sense to me. I have read the documentation and searched the archives,
>>>>> but nothing pointed me in the right direction to understand this behaviour.
>>>>> Let's start with a simple data frame:
>>>>>
>>>>> df1 <- data.frame(x = rnorm(110000), y = rnorm(110000), id = 1:110000)
>>>>>
>>>>> Let's now create a new variable 'id2' which is the character
>>>>> representation of 'id'. Note that I use 'scientific = FALSE' to ensure that
>>>>> long numbers such as 100,000 are not formatted using their scientific
>>>>> representation (in this case 1e+05):
>>>>>
>>>>> df1$id2 <- apply(df1, 1, function(dfi) format(dfi["id"], scientific =
>>>>> FALSE))
>>>>>
>>>>> Let's have a look at part of the result:
>>>>>
>>>>> df1$id2[99990:100010]
>>>>> [1] "99990"  "99991"  "99992"  "99993"  "99994"  "99995"  "99996"
>>>>> [8] "99997"  "99998"  "99999"  "100000" "100001" "100002" "100003"
>>>>> [15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
>>>>
>>>>
>>>> Some formatting processes are carried out by system functions. In this
>>>> case I am unable to reproduce with the same code on Mac OS 10.7.5/R 3.0.1
>>>> Patched:
>>>>
>>>>> df1$id2[99990:100010]
>>>>
>>>>  [1] "99990"  "99991"  "99992"  "99993"  "99994"  "99995"  "99996"  "99997"
>>>>  [9] "99998"  "99999"  "100000" "100001" "100002" "100003" "100004" "100005"
>>>> [17] "100006" "100007" "100008" "100009" "100010"
>>>>
>>>> (I did notice that generation of the id2 variable seemed to take an
>>>> inordinately long time.)
>>>>
>>>> -- David.
>>>>>
>>>>>
>>>>> So far, so good. Let's now play with the 'digits' option:
>>>>>
>>>>> options(digits = 4)
>>>>> df2 <- data.frame(x = rnorm(110000), y = rnorm(110000), id = 1:110000)
>>>>> df2$id2 <- apply(df2, 1, function(dfi) format(dfi["id"], scientific =
>>>>> FALSE))
>>>>> df2$id2[99990:100010]
>>>>> [1] "99990"  "99991"  "99992"  "99993"  "99994"  " 99995" " 99996"
>>>>> [8] " 99997" " 99998" " 99999" "100000" "100001" "100002" "100003"
>>>>> [15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
>>>>>
>>>>> Notice the extra leading space from 99995 to 99999? To make sure it
>>>>> only happened there:
>>>>>
>>>>> df2$id2[which(df1$id2 != df2$id2)]
>>>>> [1] " 99995" " 99996" " 99997" " 99998" " 99999"
>>>>>
>>>>> And just to make sure it only occurs in an 'apply' call, here is the
>>>>> same directly on a numeric vector:
>>>>>
>>>>> id2 <- format(1:110000, scientific = FALSE)
>>>>> id2[99990:100010]
>>>>> [1] " 99990" " 99991" " 99992" " 99993" " 99994" " 99995" " 99996"
>>>>> [8] " 99997" " 99998" " 99999" "100000" "100001" "100002" "100003"
>>>>> [15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
>>>>>
>>>>> Here the leading spaces are for every number, which makes sense to me.
>>>>> Is there anything I'm misinterpreting in the behaviour of 'format'?
>>>>> Thanks in advance for any hint,
>>>>> Mathieu.
>>>>>
>>>>>
>>>>> PS: Some background for this question. It all comes from an Rmd
>>>>> document that knitr consistently failed to process, while the R code was
>>>>> fine using batch or interactive R. knitr uses 'options(digits = 4)' as
>>>>> opposed to the R default of 'options(digits = 7)', which made one of my
>>>>> functions throw an error with knitr, but not with batch or interactive R. I
>>>>> managed to solve the problem using 'trim = TRUE' in 'format', but I still do
>>>>> not understand what's going on...
>>>>> If you're interested, see here for more details on the original
>>>>> problem:
>>>>> http://stackoverflow.com/questions/17866230/knitr-vs-interactive-r-behaviour/17872176
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> ~$ whoami
>>>>> Mathieu Basille, PhD
>>>>>
>>>>> ~$ locate --details
>>>>> University of Florida \\
>>>>> Fort Lauderdale Research and Education Center
>>>>> (+1) 954-577-6314
>>>>> http://ase-research.org/basille
>>>>>
>>>>> ~$ fortune
>>>>> « Le tout est de tout dire, et je manque de mots
>>>>> Et je manque de temps, et je manque d'audace. »
>>>>> -- Paul Éluard
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>> David Winsemius
>>>> Alameda, CA, USA
>>>>