[R] Strange result when subsetting a data frame based on a character variable

Thierry Onkelinx thierry.onkelinx at inbo.be
Tue Nov 17 21:40:08 CET 2015


Dear Duncan,

I'd rather convert the number to character, e.g. with sprintf(), or with
format() when it is a numeric vector.

subset(Data, group == "100000")
subset(Data, group == sprintf("%.f", 100000))

sprintf("%.f", 100000) # "100000"

It requires the user to think about the format, which can reduce errors.
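
A minimal sketch of the difference, assuming default options and a data frame like the one in the quoted example. The names target, key_sprintf, key_format, keys and hits are only illustrative and do not come from the original posts.

```r
# Rebuild the example data from the quoted message, keeping group as character.
Data <- data.frame(value = 1:6,
                   group = rep(c("20000", "99999", "100000"), each = 2),
                   stringsAsFactors = FALSE)

target <- 100000  # illustrative name, not from the original post

# Implicit coercion goes through as.character(); with default options
# this typically yields "1e+05", which matches no value of group.
as.character(target)

# Explicit conversions that keep fixed notation:
key_sprintf <- sprintf("%.0f", target)             # "100000"
key_format  <- format(target, scientific = FALSE)  # "100000"

# format() pads a vector to a common width unless trim = TRUE,
# so trim when the strings are used for exact matching.
keys <- format(c(20000, target), scientific = FALSE, trim = TRUE)

# Comparing character with character gives the expected rows.
hits <- subset(Data, group == key_sprintf)
nrow(hits)  # 2
```

Whether sprintf() or format() is preferable depends on the data; the point is only that the conversion is spelled out rather than left to implicit coercion.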

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2015-11-17 21:27 GMT+01:00 Duncan Murdoch <murdoch.duncan at gmail.com>:

> On 17/11/2015 2:25 PM, Duncan Murdoch wrote:
>
>> On 17/11/2015 2:14 PM, Karl Schilling wrote:
>> > Dear all,
>> >
>> > I have one observation that I do not quite understand. Maybe someone
>> > can clarify this issue for me.
>> >
>> > I have a data frame which I want to subset based on a grouping variable,
>> > say "group". Actually, "group" holds numeric values, but it is stored as
>> > character. Code to generate an example data frame is given below.
>> >
>> > Now, if I use
>> >
>> > MySubset <- subset(Data, Data$group == "..")
>> >
>> > everything works fine, as expected. ".." stands here for the value of
>> > group given as a character string.
>> >
>> > Surprisingly, I also get a correct subsetting if I simply give the plain
>> > numeric value of group (like MySubset <- subset(Data, Data$group == ..)),
>> > AS LONG AS this numeric value is less than 100000.
>> >
>> > If the numeric value is 100000 or larger, I get an empty subset.
>> >
>> > OK, I know how to avoid this situation, but I wonder what explains
>> > this behavior, which seems rather strange to me.
>> >
>> > Thank you so much for your suggestions.
>>
>> If you are comparing a character value to a numeric value, the numeric
>> value is converted to character using as.character() for the
>> comparison.  For 100000 or a larger number, the result of as.character()
>> is likely not a plain digit string such as "100000"; try it.  (With the
>> options I have on my computer, I get "1e+05".)
>>
>> If you want a numeric comparison, be explicit:
>>
>> subset(Data, as.numeric(Data$group) == ..)
>>
>
> This might be bad advice.  If Data$group is a factor (as it tends to be
> when character data is put in a data frame), this will use the underlying
> integer factor codes, not the visible labels.  You need to use
>
> as.numeric(as.character(Data$group))
>
> to do the conversion you probably want.
>
> Duncan Murdoch
>
>
>>
>> Duncan Murdoch
>>
>> >
>> >
>> > Karl Schilling
>> >
>> >
>> > #####
>> > # Example code to reproduce the problem described above:
>> >
>> > options(stringsAsFactors = FALSE)
>> >
>> > # set up some data frame
>> > value <- c(1:6)
>> > group <- rep(c("20000", "99999", "100000"), each = 2)
>> > Data <- data.frame(value = value, group = group)
>> > str(Data)
>> >
>> > # subset data frame based on the value of the variable "group",
>> > # treating this value once as a character, and once as a number:
>> >
>> > Data20 <- subset(Data, Data$group == "20000")
>> > str(Data20)
>> > Data20N <- subset(Data, Data$group == 20000)
>> > str(Data20N)
>> >
>> > Data99 <- subset(Data, Data$group == "99999")
>> > str(Data99)
>> > Data99N <- subset(Data, Data$group == 99999)
>> > str(Data99N)
>> >
>> > Data100 <- subset(Data, Data$group == "100000")
>> > str(Data100)
>> > Data100N <- subset(Data, Data$group == 100000)
>> > str(Data100N)
>> >
>>
>>
>
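The factor pitfall raised in the quoted thread can be sketched as follows. This is only an illustration, assuming group was stored as a factor; the names g, codes and values are hypothetical.

```r
# A factor built from the same labels used in the example data.
g <- factor(c("20000", "99999", "100000"))

# Levels are sorted as strings: "100000" "20000" "99999".
levels(g)

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(g)                 # 2 3 1

# Going through character first recovers the numbers shown by the labels.
values <- as.numeric(as.character(g))  # 20000 99999 100000

# A numeric comparison is then explicit and independent of string formatting.
which(values == 100000)                # 3
```

With the values converted this way, the comparison no longer depends on how as.character() happens to print 100000.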
