[R] Is this a bug or am I making a mistake?

Sun Jan 12 21:40:59 CET 2014

On Jan 6, 2014, at 11:16 AM, Walter Anderson wrote:

> On 01/06/2014 11:14 AM, Sarah Goslee wrote:
>> Hi Walter,
>> 
>> I can't reproduce your results. Please provide some data that
>> demonstrates the problem.
>> 
>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> 
>> subset() and [ differ in their handling of NA values, and you don't
>> need the dd$ in the arguments to subset().
>> 
>> But those don't explain your result given the information provided.
>> Please provide more information.
>> 
>> Sarah
>> 
>> 
>> On Mon, Jan 6, 2014 at 12:06 PM, Walter Anderson <wandrson01 at gmail.com> wrote:
>>> I have a data frame that I am extracting some records from and noticed the
>>> following issue
>>> 
>>> I originally used tmp <- subset(dd, dd$EVYEAR==2012 & dd$EVMONTH=='02')
>>> 
>>> and noticed that I wasn't ending up with all of the records I should have;
>>> however, when I used
>>> 
>>> tmp <- dd[dd$EVYEAR==2012 & dd$EVMONTH=='02',]
>>> 
>>> I did get all of the records I should have.
>>> 
>>> I thought the two forms were equivalent, am I mistaken?
>>> 
> Thanks everyone for the response.  I didn't provide a reproducible test, since the data I experienced this issue with was quite large (> 40MB) and I have not been able to reproduce the problem with any other data set.  I have also performed the subset using Microsoft Access on the original dbf file I use for the data frame and confirmed that the second query format (dd[QUERY,]) is producing the correct results.  It doesn't appear that any of the impacted (or any in the data frame) contain NA records.

What does it mean to say "it doesn't appear that any of the impacted (or any in the data frame) contain NA records"? Where is the code and output to support that "appearance"?

What does this show?

table( is.na(dd$EVYEAR==2012), is.na(dd$EVMONTH=='02') )
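
To illustrate why that check matters, here is a minimal sketch on invented data. Only the column names EVYEAR and EVMONTH come from the thread; the values are made up and are not a reconstruction of the real data set. It shows how NA values in the comparison make the two forms return different row counts: subset() drops rows where the condition is NA, while "[" returns an all-NA row for each of them.

```r
# Hypothetical toy data: column names follow the thread, values are invented.
dd <- data.frame(
  EVYEAR  = c(2012, 2012, NA, 2011, 2012),
  EVMONTH = c("02", NA, "02", "02", "03"),
  stringsAsFactors = FALSE
)

# Cross-tabulate which of the two comparisons evaluate to NA.
na_tab <- table(year_na  = is.na(dd$EVYEAR == 2012),
                month_na = is.na(dd$EVMONTH == "02"))
print(na_tab)

# The combined condition is NA wherever one side is NA and the other is TRUE.
cond <- dd$EVYEAR == 2012 & dd$EVMONTH == "02"

by_subset  <- subset(dd, EVYEAR == 2012 & EVMONTH == "02")  # NA treated as FALSE
by_bracket <- dd[cond, ]                                    # NA yields all-NA rows
by_which   <- dd[which(cond), ]                             # which() discards NA

nrow(by_subset)    # 1
nrow(by_bracket)   # 3
nrow(by_which)     # 1
```

If the real data behave like this, the extra rows returned by "[" would be rows filled with NA rather than genuine matches, so comparing nrow() alone would not show which result is right.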

The other difference between "[" and `subset` is that `subset` defaults to drop=FALSE, although it is not clear how that would affect the results here.
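
As a small sketch of where the drop default does show up (the data frame `one_col` is hypothetical and unrelated to the poster's data): with a single-column data frame, row indexing with "[" collapses the result to a vector, while subset() keeps it a data frame. With more than one column, as in the original question, both forms return a data frame.

```r
# Hypothetical single-column data frame, invented for illustration only.
one_col <- data.frame(x = 1:3)

via_bracket <- one_col[one_col$x > 1, ]               # dropped to an integer vector
via_subset  <- subset(one_col, x > 1)                 # stays a data frame
via_nodrop  <- one_col[one_col$x > 1, , drop = FALSE] # matches subset()

is.data.frame(via_bracket)  # FALSE
is.data.frame(via_subset)   # TRUE
is.data.frame(via_nodrop)   # TRUE
```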

> 
> I am not really looking for any particular solution, but was surprised by the different results from what I presumed to be the same query.  If it is believed to be a possible bug, I would be glad to package up the data that is generating the issue, but not sure where to place such a large data set.

I don't think you have yet demonstrated a bug.

-- 

David Winsemius
Alameda, CA, USA