[R] non-intuitive behaviour after type conversion

Don MacQueen macq at llnl.gov
Mon Nov 23 18:41:50 CET 2009


When you attach() something, it loads it into memory and there it 
stays. It is not a link, reference, or pointer to the original. 
Changing the original (the version in the dataframe), which is what 
you did, does not change the attached copy in memory. In essence, you 
did a type conversion on one copy, but afterwards started looking at 
the other copy.
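
Here is a minimal sketch of that sequence. The small data frame is a made-up stand-in for yours (only the names birth and t1stvisit come from your message), and the two result variables are my own names for illustration.

```r
# Hypothetical stand-in for the poster's data frame
birth <- data.frame(t1stvisit = c(1L, 1L, 2L, 3L, NA))

attach(birth)   # puts a copy of the columns on the search path

# Convert the column inside the data frame; the attached copy is untouched
birth$t1stvisit <- as.factor(birth$t1stvisit)

attached_is_factor <- is.factor(t1stvisit)        # looks at the attached copy
original_is_factor <- is.factor(birth$t1stvisit)  # looks at the data frame

detach(birth)   # clean up the search path
```

In a fresh session I would expect attached_is_factor to be FALSE and original_is_factor to be TRUE, which matches what you saw.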

See also the interjected comments below.

-Don

At 8:54 AM +0000 11/23/09, Alan Kelly wrote:
>Dear list,
>I have a data frame (birth) with mixed variables (numeric and 
>alphanumeric).  One variable "t1stvisit" was originally coded as 
>numeric with values 1, 2, and 3.  After attaching the data frame, this 
>is what I see when I use str(t1stvisit)
>
>$ t1stvisit: int  1 1 1 1 1 1 1 1 2 2 ...
>
>This is as expected.
>I then convert t1stvisit to a factor and to avoid creating a second 
>copy of this variable independent of the data frame I use:
>birth$t1stvisit = as.factor(birth$t1stvisit)
>if I check that the conversion has worked:
>is.factor(t1stvisit)
>[1] FALSE
>Now the only object present in the workspace is the data frame 
>"birth" and, as noted, I have not created any new variables. So why 
>does R still treat t1stvisit as numeric?
>is.factor(t1stvisit)
>[1] FALSE
>
>Yet when I try the following:
>>  is.factor(birth$t1stvisit)
>[1] TRUE
>So, there appear to be two versions of "t1stvisit" - the original 
>numeric version and the correct factor version - although ls() only 
>shows "birth" as present in the workspace.

Right.
   find('t1stvisit')
will show you every place on the search path where an object of that 
name is found.
If you type
    t1stvisit
at the prompt, you always get the first one. The one in the attached 
dataframe is the second one. Use the
   search()
function to show you the different locations, in the order R searches 
them, where objects can be found.
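
A small sketch of how find() and search() behave when a name exists in two places, run at top level in a fresh session. The data frame tmp, the variable x, and the result names are all made up for illustration; your own objects will differ.

```r
tmp <- data.frame(x = 1:3)   # hypothetical data frame with a column x
x <- c(10, 20, 30)           # a workspace object with the same name

attach(tmp)                  # R notes that x is masked by .GlobalEnv

where_x <- find("x")         # search path entries that contain an x
path    <- search()          # the search path, in lookup order
first_x <- x                 # plain x resolves to the first match

detach(tmp)                  # clean up the search path
```

Here the workspace (.GlobalEnv) comes first and the attached tmp second, so first_x holds the workspace values 10, 20, 30 rather than the column from tmp.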

When you did the attach(), did you get a message like:

>  attach(tmp)

         The following object(s) are masked _by_ .GlobalEnv :

          x

(yours would have referred to your variables, not the "x" in my example).
That message tells you you have two variables of the same name, 
stored in two different locations in the search path.

As a general rule, it's just plain confusing to have more than one 
object of the same name in more than one location. In your situation, 
I would get rid of the one that's not in the dataframe. But even 
then, if you change it in the dataframe you'll still need to detach 
and re-attach the dataframe, so using attach() is probably not the 
best choice in the long run. Maybe the with() function would meet 
your needs.
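
For instance, something along these lines, again with a made-up stand-in for your data frame and result names of my own choosing. with() evaluates the expression against the data frame as it is at that moment, so there is no stale copy to worry about.

```r
# Hypothetical stand-in for the poster's data frame
birth <- data.frame(t1stvisit = c(1L, 1L, 2L, 3L, NA))

# Convert the column in place, with no attach() involved
birth$t1stvisit <- as.factor(birth$t1stvisit)

# with() looks up t1stvisit in the current contents of birth
is_factor_now <- with(birth, is.factor(t1stvisit))
counts        <- with(birth, table(t1stvisit, useNA = "ifany"))
```

If you prefer to keep using attach(), the alternative is to detach(birth) and attach(birth) again after every change to the data frame, which is easy to forget.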

>If I type:
>>  summary(t1stvisit)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>   1.000   1.000   2.000   1.574   2.000   3.000  29.000
>I get the numeric version, but if I try
>summary(birth$t1stvisit)
>    1    2    3 NA's
>  180  169   22   29
>I get the factor version.
>
>Frankly I feel that this behaviour is non-intuitive and potentially 
>problematic. Nor have I seen warnings about this in the various text 
>books on R.
>Can anyone comment on why this should occur?
>Many thanks,
>Alan Kelly
>
>Dr. Alan Kelly
>Department of Public Health & Primary Care
>Trinity College Dublin
>


-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062



