[R] problem selecting rows meeting a criterion

Steve Lianoglou mailinglist.honeypot at gmail.com
Tue Aug 11 21:26:40 CEST 2009


Hi,

See comments in line:

On Aug 11, 2009, at 2:45 PM, Jim Bouldin wrote:

>
> No problem John, thanks for your help, and also thanks to Dan and Patrick.
> Wasn't able to read or try anybody's suggestions yesterday.  Here's what
> I've discovered in the meantime:
>
> What I did not include yesterday is that my original data frame, called
> "data", was this:
>
>   X Y       V3
> 1  1 1 0.000000
> 2  2 1 8.062258
> 3  3 1 2.236068
> 4  4 1 6.324555
> 5  5 1 5.000000
> 6  1 2 8.062258
> 7  2 2 0.000000
> 8  3 2 9.486833
> 9  4 2 2.236068
> 10 5 2 5.656854
> 11 1 3 2.236068
> 12 2 3 9.486833
> 13 3 3 0.000000
> 14 4 3 8.062258
> 15 5 3 5.099020
> 16 1 4 6.324555
> 17 2 4 2.236068
> 18 3 4 8.062258
> 19 4 4 0.000000
> 20 5 4 5.385165
> 21 1 5 5.000000
> 22 2 5 5.656854
> 23 3 5 5.099020
> 24 4 5 5.385165
> 25 5 5 0.000000
>
> To this data frame I applied the following command:
>
> data <- data[data$V3 >0,];data #to remove all rows where V3 = 0
>
> giving me this (the point from which I started yesterday):
>
>   X Y       V3
> 2  2 1 8.062258
> 3  3 1 2.236068
> 4  4 1 6.324555
> 5  5 1 5.000000
> 6  1 2 8.062258
> 8  3 2 9.486833
> 9  4 2 2.236068
> 10 5 2 5.656854
> 11 1 3 2.236068
> 12 2 3 9.486833
> 14 4 3 8.062258
> 15 5 3 5.099020
> 16 1 4 6.324555
> 17 2 4 2.236068
> 18 3 4 8.062258
> 20 5 4 5.385165
> 21 1 5 5.000000
> 22 2 5 5.656854
> 23 3 5 5.099020
> 24 4 5 5.385165
>
> So far so good.  But when I then submit the command
>> data = data[X>Y,] #to select all rows where X > Y

This won't work in general. It probably only runs without an error in
this particular case because you already have variables named X and Y
defined somewhere in your workspace.

What you wrote above isn't taking the values of X and Y from data$X and
data$Y, respectively, but rather from the variables X and Y defined
elsewhere.

Instead of doing data[X > Y,], do:

data[data$X > data$Y,]

This should get you what you're expecting.
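
A minimal sketch of what is probably going on, assuming stale copies of
X and Y (taken from the original 25-row frame) are sitting in the
workspace. The name `dat` and the way X and Y got into the workspace are
assumptions for illustration; the values are the ones from your post.

```r
# Rebuild the original 25-row data frame from the post.
dat <- data.frame(
  X  = rep(1:5, times = 5),
  Y  = rep(1:5, each = 5),
  V3 = c(0.000000, 8.062258, 2.236068, 6.324555, 5.000000,
         8.062258, 0.000000, 9.486833, 2.236068, 5.656854,
         2.236068, 9.486833, 0.000000, 8.062258, 5.099020,
         6.324555, 2.236068, 8.062258, 0.000000, 5.385165,
         5.000000, 5.656854, 5.099020, 5.385165, 0.000000)
)

# Assumption: copies of the columns exist as free-standing variables,
# e.g. left over from an earlier assignment or attach().
X <- dat$X
Y <- dat$Y

# Drop the rows where V3 is 0. dat now has 20 rows,
# but the workspace X and Y still have 25 elements.
dat <- dat[dat$V3 > 0, ]

wrong <- dat[X > Y, ]          # indexes with the 25-element workspace vectors
right <- dat[dat$X > dat$Y, ]  # indexes with the columns of the filtered frame

rownames(wrong)
rownames(right)
```

Under that assumption, `wrong` picks rows by position using a logical
vector that no longer lines up with the filtered frame, which would
explain the rows you are seeing.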

> I get the problem result already mentioned, namely:
>
>   X Y       V3
> 3  3 1 2.236068
> 4  4 1 6.324555
> 5  5 1 5.000000
> 6  1 2 8.062258
> 10 5 2 5.656854
> 11 1 3 2.236068
> 12 2 3 9.486833
> 17 2 4 2.236068
> 18 3 4 8.062258
> 24 4 5 5.385165
>
> which is clearly wrong!  It doesn't matter if I give a new name to the
> data frame at each step or not, or whether I use the name "data" or not.
> It always gives the same wrong answer.
>
> However, if I instead use the command:
> subset(data, X>Y), I get the right answer, namely:
>
>   X Y       V3
> 2  2 1 8.062258
> 3  3 1 2.236068
> 4  4 1 6.324555
> 5  5 1 5.000000
> 8  3 2 9.486833
> 9  4 2 2.236068
> 10 5 2 5.656854
> 14 4 3 8.062258
> 15 5 3 5.099020
> 20 5 4 5.385165

That's because subset(...) evaluates the condition inside the data
frame you pass it, so X and Y in that call mean data$X and data$Y.
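
As a rough illustration on a toy data frame (the values and the name
`dat` are made up for this sketch), the three forms below should select
the same rows even when unrelated X and Y exist in the workspace:

```r
# Toy data frame; only the second row has X > Y.
dat <- data.frame(X = c(1, 3, 2), Y = c(2, 1, 2))

# Hypothetical unrelated variables with the same names in the workspace.
X <- c(9, 0, 9)
Y <- c(0, 9, 0)

by_subset  <- subset(dat, X > Y)        # X, Y looked up in dat first
by_with    <- dat[with(dat, X > Y), ]   # with() also evaluates inside dat
by_dollar  <- dat[dat$X > dat$Y, ]      # explicit column references
by_globals <- dat[X > Y, ]              # uses the workspace X and Y instead

rownames(by_subset)
rownames(by_globals)
```

Only the last form reads the workspace variables, so it is the only one
whose result depends on what else happens to be defined.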

> OK so the lesson so far is "use the subset function".

Hopefully you're learning a slightly different lesson now :-)

Does that clear things up at all?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



