[R] problem selecting rows meeting a criterion

Tue Aug 11 20:45:01 CEST 2009

No problem John, thanks for your help, and also thanks to Dan and Patrick.
Wasn't able to read or try anybody's suggestions yesterday.  Here's what
I've discovered in the meantime:

What I did not include yesterday is that my original data frame, called
"data", was this:

   X Y       V3
1  1 1 0.000000
2  2 1 8.062258
3  3 1 2.236068
4  4 1 6.324555
5  5 1 5.000000
6  1 2 8.062258
7  2 2 0.000000
8  3 2 9.486833
9  4 2 2.236068
10 5 2 5.656854
11 1 3 2.236068
12 2 3 9.486833
13 3 3 0.000000
14 4 3 8.062258
15 5 3 5.099020
16 1 4 6.324555
17 2 4 2.236068
18 3 4 8.062258
19 4 4 0.000000
20 5 4 5.385165
21 1 5 5.000000
22 2 5 5.656854
23 3 5 5.099020
24 4 5 5.385165
25 5 5 0.000000

To this data frame I applied the following command:

data <- data[data$V3 >0,];data #to remove all rows where V3 = 0

giving me this (the point from which I started yesterday):

   X Y       V3
2  2 1 8.062258
3  3 1 2.236068
4  4 1 6.324555
5  5 1 5.000000
6  1 2 8.062258
8  3 2 9.486833
9  4 2 2.236068
10 5 2 5.656854
11 1 3 2.236068
12 2 3 9.486833
14 4 3 8.062258
15 5 3 5.099020
16 1 4 6.324555
17 2 4 2.236068
18 3 4 8.062258
20 5 4 5.385165
21 1 5 5.000000
22 2 5 5.656854
23 3 5 5.099020
24 4 5 5.385165

So far so good.  But when I then submit the command
> data = data[X>Y,] #to select all rows where X > Y

I get the problem result already mentioned, namely:

   X Y       V3
3  3 1 2.236068
4  4 1 6.324555
5  5 1 5.000000
6  1 2 8.062258
10 5 2 5.656854
11 1 3 2.236068
12 2 3 9.486833
17 2 4 2.236068
18 3 4 8.062258
24 4 5 5.385165

which is clearly wrong!  It doesn't matter if I give a new name to the data
frame at each step or not, or whether I use the name "data" or not.  It
always gives the same wrong answer.

However, if I instead use the command:
subset(data, X>Y), I get the right answer, namely:

   X Y       V3
2  2 1 8.062258
3  3 1 2.236068
4  4 1 6.324555
5  5 1 5.000000
8  3 2 9.486833
9  4 2 2.236068
10 5 2 5.656854
14 4 3 8.062258
15 5 3 5.099020
20 5 4 5.385165

OK, so the lesson so far is "use the subset function".  But here it gets
weirder.  If I instead go straight from the initial data frame ("data",
given at the top of this post) and select only rows where X>Y, skipping the
intermediate step of removing rows with V3 = 0 (a step that is unnecessary
for the result I want, but very relevant to the larger issue here), then the
command that caused me the original trouble (data = data[X>Y,]) gives the
RIGHT answer (the data frame just above).  The subset function also gives
the right answer.  Now what in the world is going on?  This kind of thing
scares me.
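
One explanation that would fit this pattern, offered as a sketch and not a
confirmed diagnosis: inside data[ , ] the names X and Y are looked up as
ordinary workspace variables, not as columns of the data frame, whereas
subset() evaluates the condition inside the data frame.  The sketch assumes
that length-25 vectors X and Y are still present in the workspace; the names
full, filtered, wrong and right, and the placeholder V3 values, are made up
for illustration.

```r
# Assumption: X and Y exist in the workspace as length-25 vectors,
# e.g. left over from building the coordinate columns.
X <- rep(1:5, times = 5)
Y <- rep(1:5, each = 5)
V3 <- ifelse(X == Y, 0, 1)   # placeholder distances: 0 on the diagonal only

full <- data.frame(X = X, Y = Y, V3 = V3)   # 25 rows
filtered <- full[full$V3 > 0, ]             # 20 rows, diagonal removed

# The workspace X and Y (length 25) are used here, not the columns of
# `filtered`, so the logical index no longer lines up with the 20 rows.
wrong <- filtered[X > Y, ]

# Referring to the columns explicitly selects the intended rows.
right <- filtered[filtered$X > filtered$Y, ]

rownames(wrong)   # "3" "4" "5" "6" "10" "11" "12" "17" "18" "24"
rownames(right)   # "2" "3" "4" "5" "8" "9" "10" "14" "15" "20"
```

Under that assumption the 25-row frame gives the right answer only because
the workspace vectors happen to line up with its rows one for one.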

Below is the full set of commands starting from scratch: 

# The point of the following is to measure the pairwise Euclidean distances
# between 5 objects, each having X and Y coordinates, and put them into a
# data frame that labels each pair and gives the distance between them.

d = data.frame(x=sample(1:10, 5), y=sample(1:10, 5)) # create a sample data set
ss2 = as.data.frame(as.matrix(dist(d))) # create a data.frame to extract row and column names
X = rep(seq(1:length(row.names(ss2))), length(names(ss2))) # make a vector containing the X coordinate names
Y = rep(seq(1:length(names(ss2))), length(row.names(ss2))) # the same for Y
Y = sort(Y) # first sort
coords = cbind(X, Y);rm(X,Y) # then cbind and remove X and Y
data1 = as.data.frame(cbind(coords, as.vector(as.matrix(dist(d)))));rm(coords) # column bind the 3 vectors
data2 = data1[data1$V3 >0,] # remove those with V3 = 0 (= the original matrix diagonal)
data3 = data2[X>Y,] # remove duplicates from original distance matrix
data1;data2;data3
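
For comparison, the goal stated in the comments above (one row per distinct
pair, with its distance) could perhaps be reached without free-standing X and
Y vectors at all.  This is only a sketch of an alternative, not the method
used above: it picks the lower triangle of the distance matrix with
lower.tri(), and the names m, keep and pairs, as well as the fixed seed, are
assumptions for illustration.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
d <- data.frame(x = sample(1:10, 5), y = sample(1:10, 5))
m <- as.matrix(dist(d))            # 5 x 5 symmetric distance matrix

keep <- lower.tri(m)               # TRUE where row index > column index
pairs <- data.frame(X  = row(m)[keep],   # row index of each kept cell
                    Y  = col(m)[keep],   # column index of each kept cell
                    V3 = m[keep])        # the distance for that pair
pairs
```

Because the diagonal and the upper triangle are never selected, neither the
V3 > 0 filter nor the X > Y filter is needed afterwards.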

Thoughts much appreciated.  Thanks.
Jim Bouldin

> 
> Clearly I was more tired than I realised last night. :( My apologies.
> 
> In any case with the data.frame name changed to xx this seems to give you
> what you want
> 
>   subset(xx, xx[,1] > xx[,2])
> 
> or using the data name
>    subset(data, data[,1] > data[,2])  
> should work as well