[BioC] Merge dataframes

Sat Oct 8 17:45:56 CEST 2011

Hi everybody,

Paniagua explanation and solution worked very good! Thank you a lot!
I got the code to what I need.

As Paniagua said, I was losing row.names when I merged a data.frame  
with a factor object. Instead of merging data1 to data2$color, I  
merged it to data2["color"]. It preserves row.names.

When I do:

d1 <- merge(data1, data2["color"], by=0, all.x=T)

It creates a new variable on column 1, with the original row.names, so  
I have to transform this data.frame in a new one, using this variable  
as the row.names:

d2 <- data.frame(d1, row.names=1)

It does the job, but row.names are not in the same order as in data1,  
so I reorder it:

d2[row.names(data1),]

And it gives me the dataframe data1, binded to the new variable color,  
with the NAs in the proper position.

Thank you all!

Cheers,

Joao D.

Em 07/10/2011, às 20:36, Paniagua, Eric escreveu:

> Hi Joao,
>
> To save space, I'm not going to paste the data frames in.  Just  
> start a session, run your original code, and then try out the  
> commands below.
>
> The problem is a simple type confusion.  If you check the class of  
> data2$color, you'll see it's not a data frame:
>
>> class(data2$color)
> [1] "factor"
>
> The problem is that a factor is not a data frame, but merge expects  
> data frames as arguments.  If either one is not a data frame, merge  
> will coerce the type if possible.  In this case, that means that  
> passing data2$color to merge is equivalent to this:
>
>> merge(data1, as.data.frame(data2$color), by=0, all.x=T)
>
> If you inspect the coerced version of the argument, you'll see that  
> new row names have been assigned during coercion.  Well, technically  
> not "new" ones because a factor does not have row names.  In other  
> words, the row names were lost when you used the '$'.  Just run the  
> following to see:
>
>> data2$color
>> as.data.frame(data2$color)
>
> The solution is to just subset the data frame instead of pulling out  
> the factor.  If you replace data2$color in your code with  
> data2["color"], your code behaves as intended.
>
> A good first clue in discovering this issue is the fact that in your  
> original code, you end up in the <NA> elements all at the bottom,  
> rows 16-20, every time.  The odds of this specific arrangement  
> happening any one particular time are 1 in 15504, or about  
> 0.00645%.  Coincidentally, the 15 non <NA> values are occupying rows  
> 1-15, making the hypothesis that your rows are being relabeled  
> somewhere along the way much more probable.  And, in this case,  
> correct.
>
> Underlying that insight or intuition is the much more general  
> debugging approach of actually checking and looking at every  
> intermediate value in your computation that you can.  And, of  
> course, reading the documentation :)
>
> For instance, check out the class and value of data2["color"]:
>
>> class(data2["color"])
> [1] "data.frame"
>
> You can also see that data2["color"] preserves the row names:
>
>> data2["color"]
>
> You can try this to verify that with my solution the values work out  
> correctly (though I'm sure there's a better way to do it directly in  
> R):
>
>> data3 <- merge(data1, data2["color"], by=0, all.x=T)
>> data3
>> data4 <- data.frame(data3, row.names=1)
>> data4
>> data4[row.names(data1),]
>> data5 <- merge(data1, data4, by=0)
>> data5
>> data6 <- data.frame(data5, row.names=1)
>> data6
>> data6[row.names(data1),]
>
> (Of course, I'm using way more variables there than are required for  
> the computation.  Consider their addition as purely didactic.)
>
> Anyway, I hope that helps!  Happy scripting.
>
> Best,
> Eric
>
> ________________________________________
> From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org 
> ] on behalf of João Daniel Nunes Duarte [jdanielnd at gmail.com]
> Sent: Friday, October 07, 2011 6:32 PM
> Subject: [BioC] Merge dataframes
>
> Hello,
>
> I am having some problems to use the 'merge' function. I'm not sure  
> if I got
> its working right.
>
> What I want to do is:
>
> 1) Suppose I have a dataframe like:
>
>        height        width
> 1        1.1                2.3
> 2        2.1                2.5
> 3        1.8                1.9
> 4        1.6                2.1
> 5        1.8                2.4
>
> 2) And I generate a second dataframe sampled from this one, like:
>
>        height        width
> 1        1.1                2.3
> 3        1.8                1.9
> 5        1.8                2.4
>
> 3) Next, I add a new variable from this dataframe:
>
>        height        width        color
> 1        1.1                2.3            red
> 3        1.8                1.9            red
> 5        1.8                2.4            blue
>
> 4) So, I want to merge those dataframes, so that the new variable,  
> color, is
> binded to the first dataframe. Of course some cases won't have value  
> for it,
> since I generated this variable in a smaller dataframe. In those  
> cases I
> want the value to be NA. The result dataframe should be:
>
>        height        width        color
> 1        1.1                2.3            red
> 2        2.1                2.5            NA
> 3        1.8                1.9            red
> 4        1.6                2.1            NA
> 5        1.8                2.4            blue
>
> I have written some codes, but they're not working properly. The new
> variable has its values mixed up, and they do not correspond to its
> row.names.
>
> # Generate the first dataframe
> data1 <- data.frame(height=rnorm(20,3,0.2),width=rnorm(20,2,0.5))
> # Sample a smaller dataframe from data1
> data2 <- data1[sample(1:20,15,replace=F),]
> # Generate the new variable
> color <- sample(c("red","blue"),15,replace=T)
> # Bind the new variable to data2
> data2 <- cbind(data2, color)
> # Merge the data1 and data2$color by row.names, and force it to has  
> the same
> values that data1. Next it generates a new dataframe where column 1  
> is the
> row.name, and then sort it by the row.name from data1.
> data.frame(merge(data1,data2$color, by=0,
> all.x=T),row.names=1)[row.names(data1),]
>
> I'm not sure what am I doing wrong. Can anyone see where the mistake  
> is?
>
> Thank you!
>
> Cheers,
>
> Joao D.
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor