[R] Difficulty with 'merge'

Thu Jan 5 09:45:28 CET 2006

Dear Michael

Please note that merge returns all possible combinations of
matching rows when the key column contains repeated values, as
you can see in the example below.

?merge

"... If there is more than one match, all possible matches
contribute one row each. ..."

Maybe you can first apply "aggregate" to your data.frame in a
reasonable way, to summarize the repeated values into unique
ones, and then proceed with merge, but that depends on your
problem.
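
A minimal sketch of that aggregate-then-merge idea, assuming that
the mean is an acceptable summary of the repeated values. Both the
choice of mean() and the toy data are assumptions made for
illustration; they are not taken from the data in the question.

```r
# Toy data with repeated keys (made up for illustration)
f1 <- data.frame(v1 = c("a", "b", "a", "b", "a"), n1 = 1:5)
f2 <- data.frame(v2 = c("b", "b", "a", "a", "a"), n2 = 6:10)

# Collapse each data frame to one row per key.
# mean() is an assumed summary; use whatever suits the problem.
a1 <- aggregate(f1["n1"], by = list(v1 = f1$v1), FUN = mean)
a2 <- aggregate(f2["n2"], by = list(v2 = f2$v2), FUN = mean)

# Each key is now unique on both sides, so merge returns
# exactly one row per key instead of all combinations.
m12 <- merge(a1, a2, by.x = "v1", by.y = "v2")
m12
```

Whether a summary like this is appropriate depends on what the
repeated rows mean in the data at hand.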

Regards,

Christoph

--------------------------------------------------------------
Christoph Buser <buser at stat.math.ethz.ch>
Seminar fuer Statistik, LEO C13
ETH (Federal Inst. Technology)	8092 Zurich	 SWITZERLAND
phone: x-41-44-632-4673		fax: 632-1228
http://stat.ethz.ch/~buser/
--------------------------------------------------------------

example with repeated values
----------------------------

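# Two data frames whose key columns contain repeated values
v1 <- c("a", "b", "a", "b", "a")
n1 <- 1:5
v2 <- c("b", "b", "a", "a", "a")
n2 <- 6:10
(f1 <- data.frame(v1, n1))
(f2 <- data.frame(v2, n2))
# Every "a" row in f1 is paired with every "a" row in f2 (likewise for "b")
(m12 <- merge(f1, f2, by.x = "v1", by.y = "v2", sort = FALSE))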
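
The row count in the example above can be checked directly: a key
that occurs k1 times in the first data frame and k2 times in the
second contributes k1 * k2 rows. The same rule accounts for the
1681 = 41 * 41 rows per name in the quoted question. The sketch
below is only an illustration; the helper names c1, c2, keys and
expected are made up here.

```r
f1 <- data.frame(v1 = c("a", "b", "a", "b", "a"), n1 = 1:5)
f2 <- data.frame(v2 = c("b", "b", "a", "a", "a"), n2 = 6:10)

# How often each key occurs on either side
c1 <- table(f1$v1)
c2 <- table(f2$v2)

# Keys present in both data frames
keys <- intersect(names(c1), names(c2))

# Expected number of merged rows per key: product of the two counts
expected <- as.vector(c1[keys]) * as.vector(c2[keys])
names(expected) <- keys
expected        # a: 3 * 3 = 9, b: 2 * 2 = 4

# The merged data frame has exactly that many rows in total
m12 <- merge(f1, f2, by.x = "v1", by.y = "v2", sort = FALSE)
nrow(m12) == sum(expected)   # TRUE (13 rows)
```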

Michael Kubovy writes:
 > Dear R-helpers,
 > 
 > Happy New Year to all the helpful members of the list.
 > 
 > Here is the behavior I'm looking for:
 >  > v1 <- c("a","b","c")
 >  > n1 <- c(0, 1, 2)
 >  > v2 <- c("c", "a", "b")
 >  > n2 <- c(0, 1 , 2)
 >  > (f1  <- data.frame(v1, n1))
 >    v1 n1
 > 1  a  0
 > 2  b  1
 > 3  c  2
 >  > (f2 <- data.frame(v2, n2))
 >    v2 n2
 > 1  c  0
 > 2  a  1
 > 3  b  2
 >  > (m12 <- merge(f1, f2, by.x = "v1", by.y = "v2", sort = F))
 >    v1 n1 n2
 > 1  c  2  0
 > 2  a  0  1
 > 3  b  1  2
 > 
 > Now to my data:
 >  > summary(pL)
 >          pairL
 > a fondo   :  41
 > alto      :  41
 > ampio     :  41
 > angoloso  :  41
 > aperto    :  41
 > appoggiato:  41
 > (Other)   :1271
 > 
 >  > pL$pairL[c(1,42)]
 > [1] appoggiato dentro
 > 37 Levels: a fondo alto ampio angoloso aperto appoggiato asimmetrico complicato convesso davanti dentro destra ... verticale
 > 
 >  > summary(oppN)
 >          pairL              pairR         subject            L                LL                RR               M
 > a fondo   :  41   a galla    :  41   S1     :  37   Min.   :0.3646   Min.   :0.02083   Min.   :0.0010   Min.   :0.0000
 > alto      :  41   acuto      :  41   S10    :  37   1st Qu.:0.5521   1st Qu.:0.37500   1st Qu.:0.1771   1st Qu.:0.1042
 > ampio     :  41   arrotondato:  41   S11    :  37   Median :0.6354   Median :0.47917   Median :0.2708   Median :0.2292
 > angoloso  :  41   basso      :  41   S12    :  37   Mean   :0.6403   Mean   :0.46452   Mean   :0.2760   Mean   :0.2598
 > aperto    :  41   chiuso     :  41   S13    :  37   3rd Qu.:0.7188   3rd Qu.:0.55208   3rd Qu.:0.3750   3rd Qu.:0.3854
 > appoggiato:  41   compl      :  41   S14    :  37   Max.   :0.9375   Max.   :0.92708   Max.   :0.6042   Max.   :0.7812
 > (Other)   :1271   (Other)    :1271   (Other):1295                                      NA's   :3.0000   NA's   :3.0000
 >        asym             polar            polar_a1          clust
 > Min.   :-0.5555   Min.   :-1.2410   Min.   :-2.949e+00   c1:492
 > 1st Qu.: 0.2091   1st Qu.: 0.4571   1st Qu.:-1.902e-01   c2:287
 > Median : 0.5555   Median : 1.1832   Median :-1.110e-16   c3: 82
 > Mean   : 0.6265   Mean   : 1.3428   Mean   :-5.745e-02   c4:246
 > 3rd Qu.: 0.9383   3rd Qu.: 2.0712   3rd Qu.: 1.168e-01   c5: 82
 > Max.   : 2.7081   Max.   : 4.6151   Max.   : 4.218e+00   c6:328
 >                     NA's   : 3.0000   NA's   : 3.000e+00
 > 
 >  > oppN$pairL[c(1,42)]
 > [1] spesso fine
 > 37 Levels: a fondo alto ampio angoloso aperto appoggiato asimmetrico complicato convesso davanti dentro destra ... verticale
 > 
 >  > unique(sort(oppM$pairL)) == unique(sort(pL$pairL))
 > [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  
 > TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 > [26] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 > 
 > In other words, I think that pL$pairL and oppN$pairL each consist
 > of 37 blocks of 41 repetitions of names, and that these blocks are
 > permutations of each other.
 > 
 > However:
 > 
 >  > summary(m1 <- merge(oppM, pairL, by.x = "pairL", by.y = "pairL", sort = F))
 >          pairL               pairR          subject             L                LL                RR               M
 > a fondo   : 1681   a galla    : 1681   S1     : 1517   Min.   :0.3646   Min.   :0.02083   Min.   :0.0010   Min.   :0.0000
 > alto      : 1681   acuto      : 1681   S10    : 1517   1st Qu.:0.5521   1st Qu.:0.37500   1st Qu.:0.1771   1st Qu.:0.1042
 > ampio     : 1681   arrotondato: 1681   S11    : 1517   Median :0.6354   Median :0.47917   Median :0.2708   Median :0.2292
 > angoloso  : 1681   basso      : 1681   S12    : 1517   Mean   :0.6398   Mean   :0.46402   Mean   :0.2760   Mean   :0.2598
 > aperto    : 1681   chiuso     : 1681   S13    : 1517   3rd Qu.:0.7188   3rd Qu.:0.55208   3rd Qu.:0.3750   3rd Qu.:0.3854
 > appoggiato: 1681   compl      : 1681   S14    : 1517   Max.   :0.9375   Max.   :0.92708   Max.   :0.6042   Max.   :0.7812
 > (Other)   :51988   (Other)    :51988   (Other):52972
 >        asym             polar            polar_a1          clust
 > Min.   :-0.5555   Min.   :-1.2410   Min.   :-2.949e+00   c1:20172
 > 1st Qu.: 0.2091   1st Qu.: 0.4571   1st Qu.:-1.904e-01   c2:11644
 > Median : 0.5555   Median : 1.1832   Median :-1.110e-16   c3: 3362
 > Mean   : 0.6234   Mean   : 1.3428   Mean   :-5.745e-02   c4:10086
 > 3rd Qu.: 0.9383   3rd Qu.: 2.0712   3rd Qu.: 1.169e-01   c5: 3362
 > Max.   : 2.7081   Max.   : 4.6151   Max.   : 4.218e+00   c6:13448
 > 
 > I was expecting pairL to be 41 items long, not 1681 = 41^2.
 > _____________________________
 > Professor Michael Kubovy
 > University of Virginia
 > Department of Psychology
 > USPS:     P.O.Box 400400    Charlottesville, VA 22904-4400
 > Parcels:    Room 102        Gilmer Hall
 >          McCormick Road    Charlottesville, VA 22903
 > Office:    B011    +1-434-982-4729
 > Lab:        B019    +1-434-982-4751
 > Fax:        +1-434-982-4766
 > WWW:    http://www.people.virginia.edu/~mk9y/