[R] possible bug in merge with duplicate blank names in 'by' field.

Fri Jun 17 05:03:55 CEST 2005

What version of R are you using? I don't get the same
result on my system:

> R.version.string # Windows XP
[1] "R version 2.1.0, 2005-06-10"
> p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40); d1 <-
+ data.frame(Promoter=p, ip=a) # Note duplicate empty names in p.
> p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40); d2 <-
+ data.frame(Promoter=p, ip=a)
> all <- merge(x=d1, y=d2, by="Promoter", all=T)
> all <- merge(x=all, y=d2, by="Promoter", all=T)
> all
  Promoter ip.x ip.y ip
1            30   40 40
2            40   40 40
3        a   10   NA NA
4        c   20   20 20
5        b   NA   15 15
6        d   NA   30 30
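
As far as I can tell, the six rows above are what you get if merge() pairs every row of x with every row of y that shares a key: the two blank rows in d1 each match the single blank row in d2. Here is a minimal sketch of that check on the same data. The variable names (blank_x, blank_y, m1, m2) are mine, and the counts in the comments assume merge() behaves as in the session above.

```r
# Same data as in the session above.
d1 <- data.frame(Promoter = c("a", "c", "", ""), ip = c(10, 20, 30, 40))
d2 <- data.frame(Promoter = c("b", "c", "d", ""), ip = c(15, 20, 30, 40))

# Count the blank keys on each side before merging.
blank_x <- sum(d1$Promoter == "")   # two blanks in d1
blank_y <- sum(d2$Promoter == "")   # one blank in d2

# First merge: each blank row of d1 is paired with each blank row of d2.
m1 <- merge(x = d1, y = d2, by = "Promoter", all = TRUE)
blank_m1 <- sum(m1$Promoter == "")

# Second merge against d2 again, as in the original post.
m2 <- merge(x = m1, y = d2, by = "Promoter", all = TRUE)

cat("blank rows after first merge:", blank_m1,
    "(expected", blank_x * blank_y, ")\n")
cat("total rows after second merge:", nrow(m2), "\n")
```

If that reading is right, any duplicated key would multiply rows in the same way, and the empty string is not treated specially.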

On 6/16/05, Frank Gibbons <fgibbons at hms.harvard.edu> wrote:
> Run this:
> 
> >p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40); d1 <-
> >data.frame(Promoter=p, ip=a) # Note duplicate empty names in p.
> >p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40); d2 <-
> >data.frame(Promoter=p, ip=a)
> >all <- merge(x=d1, y=d2, by="Promoter", all=T)
> >all <- merge(x=all, y=d2, by="Promoter", all=T)
> >all
> 
> Data is this:
> 
> >d1
> >   Promoter ip
> >1        a 10
> >2        c 20
> >3          30
> >4          40
> >
> >d2
> >   Promoter ip
> >1        b 15
> >2        c 20
> >3        d 30
> >4          40
> 
> Output looks like this:
> 
> >   Promoter ip.x ip.y ip
> >1            40   30 30
> >2            40   40 30
> >3            40   30 40
> >4            40   40 40
> >5        b   15   NA NA
> >6        c   20   20 20
> >7        d   30   NA NA
> >8        a   NA   10 10
> 
> The weird thing about this (in my view) is that each instance of '' seems
> to be considered unique, so with each successive merge, all combinatorial
> possibilities are explored, like a SQL cross join (Cartesian product). For
> non-empty names, an inner join is performed.
> 
> Dealing with genomic data (10^4 data points), it's easy to have a couple of
> blanks buried in the middle of things, and to combine several replicates
> with successive merges. I couldn't understand why my three replicates of
> 6000 points, in which I expected substantial overlap in the labels, were
> taking so long to merge and ultimately generating 57000 labels. The culprit
> turned out to be a few hundred blanks buried in the middle.
> 
> Why does the empty ("null") name merit special treatment? Perhaps I'm
> missing something. I hesitate to submit this as a bug, since technically I
> guess you could say that blank names, especially duplicates, are not
> kosher. But on the other hand, this combinatorial behaviour seems to occur
> only for blanks.
> 
> -Frank
> 
> PhD, Computational Biologist,
> Harvard Medical School BCMP/SGM-322, 250 Longwood Ave, Boston MA 02115, USA.
> Tel: 617-432-3555       Fax: 617-432-3557
> http://llama.med.harvard.edu/~fgibbons
> 
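
If the blanks are just missing labels, one option is to check for repeated keys and drop the blank ones before merging. This is a rough sketch, assuming a blank Promoter carries no information you need to keep. drop_blank_keys is a hypothetical helper of my own, not something that ships with R.

```r
# Hypothetical helper (not part of R): drop rows whose key is "" or NA.
drop_blank_keys <- function(df, key = "Promoter") {
  k <- as.character(df[[key]])
  df[!is.na(k) & k != "", , drop = FALSE]
}

d1 <- data.frame(Promoter = c("a", "c", "", ""), ip = c(10, 20, 30, 40))
d2 <- data.frame(Promoter = c("b", "c", "d", ""), ip = c(15, 20, 30, 40))

# anyDuplicated() returns 0 when every key is unique; a non-zero value means
# some key occurs more than once, so merge() would pair up all of its rows.
dup_in_d1 <- anyDuplicated(as.character(d1$Promoter)) > 0

# Merge only the rows that have a usable key.
clean <- merge(x = drop_blank_keys(d1), y = drop_blank_keys(d2),
               by = "Promoter", all = TRUE)
print(clean)
```

Checking anyDuplicated() on each replicate before a chain of merges should flag this kind of problem early, whether the repeated key is blank or not.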