[R] problems with merge() - the output has many repeated lines

Cecilia Carmo cecilia.carmo at ua.pt
Mon Aug 23 15:51:46 CEST 2010


Thank you all for your help and patience.

I ran table(duplicated(df1[, c("firm","year")])), as 
William Dunlap suggested, and found repeated rows in df1.
R is always right!
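
A minimal sketch of how a duplicated firm/year key produces repeated rows in merge(), using small made-up data frames (the values and the sales/assets columns are assumptions, not the data discussed in this thread):

```r
# Toy data (hypothetical): df1 contains the key A/2009 twice.
df1 <- data.frame(firm = c("A", "A", "B"),
                  year = c(2009, 2009, 2009),
                  sales = c(10, 10, 20))
df2 <- data.frame(firm = c("A", "B"),
                  year = c(2009, 2009),
                  assets = c(100, 200))

# Count the rows whose firm/year key already appeared earlier in df1.
dup_counts <- table(duplicated(df1[, c("firm", "year")]))
print(dup_counts)

# Every matching pair of rows contributes one output row,
# so the duplicated key in df1 shows up as a repeated row in m.
m <- merge(df1, df2, by = c("firm", "year"))
print(m)
```

If both data frames repeat a key, the counts multiply: a key that occurs 3 times in df1 and 4 times in df2 yields 12 rows in the merged result.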

I really believed that my data could not contain repeated 
rows. I now have another problem, which is to find out why 
this happened with my data, but that has nothing to do 
with R!
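
One possible way to start on that question, sketched on made-up data (the sales column and all values are hypothetical): list every row whose firm/year key occurs more than once, then check whether the repeats are exact copies of a whole row or differ in some other column.

```r
# Toy data (hypothetical): A/2009 is an exact copy, C/2008 differs in sales.
df1 <- data.frame(firm = c("A", "A", "B", "C", "C"),
                  year = c(2009, 2009, 2009, 2008, 2008),
                  sales = c(10, 10, 20, 30, 35))

key <- df1[, c("firm", "year")]

# Flag every row whose key occurs more than once, first occurrence included.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dups <- df1[is_dup, , drop = FALSE]
print(dups[order(dups$firm, dups$year), ])

# Rows that repeat an earlier row in every column (exact copies).
exact_copy <- duplicated(df1)
print(sum(exact_copy))
```

Exact copies can be dropped with unique(df1); keys that repeat with different values usually need a look at how the data were assembled.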

Thank you again and again,

Cecília Carmo
Universidade de Aveiro
Portugal


On Sun, 22 Aug 2010 13:15:36 -0700,
  "William Dunlap" <wdunlap at tibco.com> wrote:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org 
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Cecilia Carmo
>> Sent: Sunday, August 22, 2010 10:24 AM
>> To: Erik Iverson
>> Cc: r-help at r-project.org; Hadley Wickham
>> Subject: Re: [R] problems with merge() - the output has many
>> repeated lines
>> 
>> I have done
>> intersect(names(df1), names(df2))
>> [1] "firm" "year"
>> 
>> This is the key I used to merge
>> merge(df1,df2,by=c("firm","year"))
>> 
>> And for each firm/year there is just one row in df1 that matches 
>> a firm/year row in df2. df1 has more firm/year rows than df2, 
>> and those extra rows don't match any row in df2.
> 
> To get to the bottom of this you may have to show
> us some of the relevant rows of data (80000 rows
> per dataset would be a lot to mail out).  For starters
> it would be nice to see the output of 
>   str(df1)
>   str(df2)
>   str(m) # where m is merge(df1,df2)
> Then it would be nice to see the output of
>   table(duplicated(df1[, c("firm","year")]))
> and the same for df2 and m.
> 
> You said you saw many repeated rows in the output of
> merge(df1,df2,...), which I am calling 'm'.  Say the i'th
> row is one of the repeated rows.  What are the outputs of
>   df1[ df1$firm==m$firm[i] & df1$year==m$year[i], ,drop=FALSE]
>   df2[ df2$firm==m$firm[i] & df2$year==m$year[i], ,drop=FALSE]
>   m[ m$firm==m$firm[i] & m$year==m$year[i], ,drop=FALSE]
> ?
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com 
> 
>> Cecília
>> 
>> On Sun, 22 Aug 2010 12:09:57 -0500,
>>   Erik Iverson <eriki at ccbr.umn.edu> wrote:
>> > Cecilia -
>> > 
>> > Find which columns you're matching on,
>> > 
>> > intersect(names(df1), names(df2)),
>> > 
>> > Maybe that will shed some light on the issue.
>> > 
>> > On 08/22/2010 12:02 PM, Cecilia Carmo wrote:
>> >> Thanks, but I don't have multiple matches and the lines
>> >> repeated in the final dataframe are exactly equal in all columns.
>> >>
>> >> Cecília
>> >>
>> >> On Sat, 21 Aug 2010 10:58:53 -0500,
>> >> Hadley Wickham <hadley at rice.edu> wrote:
>> >>> You may find a close reading of ?merge helpful, particularly
>> >>> this sentence: "If there is more than one match, all possible
>> >>> matches contribute one row each" (so check that you don't have
>> >>> multiple matches).
>> >>>
>> >>> Hadley
>> >>>
>> >>> On Sat, Aug 21, 2010 at 10:45 AM, Cecilia Carmo
>> >>> <cecilia.carmo at ua.pt> wrote:
>> >>>> Hi everyone,
>> >>>>
>> >>>> I have been merging many big dataframes (about 80000 rows
>> >>>> each) and I never had this problem, but now it happened to
>> >>>> me and I want to know if someone knows what could be
>> >>>> happening.
>> >>>> The final dataframe has many rows, an impossible number!
>> >>>> I have done edit(dataframe) and I saw that there are many
>> >>>> repeated rows (all equal).
>> >>>>
>> >>>> Thanks for any help,
>> >>>>
>> >>>> Cecília Carmo
>> >>>> Universidade de Aveiro
>> >>>> Portugal
>> >>>>
>> >>>> ______________________________________________
>> >>>> R-help at r-project.org mailing list
>> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>> PLEASE do read the posting guide
>> >>>> http://www.R-project.org/posting-guide.html
>> >>>> and provide commented, minimal, self-contained, reproducible code.
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Assistant Professor / Dobelman Family Junior Chair
>> >>> Department of Statistics / Rice University
>> >>> http://had.co.nz/
>> >>
>> >
>> 


