[R] sqldf: issues with natural joins

Thu May 20 18:42:36 CEST 2010

Although that works I had meant to write:

> names(B)[2] <- "dfNameB"
> # ... other commands
> sqldf('select * from main.A natural join main.B')

so that now only Tid is in common so the natural join just picks it up
and also the heuristic works again since we no longer retrieve
duplicate column names.

On Thu, May 20, 2010 at 12:32 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> There are two problems:
>
> 1. A natural join will join all columns with the same names in the two
> tables and that includes not only Tid but also dfName and since there
> are no rows that have the same Tid and dfName the result has zero
> rows.
>
> 2. the heuristic it uses fails when you retrieve the same column name
> from multiple tables so use method = "raw" to turn off the heuristic.
> The heuristic will be improved to cover this case in the future.
> Read FAQ #1 on the home page:
> http://code.google.com/p/sqldf/#1._How_does_sqldf_handle_classes_and_factors?
>
> This should work:
>
>> sqldf('select * from main.A join main.B using(Tid)', method = "raw")
>                     Tid dfName dfName
> 1  AES 01-01-02 11:53:00      a      b
> 2 AES 01-01-05\n10:58:00      a      b
> 3  AES 01-01-11 12:30:00      a      b
>
> This works too as the double dfName no longer exists to confuse the heuristic:
>
> names(B)[2] <- "dfNameB"
> sqldf('select * from main.A join main.B using(Tid)')
>
>
>
> On Thu, May 20, 2010 at 12:04 PM, Nick Switanek <nswitanek at gmail.com> wrote:
>> Hello,
>>
>> I'm having trouble discovering what's going wrong with my use of natural
>> joins via sqldf.
>>
>> Following the instructions under 4i at http://code.google.com/p/sqldf/,
>> which discusses creating indices to speed joins, I have been only unreliably
>> able to get natural joins to work.
>>
>> For example,
>>
>>> Tid <- c('AES 01-01-02 10:58:00', 'AES 01-01-02 11:53:00', 'AES 01-01-05
>> 10:58:00', 'AES 01-01-11 12:30:00')
>>> A <- data.frame(Tid, dfName = 'a')
>>> B <- data.frame(Tid = Tid[2:4], dfName = 'b')
>>> C <- data.frame(Tid = Tid[1:3], dfName = 'c')
>>
>> # then use the sqldf library
>>> library(sqldf)
>>> sqldf()
>>
>> # to create indices on the Tid variable shared across data.frames
>>> sqldf('create index indA on A(Tid)')
>>> sqldf('create index indB on B(Tid)')
>>> sqldf('create index indC on C(Tid)')
>>
>> # check to make sure everything is there
>>> sqldf('select * from sqlite_master')
>>
>> # doing a natural join (implicitly on Tid)
>> # does not give the expected joins
>>> sqldf('select * from main.A natural join main.B')
>> [1] Tid    dfName
>> <0 rows> (or 0-length row.names)
>>> sqldf('select * from main.A natural join main.C')
>> [1] Tid    dfName
>> <0 rows> (or 0-length row.names)
>>> sqldf('select * from main.B natural join main.C')
>> [1] Tid    dfName
>> <0 rows> (or 0-length row.names)
>>
>> # even using a where clause (which doesn't have the efficiency qualities I
>> need the indexed natural joins for) is problematic, setting values of the
>> dfName variable incorrectly for the data from C
>>> sqldf('select * from main.B b, main.C c where b.Tid = c.Tid')
>>                    Tid dfName                   Tid dfName
>> 1 AES 01-01-02 11:53:00      b AES 01-01-02 11:53:00      b
>> 2 AES 01-01-05 10:58:00      b AES 01-01-05 10:58:00      b
>>
>> I'm grateful for your guidance on what I'm doing wrong with the natural join
>> in sqldf.
>>
>> many thanks,
>> Nick
>