[R] sqldf: issues with natural joins

Thu May 20 20:18:10 CEST 2010

They work on any join that is able to make use of them.   If you
preface the select statement with explain query plan then it will give
you some info, e.g.

> sqldf('explain query plan select * from main.A natural join main.B')
  order from                  detail
1     0    0                 TABLE A
2     1    1 TABLE B WITH INDEX indB

Note that its using the index on B but not the index on A so there was
actually no point in adding that one if this were the only query.

This is potentially a large topic so see the sqlite.org site and
mailing list for more info.

On Thu, May 20, 2010 at 1:28 PM, Nick Switanek <nswitanek at gmail.com> wrote:
> Thank you very much for these clarifying responses, Gabor.
>
> I had mistakenly assumed that creating the index on Tid restricted the
> natural join to joining on Tid. Can you describe when and how indices speed
> up joins, or can you point me to resources that address this? Is it only for
> natural joins or any joins (including, say, a select statement with where
> clause)?
>
> thanks,
> nick
>
>
> On Thu, May 20, 2010 at 11:42 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>>
>> Although that works I had meant to write:
>>
>> > names(B)[2] <- "dfNameB"
>> > # ... other commands
>> > sqldf('select * from main.A natural join main.B')
>>
>> so that now only Tid is in common so the natural join just picks it up
>> and also the heuristic works again since we no longer retrieve
>> duplicate column names.
>>
>> On Thu, May 20, 2010 at 12:32 PM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>> > There are two problems:
>> >
>> > 1. A natural join will join all columns with the same names in the two
>> > tables and that includes not only Tid but also dfName and since there
>> > are no rows that have the same Tid and dfName the result has zero
>> > rows.
>> >
>> > 2. the heuristic it uses fails when you retrieve the same column name
>> > from multiple tables so use method = "raw" to turn off the heuristic.
>> > The heuristic will be improved to cover this case in the future.
>> > Read FAQ #1 on the home page:
>> >
>> > http://code.google.com/p/sqldf/#1._How_does_sqldf_handle_classes_and_factors?
>> >
>> > This should work:
>> >
>> >> sqldf('select * from main.A join main.B using(Tid)', method = "raw")
>> >                     Tid dfName dfName
>> > 1  AES 01-01-02 11:53:00      a      b
>> > 2 AES 01-01-05\n10:58:00      a      b
>> > 3  AES 01-01-11 12:30:00      a      b
>> >
>> > This works too as the double dfName no longer exists to confuse the
>> > heuristic:
>> >
>> > names(B)[2] <- "dfNameB"
>> > sqldf('select * from main.A join main.B using(Tid)')
>> >
>> >
>> >
>> > On Thu, May 20, 2010 at 12:04 PM, Nick Switanek <nswitanek at gmail.com>
>> > wrote:
>> >> Hello,
>> >>
>> >> I'm having trouble discovering what's going wrong with my use of
>> >> natural
>> >> joins via sqldf.
>> >>
>> >> Following the instructions under 4i at http://code.google.com/p/sqldf/,
>> >> which discusses creating indices to speed joins, I have been only
>> >> unreliably
>> >> able to get natural joins to work.
>> >>
>> >> For example,
>> >>
>> >>> Tid <- c('AES 01-01-02 10:58:00', 'AES 01-01-02 11:53:00', 'AES
>> >>> 01-01-05
>> >> 10:58:00', 'AES 01-01-11 12:30:00')
>> >>> A <- data.frame(Tid, dfName = 'a')
>> >>> B <- data.frame(Tid = Tid[2:4], dfName = 'b')
>> >>> C <- data.frame(Tid = Tid[1:3], dfName = 'c')
>> >>
>> >> # then use the sqldf library
>> >>> library(sqldf)
>> >>> sqldf()
>> >>
>> >> # to create indices on the Tid variable shared across data.frames
>> >>> sqldf('create index indA on A(Tid)')
>> >>> sqldf('create index indB on B(Tid)')
>> >>> sqldf('create index indC on C(Tid)')
>> >>
>> >> # check to make sure everything is there
>> >>> sqldf('select * from sqlite_master')
>> >>
>> >> # doing a natural join (implicitly on Tid)
>> >> # does not give the expected joins
>> >>> sqldf('select * from main.A natural join main.B')
>> >> [1] Tid    dfName
>> >> <0 rows> (or 0-length row.names)
>> >>> sqldf('select * from main.A natural join main.C')
>> >> [1] Tid    dfName
>> >> <0 rows> (or 0-length row.names)
>> >>> sqldf('select * from main.B natural join main.C')
>> >> [1] Tid    dfName
>> >> <0 rows> (or 0-length row.names)
>> >>
>> >> # even using a where clause (which doesn't have the efficiency
>> >> qualities I
>> >> need the indexed natural joins for) is problematic, setting values of
>> >> the
>> >> dfName variable incorrectly for the data from C
>> >>> sqldf('select * from main.B b, main.C c where b.Tid = c.Tid')
>> >>                    Tid dfName                   Tid dfName
>> >> 1 AES 01-01-02 11:53:00      b AES 01-01-02 11:53:00      b
>> >> 2 AES 01-01-05 10:58:00      b AES 01-01-05 10:58:00      b
>> >>
>> >> I'm grateful for your guidance on what I'm doing wrong with the natural
>> >> join
>> >> in sqldf.
>> >>
>> >> many thanks,
>> >> Nick
>> >
>
>