[R] sqldf: issues with natural joins
Gabor Grothendieck
ggrothendieck at gmail.com
Thu May 20 18:32:12 CEST 2010
There are two problems:
1. A natural join will join all columns with the same names in the two
tables and that includes not only Tid but also dfName and since there
are no rows that have the same Tid and dfName the result has zero
rows.
2. the heuristic it uses fails when you retrieve the same column name
from multiple tables so use method = "raw" to turn off the heuristic.
The heuristic will be improved to cover this case in the future.
Read FAQ #1 on the home page:
http://code.google.com/p/sqldf/#1._How_does_sqldf_handle_classes_and_factors?
This should work:
> sqldf('select * from main.A join main.B using(Tid)', method = "raw")
Tid dfName dfName
1 AES 01-01-02 11:53:00 a b
2 AES 01-01-05\n10:58:00 a b
3 AES 01-01-11 12:30:00 a b
This works too as the double dfName no longer exists to confuse the heuristic:
names(B)[2] <- "dfNameB"
sqldf('select * from main.A join main.B using(Tid)')
On Thu, May 20, 2010 at 12:04 PM, Nick Switanek <nswitanek at gmail.com> wrote:
> Hello,
>
> I'm having trouble discovering what's going wrong with my use of natural
> joins via sqldf.
>
> Following the instructions under 4i at http://code.google.com/p/sqldf/,
> which discusses creating indices to speed joins, I have been only unreliably
> able to get natural joins to work.
>
> For example,
>
>> Tid <- c('AES 01-01-02 10:58:00', 'AES 01-01-02 11:53:00', 'AES 01-01-05
> 10:58:00', 'AES 01-01-11 12:30:00')
>> A <- data.frame(Tid, dfName = 'a')
>> B <- data.frame(Tid = Tid[2:4], dfName = 'b')
>> C <- data.frame(Tid = Tid[1:3], dfName = 'c')
>
> # then use the sqldf library
>> library(sqldf)
>> sqldf()
>
> # to create indices on the Tid variable shared across data.frames
>> sqldf('create index indA on A(Tid)')
>> sqldf('create index indB on B(Tid)')
>> sqldf('create index indC on C(Tid)')
>
> # check to make sure everything is there
>> sqldf('select * from sqlite_master')
>
> # doing a natural join (implicitly on Tid)
> # does not give the expected joins
>> sqldf('select * from main.A natural join main.B')
> [1] Tid dfName
> <0 rows> (or 0-length row.names)
>> sqldf('select * from main.A natural join main.C')
> [1] Tid dfName
> <0 rows> (or 0-length row.names)
>> sqldf('select * from main.B natural join main.C')
> [1] Tid dfName
> <0 rows> (or 0-length row.names)
>
> # even using a where clause (which doesn't have the efficiency qualities I
> need the indexed natural joins for) is problematic, setting values of the
> dfName variable incorrectly for the data from C
>> sqldf('select * from main.B b, main.C c where b.Tid = c.Tid')
> Tid dfName Tid dfName
> 1 AES 01-01-02 11:53:00 b AES 01-01-02 11:53:00 b
> 2 AES 01-01-05 10:58:00 b AES 01-01-05 10:58:00 b
>
> I'm grateful for your guidance on what I'm doing wrong with the natural join
> in sqldf.
>
> many thanks,
> Nick
More information about the R-help
mailing list