[R] Fast update of a lot of records in a database?

McGehee, Robert Robert.McGehee at geodecapital.com
Fri May 19 22:13:20 CEST 2006


Hadley,
There are several reasons that running one large load and one large
update would be significantly faster than thousands of individual
updates. 

First, the time it takes to execute a query does not grow linearly with
the size of a query. That is, the statement: "SELECT TOP 100 * FROM
table" takes about 1.8 times as long as "SELECT TOP 10 * FROM table",
not 10 times longer (using an estimated query cost on tables in my
database using MS-SQL). The reason is that SQL is optimized to perform
well for large queries, and many of the steps used in a SQL operation
are needlessly repeated when multiple UPDATE/SELECT statements are given
rather than one large UPDATE/SELECT.

For instance, on most SQL UPDATES, the most time is spent primarily on
1) Sorting the input, 2) performing a clustered index seek, and 3)
performing a clustered index update. In a toy example using UPDATE, the
physical operation of sorting 2000 rows takes only 6 times longer than
sorting a little over 100 rows. The clustered index seek and update take
about 10 times longer. This, however, is far less than the 20x longer we
would expect from doing a linear row-by-row update. So even if it takes
an additional 50% longer to first load the data into a temporary table,
we still see the opportunity for large speed increases.

A second reason we would expect one large query to run faster is that it
is much easier to parallel process on multiple processors. That is, one
processor can be joining the tables while a second processor
simultaneously is performing clustered indexing. For a bunch of single
UPDATE statements, we are forced to run the operation in serial. Thus,
if the above examples were more complicated, we should expect an even
larger cost savings / row.

>From your example, a third reason is that in multiple updates, the SQL
server (at least my MS-SQL server) updates the transaction log after
every query (unless you wisely run in batch mode as Duncan did with his
BEGIN/COMMIT syntax), and thus significant more I/O time is spent
between each UPDATE statement. For instance, the RODBC sqlUpdate
function does not take advantage of transaction control, so to speed up
long queries, I've often resorted to sending over temporary tables (as I
suggested here), stored procedures, or even data stored as XML tables.

Lastly, removing your indices before the update would likely only slow
down the query. If the table is not indexed on the id key, then the SQL
server has to search through the entire table to find the matching id
before it can be updated. It would be like searching through a
dictionary that wasn't in alphabetical order. That said, indices can
slow down queries when a significant number of rows are being added, as
you then have to reindex the table when the insert completes.  However,
Duncan isn't doing that here.

Best,
Robert

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of hadley wickham
Sent: Friday, May 19, 2006 3:20 PM
To: Duncan Murdoch
Cc: Robert.McGehee at geodecapital.com; r-help at stat.math.ethz.ch;
br44114 at gmail.com
Subject: Re: [R] Fast update of a lot of records in a database?

> put the updates into a temporary table called updates
>
> UPDATE bigtable AS a
>    FROM updates AS b
>    WHERE a.id = b.id
>    SET a.col1 = b.col1

I don't think this will be any faster - why would creating a new table
be faster than updating existing rows?

I've never had a problem with using large numbers of SQL update
statements (in the order of hundreds of thousands) to update a table
and having them complete in a reasonable time (a few minutes).  How
heavily indexed is the field you are updating?  You may be able to get
some speed improvements by turning off indices before the update and
back on again afterwards (highly dependent on your database system
though).

I would strongly suspect your bottleneck lies elsewhere (eg. when
generating the statements in R, or using ODBC to send them)

Hadley

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html



More information about the R-help mailing list