[Rd] RSQLite indexing
Kasper Daniel Hansen
khansen at stat.Berkeley.EDU
Tue Oct 23 03:07:18 CEST 2007
On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:
>
> I am trying to use RSQLite for storing data and I need to create
> indexes on
> two variables in the table. It appears from searching the web that
> the CREATE
> INDEX operation in SQLite is relatively slow for large files, and
> this has been
> my experience as well.
>
> The two index variables are crossed. One has about 350,000 levels
> [yes, it's
> genetic association data]. The other will have about 4000 levels
> eventually,
> but is up to about 100 now. When the data were entered they were
> already ordered by this second index variable.
>
> Creating the index took about an hour on the 100-level, presorted
> variable and about 12 hours on the 350,000-level unsorted
> variable. I'm looking for advice on how to reduce this. Specifically
> 1/ would it be faster if the variable with more levels was the
> presorted one?
> 2/ would it be faster or slower if the index were created before
> adding all the data?
> 3/ are there any options that can be set to speed up the indexing?
>
> The SQLite database will not be the primary archive for the data,
> so optimizations that are risky in the case of power loss or
> hardware failure are still acceptable. Since Bioconductor seems to
> use SQLite a lot I'm hoping there is some simple solution.
I have not used RSQLite, but have some experience doing this thing
for big sqlite databases using the command line client. Every
database in sqlite has a number of parameters associated with it. You
want to make sure that cache_size is at _most_ 2000 (yes, I know this
is totally counterintuitive as it tells sqlite to use as little
memory as possible). You also tell it to be non-synchronous. In
sqlite the commands are
sqlite> pragma default_cache_size = 2000
sqlite> pragma_synchronous = off
You can test the setting of these parameters by just doing a
sqlite> pragma default_cache_size
As far as I remember, cache size can only be set when you create the
database. I have no idea how RSQlite handles it.
When I asked about this problem on the sqlite mailing list, the
sqlite-creator said that this was a "locality of reference problem"
and that it was being "worked on". And that I could search the
archives for more info (which did not help me back then).
I don't know whether or not sorting helps.
Another thing to do is to check in what amount sqlite sits idle while
doing I/O. It is probably impossible to avoid some idleness with such
a thing, but it should of course be kept to a minimum.
It is true that some of the other databases are probably much faster
at creating indices. But in the post-index analysis, sqlite is a
really fast database, probably amongst the fastest there is. It does
not do a good job a converting your queries into smart queries, but
if you are doing something simple, it is blazingly fast with the
right user options.
Kasper
More information about the R-devel
mailing list