[R] RSQLite slowness

Rasmus Liland jr@| @end|ng |rom po@teo@no
Wed Oct 6 18:23:15 CEST 2021


Dear r-help readers,

why is it so much slower to query an 
sqlite database using RSQlite «from the 
outside» using param like 

	statement <-
	 "SELECT * FROM gene2refseq
	  LEFT JOIN gene_info ON
	    gene_info.GeneID = gene2refseq.GeneID
	  WHERE gene2refseq.`RNA_nucleotide_accession.version` 
	    LIKE ?"
	db <- "gene_info.sqlite"
	conn <- DBI::dbConnect(RSQLite::SQLite(), db)
	x1 <- DBI::dbGetQuery(conn=conn, 
	  statement=statement, 
	  param=list(Håkan20210914$RNANucleotideAccession))

compared to querying «from the inside» 
of sqlite, by writing your search terms 
as a table first, and then calling it 

	statement <- 
	 "SELECT * FROM H
	  LEFT JOIN gene2refseq R ON
	    R.`RNA_nucleotide_accession.version` 
	    LIKE '%' || H.RNANucleotideAccession || '%'
	  LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
	DBI::dbWriteTable(conn, "H", Håkan20210914)
	x2 <- DBI::dbGetQuery(conn=conn, statement=statement)
	DBI::dbDisconnect(conn)

On my system (E5-2603 v4), the first 
query took more than an hour, while the 
second took only a few minutes ...  

Do you guys know of any faster (but also 
nice) way to dig around in very large 
tsv files like 
https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz 
and 
https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz 
?

Best,
Rasmus

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20211006/671429f1/attachment.sig>


More information about the R-help mailing list