[R-sig-DB] Parameterised queries

Tim Keitt tke|tt @end|ng |rom utex@@@edu
Thu Feb 12 15:56:10 CET 2015


On Thu, Feb 12, 2015 at 8:20 AM, Ian Gow <iandgow using gmail.com> wrote:

> Is there merit in using PostgreSQL's arrays to send data from R to
> PostgreSQL? (Caveat: I say this having no idea how PL/R gets vectors from R
> to PostgreSQL.)
>

Not in terms of performance. Communication between the client and server is
one byte at a time. The reason inserts are slow is that the server does a
lot of validation, updates indices, triggers, etc. It is also a lot safer
than copy, which just writes whatever it sees to the fields.

If you want to see how things are copied byte-by-byte, you can trace the
communication between R and PostgreSQL in 'rpg'.


>
> PL/R is able to pull vectors in from R, though RPostgreSQL cannot pulls
> arrays from PostgreSQL as vectors (though it would be great to have this
> feature).
>

I've forgotten how DBI works, but an array should just come through in text
form and you can parse it how ever you want on the R side. You can also
send an array by converting it to a string of the appropriate format.

rpg has a 'format_for_send' method that you can dispatch on any type to
convert it to something PostgreSQL can read. I'm working on something for
formatting strings returned from the server.

THK


>
> The second and third approaches below do not differ in performance.
>
> # Getting data one vector at a time
>
> DROP FUNCTION IF EXISTS get_int_vector();
>
> CREATE OR REPLACE FUNCTION get_int_vector()
>   RETURNS integer[] AS
> $BODY$
>      sample(1:10)
> $BODY$ LANGUAGE plr;
>
> DROP FUNCTION IF EXISTS get_str_vector();
>
> CREATE OR REPLACE FUNCTION get_str_vector()
>   RETURNS text[] AS
> $BODY$
>      sample(letters[1:10])
> $BODY$ LANGUAGE plr;
>
> DROP FUNCTION IF EXISTS get_date_vector();
>
> -- For some reason, dates get converted to integers.
> CREATE OR REPLACE FUNCTION get_date_vector()
>   RETURNS date[] AS
> $BODY$
>      as.character(seq(from=as.Date("1971-01-01"), by=1, length.out = 10))
> $BODY$ LANGUAGE plr;
>
>
> WITH raw_data AS (
>     SELECT get_int_vector() AS ints,
>         get_str_vector() AS strs,
>         get_date_vector() AS dates)
> SELECT
>     UNNEST(ints) AS int,
>     UNNEST(strs) AS str,
>     UNNEST(dates) AS date
> FROM raw_data;
>
> ## Via RPostgreSQL
>
> system.time({
>     n <- 1e6
>     df <- data.frame(int=sample(1:10, size=n, replace=TRUE),
>                      str=sample(letters[1:10], size=n, replace=TRUE),
>
>  date=as.character(sample(seq(from=as.Date("1971-01-01"),
>                                                   by=1, length.out = 10),
>                                               size=n, replace=TRUE)),
>                      stringsAsFactors=FALSE)
>
>     library("RPostgreSQL")
>
>     pg <- dbConnect(PostgreSQL())
>
>     dbWriteTable(pg, "rpostgresql_test", df, row.names=FALSE,
> overwrite=TRUE)
> })
>
> ## Using composite type created with table
>
> DROP TABLE IF EXISTS plr_test CASCADE;
>
> CREATE TABLE plr_test (int integer, str text, date date);
>
> CREATE OR REPLACE FUNCTION get_df()
>   RETURNS SETOF plr_test AS
> $BODY$
>     n <- 1e6
>     df <- data.frame(int=sample(1:10, size=n, replace=TRUE),
>                     str=sample(letters[1:10], size=n, replace=TRUE),
>
> date=as.character(sample(seq(from=as.Date("1971-01-01"),
>                                                 by=1, length.out = 10),
>                                              size=n, replace=TRUE)),
>                     stringsAsFactors=FALSE)
>     return(df)
> $BODY$ LANGUAGE plr;
>
> INSERT INTO plr_test
> SELECT * FROM get_df();
>
>
> > On Feb 12, 2015, at 6:00 AM, r-sig-db-request using r-project.org wrote:
> >
> > Send R-sig-DB mailing list submissions to
> >       r-sig-db using r-project.org
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> >       https://stat.ethz.ch/mailman/listinfo/r-sig-db
> > or, via email, send a message with subject or body 'help' to
> >       r-sig-db-request using r-project.org
> >
> > You can reach the person managing the list at
> >       r-sig-db-owner using r-project.org
> >
> > When replying, please edit your Subject line so it is more specific
> > than "Re: Contents of R-sig-DB digest..."
> >
> >
> > Today's Topics:
> >
> >   1. Re: Parameterised queries (Tim Keitt)
> >   2. Re: Parameterised queries (Hadley Wickham)
> >   3. Re: Parameterised queries (Tim Keitt)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Wed, 11 Feb 2015 15:05:08 -0600
> > From: Tim Keitt <tkeitt using utexas.edu>
> > To: Hadley Wickham <h.wickham using gmail.com>
> > Cc: "r-sig-db using r-project.org" <r-sig-db using r-project.org>
> > Subject: Re: [R-sig-DB] Parameterised queries
> > Message-ID:
> >       <CANnL8gr1MCX-LFjFHWyoQSg-DD6Tetbp0iBGx0=
> Ht2LVH1Vp4A using mail.gmail.com>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > On Wed, Feb 11, 2015 at 2:41 PM, Hadley Wickham <h.wickham using gmail.com>
> wrote:
> >
> >>>> It gives a new attack vector - to introduce additional data into the
> >>>> database, you just need to figure out how to turn a length 1 vector in
> >>>> to a length 2 vector.
> >>>>
> >>>> It's dangerous in the same way that allowing dbGetQuery() to execute
> >>>> multiple queries is dangerous.
> >>>
> >>> I'd rather hope that if it were a case that mattered, the user would
> not
> >>> rely on the api as a substitute for appropriate checks.
> >>
> >> I think the API should be as safe as possible by default, and
> >> sacrificing safety for speed should only be done explicitly when the
> >> user asks for it.
> >>
> >
> > My use cases are not so sensitive, but I agree with the general idea.
> Also,
> > you really do not gain much over regular looping as inserts are really
> > slow, at least in postgresql.
> >
> > THK
> >
> >
> >>
> >> Hadley
> >>
> >> --
> >> http://had.co.nz/
> >>
> >
> >
> >
> > --
> > http://www.keittlab.org/
> >
> >       [[alternative HTML version deleted]]
> >
> >
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Wed, 11 Feb 2015 15:07:45 -0600
> > From: Hadley Wickham <h.wickham using gmail.com>
> > To: Tim Keitt <tkeitt using utexas.edu>
> > Cc: "r-sig-db using r-project.org" <r-sig-db using r-project.org>
> > Subject: Re: [R-sig-DB] Parameterised queries
> > Message-ID:
> >       <CABdHhvE8nJ-0gs=
> cTxCQNPifdrnBjy-ZAo1d6aWLr7W0Uw8Ddg using mail.gmail.com>
> > Content-Type: text/plain; charset=UTF-8
> >
> >>> I think the API should be as safe as possible by default, and
> >>> sacrificing safety for speed should only be done explicitly when the
> >>> user asks for it.
> >>
> >> My use cases are not so sensitive, but I agree with the general idea.
> Also,
> >> you really do not gain much over regular looping as inserts are really
> slow,
> >> at least in postgresql.
> >
> > Yes, I'm just working on that for RPostgres. Parameterised inserts are
> > actually slower than sending one giant SQL string. RPostgreSQL uses
> > COPY ... FROM STDIN (and manually converts the data frame into a
> > single c string) for better performance
> >
> > Hadley
> >
> > --
> > http://had.co.nz/
> >
> >
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Wed, 11 Feb 2015 15:11:23 -0600
> > From: Tim Keitt <tkeitt using utexas.edu>
> > To: Hadley Wickham <h.wickham using gmail.com>
> > Cc: "r-sig-db using r-project.org" <r-sig-db using r-project.org>
> > Subject: Re: [R-sig-DB] Parameterised queries
> > Message-ID:
> >       <
> CANnL8gpVrM-_HoZMZJi2d2ARAfpEv97uV9JjKxwNpTDaGfDY2Q using mail.gmail.com>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > On Wed, Feb 11, 2015 at 3:07 PM, Hadley Wickham <h.wickham using gmail.com>
> wrote:
> >
> >>>> I think the API should be as safe as possible by default, and
> >>>> sacrificing safety for speed should only be done explicitly when the
> >>>> user asks for it.
> >>>
> >>> My use cases are not so sensitive, but I agree with the general idea.
> >> Also,
> >>> you really do not gain much over regular looping as inserts are really
> >> slow,
> >>> at least in postgresql.
> >>
> >> Yes, I'm just working on that for RPostgres. Parameterised inserts are
> >> actually slower than sending one giant SQL string. RPostgreSQL uses
> >> COPY ... FROM STDIN (and manually converts the data frame into a
> >> single c string) for better performance
> >>
> >
> > I put copy_from and copy_to in rpg. I punted on the C interface and
> simply
> > shell out to psql in that case. Works fine with read.csv. and write.csv.
> >
> > THK
> >
> >
> >>
> >> Hadley
> >>
> >> --
> >> http://had.co.nz/
> >>
> >> _______________________________________________
> >> R-sig-DB mailing list -- R Special Interest Group
> >> R-sig-DB using r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/r-sig-db
> >>
> >
> >
> >
> > --
> > http://www.keittlab.org/
> >
> >       [[alternative HTML version deleted]]
> >
> >
> >
> > ------------------------------
> >
> > Subject: Digest Footer
> >
> > _______________________________________________
> > R-sig-DB mailing list
> > R-sig-DB using r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-db
> >
> >
> > ------------------------------
> >
> > End of R-sig-DB Digest, Vol 119, Issue 5
> > ****************************************
>
> _______________________________________________
> R-sig-DB mailing list -- R Special Interest Group
> R-sig-DB using r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-db
>



-- 
http://www.keittlab.org/

	[[alternative HTML version deleted]]




More information about the R-sig-DB mailing list