[R] Data type problem when extract data from SQLite to R by using RSQLite

chen jia chen_1002 at fisher.osu.edu
Tue Mar 1 03:57:13 CET 2011


Hi Seth,

Thanks for the reply. I provide info from sessionInfo() and about
schema that you ask. Please take a look.

The output from sessionInfo() is
> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UT> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_1.2.1     RSQLite_0.9-2  DBI_0.2-5      filehash_2.1-1F-8
  LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_1.2.1     RSQLite_0.9-2  DBI_0.2-5      filehash_2.1-1

The .schema of table annual_data3 is
sqlite> .schema annual_data3
CREATE TABLE "annual_data3"(
  PERMNO INT,
  DATE INT,
  CUSIP TEXT,
  EXCHCD INT,
  SICCD INT,
  SHROUT INT,
  PRC REAL,
  RET REAL,
  ...
  pret_var,
  pRET_sd,
  nmret,
  pya_var,
  pya_sd,
  nya,
  pya_var_ebi,
  pya_sd_ebi,
  pya_var_ebit,
  pya_sd_ebit,
  pya_var_ebitda,
  pya_sd_ebitda,
  logage REAL,
  logasset REAL,
  ...
  loglead1stdaret, loglead2stdaret)

Table annual_data3 is created by joining table annual_data2 and
ya_vol. The column pya_var is initially in ya_vol.

dbGetQuery(sql.industry,
           "create table annual_data3 as
            select a.*, b.pya_var, b.pya_sd,
                   b.nya, b.pya_var_ebi, b.pya_sd_ebi,
                   b.pya_var_ebit, b.pya_sd_ebit,
                   b.pya_var_ebitda, b.pya_sd_ebitda
            from   annual_data2 as a left join
                      ya_vol as b
            on     a.permno = b.permno and
                     a.year = b.year
            order by permno, year")

Table ya_vol is created by
dbGetQuery(sql.industry,
           "create table ya_vol as
            select PERMNO, year,
                   variance(ya) as pya_var,
                   stdev(ya) as pya_sd,
                   count(*) as nya,
                   variance(ya_ebi) as pya_var_ebi,
                   stdev(ya_ebi) as pya_sd_ebi,
                   variance(ya_ebit) as pya_var_ebit,
                   stdev(ya_ebit) as pya_sd_ebit,
                   variance(ya_ebitda) as pya_var_ebitda,
                   stdev(ya_ebitda) as pya_sd_ebitda
            from   past_ya
            where  ya is not null
            group by PERMNO, year
            order by PERMNO, year")

The schema info of ya_vol is
sqlite> .schema ya_vol
CREATE TABLE ya_vol(
  PERMNO INT,
  year INT,
  pya_var,
  pya_sd,
  nya,
  pya_var_ebi,
  pya_sd_ebi,
  pya_var_ebit,
  pya_sd_ebit,
  pya_var_ebitda,
  pya_sd_ebitda
);
CREATE INDEX ya_vol_permno_year_idx
            on ya_vol (permno,year);

Interestingly, I find that the problem I reported does not for columns
labeled real in the schema info. For example, the type of column RET
never changes no matter what the first observation is.

> str(dbGetQuery(sql.industry,
+                "select RET from annual_data3
+                 where RET is not null limit 5"))
'data.frame':	5 obs. of  1 variable:
 $ RET: num  -0.03354 -0.02113 0.03797 0.0013 -0.00678
>
> str(dbGetQuery(sql.industry,
+                "select RET from annual_data3
+                 where RET is null limit 5"))
'data.frame':	5 obs. of  1 variable:
 $ RET: num  NA NA NA NA NA
> sapply(dbGetQuery(sql.industry,
+                "select RET from annual_data3
+                 where RET is null limit 5"),
+        typeof)
     RET
"double"
> sapply(dbGetQuery(sql.industry,
+                "select RET from annual_data3
+                 where RET is not null limit 5"),
+        typeof)
     RET
"double"

I still don't know how to solve this problem for variable pya_var,
please help. Thanks.

Best,
Jia


On Mon, Feb 28, 2011 at 6:48 PM, Seth Falcon <seth at userprimary.net> wrote:
> Hi Jia,
>
> On Mon, Feb 28, 2011 at 12:37 PM, chen jia <chen_1002 at fisher.osu.edu> wrote:
>> When I extract data from SQLite to R, the data types (or modes) of the
>> extracted data seems to be determined by the value of the first row.
>> Please see the following example.
>
> It would help to provide the output of sessionInfo() as well as the
> schema definition for the table in SQLite (or at least description of
> how it was created).
>
> Here's an example that works as you'd like:
>
>    > library(RSQLite)
>    > db = dbConnect(SQLite(), dbname = ":memory:")
>    > dbGetQuery(db, "create table t (a int, b real, c text)")
>    > df = data.frame(a=c(NA, 1L, 2L), b=c(NA, 1.1, 2.2), c=c(NA, "x",
> "y"),stringsAsFactors=FALSE)
>    > df
>       a   b    c
>    1 NA  NA <NA>
>    2  1 1.1    x
>    3  2 2.2    y
>    > dbGetPreparedQuery(db, "insert into t values (?, ?, ?)", df)
>    > dbGetQuery(db, "select * from t")
>       a   b    c
>    1 NA  NA <NA>
>    2  1 1.1    x
>    3  2 2.2    y
>    > sapply(dbGetQuery(db, "select * from t"), typeof)
>              a           b           c
>      "integer"    "double" "character"
>    > sapply(dbGetQuery(db, "select * from t limit 1"), typeof)
>              a           b           c
>      "integer"    "double" "character"
>    > sapply(dbGetQuery(db, "select a from t limit 1"), typeof)
>            a
>    "integer"
>    > sapply(dbGetQuery(db, "select a from t limit 2"), typeof)
>            a
>    "integer"
>    > sapply(dbGetQuery(db, "select a from t limit 1"), typeof)
>            a
>    "integer"
>
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] RSQLite_0.9-4 DBI_0.2-5
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.1
>
>
>
>
> --
> Seth Falcon | @sfalcon | http://userprimary.net/
>



-- 
700 Fisher Hall
2100 Neil Ave.
Columbus, Ohio  43210
http://www.fisher.osu.edu/~chen_1002/



More information about the R-help mailing list