[R] a question on sqldf's handling of missing value and factor

Wed Mar 2 16:14:44 CET 2011

Dear Mr. Grothendieck :
thank you so much for your attention. You are the real expert here. the
following is a mock text file:
a	b	c
aa	 	23
aaa	34	 
aaaa	 	77

note that both b and c column contain missing value (blank)
I save it under my C drive and use both read.table and sqldf to import it to
R and then use identical() function to compare the result. The following is
the result:

> setwd("c:/")
> library(sqldf)
> test <- file("test.txt") 
> testx <- sqldf("select * from test", 
+                 dbname = tempfile(), file.format = list(header = T,
sep="\t", row.names = F))
> testy<- read.table("test.txt", header = T, sep="\t")
> identical(testx, testy)
[1] FALSE
> testx
     a    b    c
1   aa      23.0
2  aaa 34.6  0.0
3 aaaa      77.8
> testy
     a    b    c
1   aa   NA 23.0
2  aaa 34.6   NA
3 aaaa   NA 77.8
> class(testx$b)
[1] "factor"
> class(testy$b)
[1] "numeric"
> 

read.table seems to get it right while sqldf treats b as factor (if I add
method="raw", b become character). what is more troubling is that column C
has number 0 at the second row while in the original file it is missing. In
my real world situation with a much larger text file, the problem is that
many cells are empty when they all actually have values in the original text
file. 

I would greatly appreciate your help if you can shed some light on this.

thanks

--
View this message in context: http://r.789695.n4.nabble.com/a-question-on-sqldf-s-handling-of-missing-value-and-factor-tp3331007p3331662.html
Sent from the R help mailing list archive at Nabble.com.