[R] combining data from different datasets
Barry Rowlingson
b.rowlingson at lancaster.ac.uk
Fri Oct 24 20:05:58 CEST 2008
2008/10/24 Gabor Grothendieck <ggrothendieck at gmail.com>:
> NA and "NA" are not the same:
>
>> DF <- data.frame(x = c("a", "NA", NA))
>> DF
> x
> 1 a
> 2 NA
> 3 <NA>
>>
>> is.na(NA)
> [1] TRUE
>> is.na("NA")
> [1] FALSE
Yes, but unless you tell it otherwise, read.table will treat the
country code for Namibia ("NA") as a missing value, even in a column
of alphabetic strings. With this in test.dat:
1,US
2,NA
3,UK
you get:
> read.table("test.dat",sep=",")
  V1   V2
1  1   US
2  2 <NA>
3  3   UK
So you think you can use na.strings? The catch is that na.strings
applies to every column, so changing it also stops the NA in a numeric
column from being read as missing, and that column comes back as a
Factor. Here's some data:
$ cat test.dat
1,US
2,NA
3,UK
NA,FR
4,PT
We need column 1 to be integer with an NA, and column 2 to be text
with a real "NA" and not a <NA>:
Try #1 (NAive effort) reads NA(mibia) as NA(missing), keeps V1 as integers:
> read.table("test.dat",sep=",")
  V1   V2
1  1   US
2  2 <NA>
3  3   UK
4 NA   FR
5  4   PT
= FAIL
Try #2 reads NAmibia okay, but reads V1 as factor:
> read.table("test.dat",sep=",",na.strings="")
  V1 V2
1  1 US
2  2 NA
3  3 UK
4 NA FR
5  4 PT
> str(read.table("test.dat",sep=",",na.strings=""))
'data.frame': 5 obs. of 2 variables:
$ V1: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 5 4
$ V2: Factor w/ 5 levels "FR","NA","PT",..: 5 2 4 1 3
= FAIL
Try #3: let's try colClasses. This keeps V1 numeric, but reads
NA(mibia) as NA(missing) again:
> read.table("test.dat",sep=",",colClasses=c("numeric","character"))
  V1   V2
1  1   US
2  2 <NA>
3  3   UK
4 NA   FR
5  4   PT
= FAIL
Try #4: so... let's specify both colClasses and na.strings:
> read.table("test.dat",sep=",",na.strings="",colClasses=c("numeric","character"))
  V1 V2
1  1 US
2  2 NA
3  3 UK
4 NA FR
5  4 PT
- looks good:
> str(read.table("test.dat",sep=",",na.strings="",colClasses=c("numeric","character")))
'data.frame': 5 obs. of 2 variables:
$ V1: num 1 2 3 NA 4
$ V2: chr "US" "NA" "UK" "FR" ...
= WIN!
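For anyone who wants to rerun Try #4 without creating test.dat by hand, here is a minimal sketch. The temporary file and the names path and dat are my own, not part of the original session, and it relies on the same read.table behaviour observed above, so treat it as a check rather than a guarantee across R versions:

```r
# Write the sample data to a temporary file (a stand-in for test.dat).
path <- tempfile(fileext = ".dat")
writeLines(c("1,US", "2,NA", "3,UK", "NA,FR", "4,PT"), path)

# na.strings = "" stops the text "NA" from being matched as missing;
# colClasses fixes the type of each column up front.
dat <- read.table(path, sep = ",", na.strings = "",
                  colClasses = c("numeric", "character"))

is.na(dat$V1)        # which rows of the numeric column are missing
dat$V2 == "NA"       # which rows of the text column hold the string "NA"
any(is.na(dat$V2))   # whether any text value was read as missing

unlink(path)         # remove the temporary file
```

If it behaves as in Try #4, only row 4 of V1 is missing and row 2 of V2 is the literal string "NA".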
I'm not certain how that works. My guess is that the NA in column 1
comes from the conversion to numeric, rather than from matching
against the na.strings parameter.
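One way to avoid depending on that guess is to do the conversion explicitly: read every column as text, then convert column 1 yourself. This is a sketch of an alternative, not a description of what read.table does internally. The names path and raw are my own, and suppressWarnings is there only in case the conversion warns:

```r
# Write the sample data to a temporary file (a stand-in for test.dat).
path <- tempfile(fileext = ".dat")
writeLines(c("1,US", "2,NA", "3,UK", "NA,FR", "4,PT"), path)

# Read both columns as plain text; with na.strings = "" the text "NA"
# is kept as an ordinary string in both columns.
raw <- read.table(path, sep = ",", na.strings = "",
                  colClasses = "character")

# Convert only column 1; the text "NA" becomes a missing value here.
raw$V1 <- suppressWarnings(as.numeric(raw$V1))

str(raw)
unlink(path)         # remove the temporary file
```

Column 2 is never converted, so its "NA" stays a string whatever happens to column 1.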
Barry