[R] reshape2's dcast() Adds NAs to Data Frame
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Thu Aug 9 08:03:35 CEST 2012
I took a closer look, and unused factor levels are not the problem. The
problem is defining the id variables appropriately.
1) "sample" is the name of a builtin function, so it is not advisable to
use it as the name of a data object.
I have used "samp" instead of "sample" below.
2) Your input data is essentially in long form already, so you don't need
to melt it.
3) It is almost never a good idea to use a floating point column as an id
variable.
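To illustrate point 3, here is a minimal base R sketch. The values x and y
are made up for illustration and are not taken from your data. They show
that two numbers which print the same can still differ in their last bits,
so anything that groups rows by exact equality treats them as different
ids:

```r
# Hypothetical values: both "should" be 0.3, but are computed differently.
x <- 0.1 + 0.2
y <- 0.3

# Exact comparison fails because of floating point rounding.
same_exact <- (x == y)

# Comparison with a tolerance succeeds.
same_approx <- isTRUE(all.equal(x, y))

# unique() compares exactly, so it sees two distinct values.
n_distinct <- length(unique(c(x, y)))

print(c(same_exact = same_exact, same_approx = same_approx))
print(n_distinct)
```

Columns such as site, sampdate and era do not have this problem, which is
one reason they make better id variables than floor or ceiling.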
Perhaps you were imagining something like:
> samp.cast <- dcast(samp[,1:5], site+sampdate+era~param,
value.var="quant" )
> str(samp.cast)
'data.frame': 35 obs. of 57 variables:
$ site : Factor w/ 5 levels "D-1","D-2","D-3",..: 1 1 1 2 2 2 2 2 2 2
...
$ sampdate: Date, format: "2007-12-12" "2008-03-15" "2009-09-02"
"2010-06-10" ...
$ era : Factor w/ 2 levels "Post","Pre": 1 1 1 1 1 1 1 1 1 1 ...
$ AgDis : num NA NA NA NA NA NA NA NA NA NA ...
$ AgTot : num 0.00013 0.00013 0.00013 0.00013 0.00013 0.00013 0.00013
0.00013 0.00013 0.00013 ...
$ AlDis : num NA NA NA NA NA NA NA NA NA NA ...
$ AlTot : num 0.106 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 ...
$ Alk : num 231 228 208 217 226 214 194 187 179 188 ...
$ AsDis : num NA NA NA NA NA NA NA NA NA NA ...
$ AsTot : num 0.0113 0.0008 0.0008 0.0017 0.0027 0.0007 0.0022 0.0029
0.0023 0.0027 ...
$ BaDis : num NA NA NA NA NA NA NA NA NA NA ...
$ BeDis : num NA NA NA NA NA NA NA NA NA NA ...
$ BeTot : num 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005
0.005 ...
$ BiDis : num NA NA NA NA NA NA NA NA NA NA ...
$ CaDis : num NA NA NA NA NA NA NA NA NA NA ...
$ CaTot : num 100 88.4 163 200 244 0.04 122 112 98.4 103 ...
$ CdDis : num NA NA NA NA NA NA NA NA NA NA ...
$ CdTot : num 2e-04 2e-04 2e-04 2e-04 2e-04 2e-04 2e-04 2e-04 2e-04
2e-04 ...
$ ClTot : num 1.43 1.34 13.7 16.8 19.1 15.1 10.9 9.37 8.49 10.4 ...
$ CoDis : num NA NA NA NA NA NA NA NA NA NA ...
$ CrDis : num 0.006 0.006 0.006 0.006 0.006 0.006 0.006 0.006 0.006
0.006 ...
$ CrTot : num NA NA NA NA NA NA NA NA NA NA ...
$ CuDis : num NA NA NA NA NA NA NA NA NA NA ...
$ CuTot : num 0.0239 0.0137 0.0015 0.00106 0.00106 0.00353 0.00108
0.009 0.00236 0.00144 ...
$ DO : num 4.96 9.91 6.98 6.2 6.47 5.73 5.84 5.74 6.12 6.39 ...
$ FeDis : num NA NA NA NA NA NA NA NA NA NA ...
$ FeTot : num 4.11 0.309 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.384 ...
$ HgDis : num NA NA NA NA NA NA NA NA NA NA ...
$ HgTot : num NA 5.00e-05 5.00e-05 7.22e-07 1.93e-06 6.82e-07
6.56e-07 1.06e-06 1.41e-06 2.58e-05 ...
$ MgDis : num NA NA NA NA NA NA NA NA NA NA ...
$ MgTot : num 9.56 9.15 14.6 22.4 27 0.06 13.7 12.8 11 11.4 ...
$ MnDis : num NA NA NA NA NA NA NA NA NA NA ...
$ MnTot : num 0.0348 0.0474 0.0231 0.004 0.004 0.004 0.004 0.004
0.004 0.0049 ...
$ MoDis : num NA NA NA NA NA NA NA NA NA NA ...
$ N : num 0.293 0.05 15.8 41.2 54.7 34.5 16.7 13.9 10.4 11.9 ...
$ NH4 : num 0.97 0.82 0.036 0.03 0.06 0.03 0.034 0.045 0.03 0.031
...
$ NaDis : num NA NA NA NA NA NA NA NA NA NA ...
$ NiDis : num NA NA NA NA NA NA NA NA NA NA ...
$ NiTot : num 0.01 0.224 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
$ PbDis : num NA NA NA NA NA NA NA NA NA NA ...
$ PbTot : num 0.0253 0.0083 0.00596 0.0003 0.0003 0.0003 0.0003
0.00129 0.0003 0.000599 ...
$ Pdis : num NA NA NA NA NA NA NA NA NA NA ...
$ SC : num 630 633 853 1129 1303 ...
$ SO4 : num 65.8 75.4 159 226 268 167 101 83.3 69.9 61.3 ...
$ SbDis : num 0.000825 0.000825 0.000825 0.000825 0.000825 0.000825
0.000825 0.000825 0.000825 0.000825 ...
$ SbTot : num NA NA NA NA NA NA NA NA NA NA ...
$ SeDis : num NA NA NA NA NA NA NA NA NA NA ...
$ SeTot : num 0.00132 0.00122 0.00125 0.00181 0.00131 0.00114 0.00125
0.00125 0.00125 0.00138 ...
$ SrDis : num NA NA NA NA NA NA NA NA NA NA ...
$ TDS : num 320 300 581 822 1020 662 507 418 335 385 ...
$ TSS : num NA NA NA NA NA NA NA NA NA NA ...
$ TlDis : num NA NA NA NA NA NA NA NA NA NA ...
$ TlTot : num 3e-04 3e-04 3e-04 3e-04 3e-04 3e-04 3e-04 3e-04 3e-04
3e-04 ...
$ Vdis : num NA NA NA NA NA NA NA NA NA NA ...
$ ZnDis : num NA NA NA NA NA NA NA NA NA NA ...
$ ZnTot : num 11.4 12.4 2.42 0.0406 0.0462 0.0318 0.0179 0.032 0.0178
0.0362 ...
$ pH : num 7.8 7.94 6.9 7.18 6.8 7.09 7.24 7.09 7.49 7.46 ...
Clearly, this still includes NA values, but if we look at the input data
corresponding to the first row:
> subset(samp,(site=="D-1")&("2007-12-12"==sampdate)&("Post"==era))
site sampdate era param quant ceneq1 floor ceiling
1 D-1 2007-12-12 Post AgTot 1.30e-04 TRUE 0.00e+00 1.30e-04
2 D-1 2007-12-12 Post AlTot 1.06e-01 FALSE 1.06e-01 1.06e-01
3 D-1 2007-12-12 Post Alk 2.31e+02 FALSE 2.31e+02 2.31e+02
4 D-1 2007-12-12 Post AsTot 1.13e-02 FALSE 1.13e-02 1.13e-02
5 D-1 2007-12-12 Post BeTot 5.00e-03 TRUE 0.00e+00 5.00e-03
6 D-1 2007-12-12 Post CaTot 1.00e+02 FALSE 1.00e+02 1.00e+02
7 D-1 2007-12-12 Post CdTot 2.00e-04 TRUE 0.00e+00 2.00e-04
8 D-1 2007-12-12 Post ClTot 1.43e+00 FALSE 1.43e+00 1.43e+00
9 D-1 2007-12-12 Post CrDis 6.00e-03 TRUE 0.00e+00 6.00e-03
10 D-1 2007-12-12 Post CuTot 2.39e-02 FALSE 2.39e-02 2.39e-02
11 D-1 2007-12-12 Post DO 4.96e+00 FALSE 4.96e+00 4.96e+00
12 D-1 2007-12-12 Post FeTot 4.11e+00 FALSE 4.11e+00 4.11e+00
13 D-1 2007-12-12 Post MgTot 9.56e+00 FALSE 9.56e+00 9.56e+00
14 D-1 2007-12-12 Post MnTot 3.48e-02 FALSE 3.48e-02 3.48e-02
15 D-1 2007-12-12 Post N 2.93e-01 FALSE 2.93e-01 2.93e-01
16 D-1 2007-12-12 Post NH4 9.70e-01 FALSE 9.70e-01 9.70e-01
17 D-1 2007-12-12 Post NiTot 1.00e-02 TRUE 0.00e+00 1.00e-02
18 D-1 2007-12-12 Post PbTot 2.53e-02 FALSE 2.53e-02 2.53e-02
19 D-1 2007-12-12 Post SC 6.30e+02 FALSE 6.30e+02 6.30e+02
20 D-1 2007-12-12 Post SO4 6.58e+01 FALSE 6.58e+01 6.58e+01
21 D-1 2007-12-12 Post SbDis 8.25e-04 TRUE 0.00e+00 8.25e-04
22 D-1 2007-12-12 Post SeTot 1.32e-03 FALSE 1.32e-03 1.32e-03
23 D-1 2007-12-12 Post TDS 3.20e+02 FALSE 3.20e+02 3.20e+02
24 D-1 2007-12-12 Post TlTot 3.00e-04 TRUE 0.00e+00 3.00e-04
25 D-1 2007-12-12 Post ZnTot 1.14e+01 FALSE 1.14e+01 1.14e+01
26 D-1 2007-12-12 Post pH 7.80e+00 FALSE 7.80e+00 7.80e+00
There are only 26 chemicals reported for that row, but 54 different
chemicals appear somewhere in the data, so each row has 54 value columns.
Thus, NA values must be inserted to fill out the data frame. (The problem
gets worse when you try to keep the other data columns, such as ceneq1,
floor and ceiling, as id columns. They introduce additional distinct
combinations, so you end up with more rows and fewer values in each row.)
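The same effect can be reproduced on a small scale. The following is only
a sketch: the data frame "toy" and its values are invented for
illustration, with column names borrowed from your data, and it assumes
reshape2 is installed.

```r
library(reshape2)

# Hypothetical long-form data: one row per (site, sampdate, param).
# Site D-1 has no TDS measurement.
toy <- data.frame(
  site     = c("D-1", "D-1", "D-2", "D-2", "D-2"),
  sampdate = as.Date(rep("2007-12-12", 5)),
  param    = c("Alk", "pH", "Alk", "pH", "TDS"),
  quant    = c(231, 7.8, 217, 7.18, 822),
  floor    = c(231, 7.8, 217, 7.18, 822),
  stringsAsFactors = FALSE
)

# Two id combinations and three parameters make a 2 x 3 grid of cells,
# but only five measurements exist, so one cell has to be NA.
wide <- dcast(toy, site + sampdate ~ param, value.var = "quant")
print(wide)

# Adding the floating point column "floor" as an id makes every input row
# its own id combination: five rows, each with a single non-NA value.
wide_bad <- dcast(toy, site + sampdate + floor ~ param, value.var = "quant")
print(wide_bad)
```

With the first formula the only NA is the TDS value for D-1, which was
never measured. With the second formula most of the cells are NA, which is
the pattern in the output arun posted.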
I am not familiar with the NADA package, so I cannot suggest what you
SHOULD be doing, but it does seem that you should study some more examples
of its use to figure out what form your data needs to be in.
On Wed, 8 Aug 2012, arun wrote:
> Hi,
>
> I tried converting factors to character, but the results still has NAs.
> convert.type1 <- function(obj,types){
> for (i in 1:length(obj)){
> FUN <- switch(types[i],character = as.character,
> numeric = as.numeric,
> factor = as.factor,
> Date=as.Date.character,
> logical=as.logical)
> obj[,i] <- FUN(obj[,i])
> }
> obj
> }
>
> sample.melt1<-convert.type1(sample.melt,c("character","Date","character","character","logical","numeric","numeric","character","numeric"))
> str(sample.melt1)
> #'data.frame': 715 obs. of 9 variables:
> # $ site : chr "D-1" "D-1" "D-1" "D-1" ...
> # $ sampdate: Date, format: "2007-12-12" "2007-12-12" ...
> # $ era : chr "Post" "Post" "Post" "Post" ...
> # $ param : chr "AgTot" "AlTot" "Alk" "AsTot" ...
> # $ ceneq1 : logi TRUE FALSE FALSE FALSE TRUE FALSE ...
> # $ floor : num 0 0.106 231 0.0113 0 100 0 1.43 0 0.0239 ...
> # $ ceiling : num 1.30e-04 1.06e-01 2.31e+02 1.13e-02 5.00e-03 1.00e+02 2.00e-04 1.43 6.00e-03 2.39e-02 ...
> # $ variable: chr "quant" "quant" "quant" "quant" ...
> # $ value : num 1.30e-04 1.06e-01 2.31e+02 1.13e-02 5.00e-03 1.00e+02 2.00e-04 1.43 6.00e-03 2.39e-02 ...
>
> sample.cast <- dcast(sample.melt1, site + sampdate + era + ceneq1 + floor + ceiling ~ param)
> head(sample.cast)
> #site sampdate era ceneq1 floor ceiling AgDis AgTot AlDis Alk AlTot AsDis
> #1 D-1 2007-12-12 Post FALSE 0.00132 0.00132 NA NA NA NA NA NA
> #2 D-1 2007-12-12 Post FALSE 0.01130 0.01130 NA NA NA NA NA NA
> #3 D-1 2007-12-12 Post FALSE 0.02390 0.02390 NA NA NA NA NA NA
> #4 D-1 2007-12-12 Post FALSE 0.02530 0.02530 NA NA NA NA NA NA
> #5 D-1 2007-12-12 Post FALSE 0.03480 0.03480 NA NA NA NA NA NA
> #6 D-1 2007-12-12 Post FALSE 0.10600 0.10600 NA NA NA NA 0.106 NA
> #---------------------------------------------
> #---------------------------------------------
> #SO4 SrDis TDS TlDis TlTot TSS Vdis ZnDis ZnTot
> #1 NA NA NA NA NA NA NA NA NA
> #2 NA NA NA NA NA NA NA NA NA
> #3 NA NA NA NA NA NA NA NA NA
> #4 NA NA NA NA NA NA NA NA NA
> #5 NA NA NA NA NA NA NA NA NA
> #6 NA NA NA NA NA NA NA NA NA
>
> A.K.
>
>
>
>
>
> ----- Original Message -----
> From: Rich Shepard <rshepard at appl-ecosys.com>
> To: R help <r-help at r-project.org>
> Cc:
> Sent: Wednesday, August 8, 2012 10:48 PM
> Subject: Re: [R] reshape2's dcast() Adds NAs to Data Frame
>
> On Wed, 8 Aug 2012, Jeff Newmiller wrote:
>
>> The explanation is that this is normal and consistent with behavior of
>> factors in general. If you don't want that, it is common to work with
>> character data instead of factors, only converting to factor when needed.
>> In most cases I invoke read.table with the as.is=TRUE argument and delay
>> converting to factors until I need them. Other people convert from factor
>> to character and back to factor to get rid of unwanted factor levels on an
>> as-needed basis.
>
> Jeff,
>
> First thing tomorrow I will research the difference between characters and
> data; I assumed they were the same.
>
> Thanks,
>
> Rich
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------