[R] reshape2's dcast() Adds NAs to Data Frame

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Thu Aug 9 08:03:35 CEST 2012


I took a closer look, and unused factor levels are not the problem... the 
problem is defining the id variables appropriately.

1) "sample" is the name of a built-in function, so it is not advisable to 
use it as the name of a data object.
I have used "samp" instead of "sample".

2) Your input data is essentially in long form already, so you don't need 
to melt it.

3) It is almost never a good idea to use a floating point column as an id 
variable.
Perhaps you were imagining something like:

   > samp.cast <- dcast(samp[,1:5], site+sampdate+era~param, value.var="quant")
   > str(samp.cast)
   'data.frame':  35 obs. of  57 variables:
   $ site    : Factor w/ 5 levels "D-1","D-2","D-3",..: 1 1 1 2 2 2 2 2 2 2 ...
   $ sampdate: Date, format: "2007-12-12" "2008-03-15" "2009-09-02" "2010-06-10" ...
   $ era     : Factor w/ 2 levels "Post","Pre": 1 1 1 1 1 1 1 1 1 1 ...
   $ AgDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ AgTot   : num  0.00013 0.00013 0.00013 0.00013 0.00013 0.00013 0.00013 0.00013 0.00013 0.00013 ...
   $ AlDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ AlTot   : num  0.106 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 ...
   $ Alk     : num  231 228 208 217 226 214 194 187 179 188 ...
   $ AsDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ AsTot   : num  0.0113 0.0008 0.0008 0.0017 0.0027 0.0007 0.0022 0.0029 0.0023 0.0027 ...
   $ BaDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ BeDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ BeTot   : num  0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 ...
   $ BiDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ CaDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ CaTot   : num  100 88.4 163 200 244 0.04 122 112 98.4 103 ...
   $ CdDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ CdTot   : num  2e-04 2e-04 2e-04 2e-04 2e-04 2e-04 2e-04 2e-04 2e-04 2e-04 ...
   $ ClTot   : num  1.43 1.34 13.7 16.8 19.1 15.1 10.9 9.37 8.49 10.4 ...
   $ CoDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ CrDis   : num  0.006 0.006 0.006 0.006 0.006 0.006 0.006 0.006 0.006 0.006 ...
   $ CrTot   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ CuDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ CuTot   : num  0.0239 0.0137 0.0015 0.00106 0.00106 0.00353 0.00108 0.009 0.00236 0.00144 ...
   $ DO      : num  4.96 9.91 6.98 6.2 6.47 5.73 5.84 5.74 6.12 6.39 ...
   $ FeDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ FeTot   : num  4.11 0.309 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.384 ...
   $ HgDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ HgTot   : num  NA 5.00e-05 5.00e-05 7.22e-07 1.93e-06 6.82e-07 6.56e-07 1.06e-06 1.41e-06 2.58e-05 ...
   $ MgDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ MgTot   : num  9.56 9.15 14.6 22.4 27 0.06 13.7 12.8 11 11.4 ...
   $ MnDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ MnTot   : num  0.0348 0.0474 0.0231 0.004 0.004 0.004 0.004 0.004 0.004 0.0049 ...
   $ MoDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ N       : num  0.293 0.05 15.8 41.2 54.7 34.5 16.7 13.9 10.4 11.9 ...
   $ NH4     : num  0.97 0.82 0.036 0.03 0.06 0.03 0.034 0.045 0.03 0.031 ...
   $ NaDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ NiDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ NiTot   : num  0.01 0.224 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
   $ PbDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ PbTot   : num  0.0253 0.0083 0.00596 0.0003 0.0003 0.0003 0.0003 0.00129 0.0003 0.000599 ...
   $ Pdis    : num  NA NA NA NA NA NA NA NA NA NA ...
   $ SC      : num  630 633 853 1129 1303 ...
   $ SO4     : num  65.8 75.4 159 226 268 167 101 83.3 69.9 61.3 ...
   $ SbDis   : num  0.000825 0.000825 0.000825 0.000825 0.000825 0.000825 0.000825 0.000825 0.000825 0.000825 ...
   $ SbTot   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ SeDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ SeTot   : num  0.00132 0.00122 0.00125 0.00181 0.00131 0.00114 0.00125 0.00125 0.00125 0.00138 ...
   $ SrDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ TDS     : num  320 300 581 822 1020 662 507 418 335 385 ...
   $ TSS     : num  NA NA NA NA NA NA NA NA NA NA ...
   $ TlDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ TlTot   : num  3e-04 3e-04 3e-04 3e-04 3e-04 3e-04 3e-04 3e-04 3e-04 3e-04 ...
   $ Vdis    : num  NA NA NA NA NA NA NA NA NA NA ...
   $ ZnDis   : num  NA NA NA NA NA NA NA NA NA NA ...
   $ ZnTot   : num  11.4 12.4 2.42 0.0406 0.0462 0.0318 0.0179 0.032 0.0178 0.0362 ...
   $ pH      : num  7.8 7.94 6.9 7.18 6.8 7.09 7.24 7.09 7.49 7.46 ...

Clearly, this still includes NA values, but if we look at the input data 
corresponding to
the first row:

   > subset(samp,(site=="D-1")&("2007-12-12"==sampdate)&("Post"==era))
   site   sampdate  era param    quant ceneq1    floor  ceiling
   1   D-1 2007-12-12 Post AgTot 1.30e-04   TRUE 0.00e+00 1.30e-04
   2   D-1 2007-12-12 Post AlTot 1.06e-01  FALSE 1.06e-01 1.06e-01
   3   D-1 2007-12-12 Post   Alk 2.31e+02  FALSE 2.31e+02 2.31e+02
   4   D-1 2007-12-12 Post AsTot 1.13e-02  FALSE 1.13e-02 1.13e-02
   5   D-1 2007-12-12 Post BeTot 5.00e-03   TRUE 0.00e+00 5.00e-03
   6   D-1 2007-12-12 Post CaTot 1.00e+02  FALSE 1.00e+02 1.00e+02
   7   D-1 2007-12-12 Post CdTot 2.00e-04   TRUE 0.00e+00 2.00e-04
   8   D-1 2007-12-12 Post ClTot 1.43e+00  FALSE 1.43e+00 1.43e+00
   9   D-1 2007-12-12 Post CrDis 6.00e-03   TRUE 0.00e+00 6.00e-03
   10  D-1 2007-12-12 Post CuTot 2.39e-02  FALSE 2.39e-02 2.39e-02
   11  D-1 2007-12-12 Post    DO 4.96e+00  FALSE 4.96e+00 4.96e+00
   12  D-1 2007-12-12 Post FeTot 4.11e+00  FALSE 4.11e+00 4.11e+00
   13  D-1 2007-12-12 Post MgTot 9.56e+00  FALSE 9.56e+00 9.56e+00
   14  D-1 2007-12-12 Post MnTot 3.48e-02  FALSE 3.48e-02 3.48e-02
   15  D-1 2007-12-12 Post     N 2.93e-01  FALSE 2.93e-01 2.93e-01
   16  D-1 2007-12-12 Post   NH4 9.70e-01  FALSE 9.70e-01 9.70e-01
   17  D-1 2007-12-12 Post NiTot 1.00e-02   TRUE 0.00e+00 1.00e-02
   18  D-1 2007-12-12 Post PbTot 2.53e-02  FALSE 2.53e-02 2.53e-02
   19  D-1 2007-12-12 Post    SC 6.30e+02  FALSE 6.30e+02 6.30e+02
   20  D-1 2007-12-12 Post   SO4 6.58e+01  FALSE 6.58e+01 6.58e+01
   21  D-1 2007-12-12 Post SbDis 8.25e-04   TRUE 0.00e+00 8.25e-04
   22  D-1 2007-12-12 Post SeTot 1.32e-03  FALSE 1.32e-03 1.32e-03
   23  D-1 2007-12-12 Post   TDS 3.20e+02  FALSE 3.20e+02 3.20e+02
   24  D-1 2007-12-12 Post TlTot 3.00e-04   TRUE 0.00e+00 3.00e-04
   25  D-1 2007-12-12 Post ZnTot 1.14e+01  FALSE 1.14e+01 1.14e+01
   26  D-1 2007-12-12 Post    pH 7.80e+00  FALSE 7.80e+00 7.80e+00

Only 26 parameters were measured for that combination of site, date
and era, but there are 54 distinct parameters in the whole data set,
and each one gets its own column.  Thus dcast() has to insert NA
values into the other 28 cells to fill out the row.  (The problem gets
worse when you try to keep the other data columns, ceneq1, floor and
ceiling, as id columns... they represent additional distinct
combinations, so you end up with more rows and fewer values in each
row.)
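
A minimal sketch with made-up data may make this concrete. The "toy" 
data frame below is hypothetical (it is not your data), and I am only 
assuming it has the same general shape: one row per measured parameter, 
with a floating point "floor" column alongside the value.

```r
library(reshape2)  # provides dcast()

# Hypothetical long-form data: two sampling events, three possible
# parameters.  The second event was never analyzed for "Alk".
toy <- data.frame(
  site     = c("D-1", "D-1", "D-1", "D-2", "D-2"),
  sampdate = as.Date(c("2007-12-12", "2007-12-12", "2007-12-12",
                       "2008-03-15", "2008-03-15")),
  param    = c("Alk", "pH", "TDS", "pH", "TDS"),
  quant    = c(231, 7.8, 320, 7.94, 300),
  floor    = c(231, 7.8, 320, 7.94, 300),
  stringsAsFactors = FALSE
)

# One row per (site, sampdate).  The Alk cell for D-2 has no source
# row, so dcast() has nothing to put there except NA.
wide <- dcast(toy, site + sampdate ~ param, value.var = "quant")
print(wide)

# Same data, but with the floating point column used as an id variable.
# Every distinct floor value starts a new row, so each row ends up
# holding a single measured value and NA everywhere else.
wide.bad <- dcast(toy, site + sampdate + floor ~ param, value.var = "quant")
print(wide.bad)
```

The first call gives two rows with one NA; the second gives five rows, 
each with only one non-NA parameter, which is the pattern in the 
head(sample.cast) output quoted below.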

I am not familiar with the NADA package, so I cannot suggest what you 
SHOULD be doing, but it does seem that you should study some more 
examples of its use to figure out what form your data needs to be in.

On Wed, 8 Aug 2012, arun wrote:

> Hi,
>
> I tried converting factors to character, but the result still has NAs.
> convert.type1 <- function(obj,types){
>     # Coerce column i of obj to the class named in types[i]
>     for (i in seq_along(obj)){
>         FUN <- switch(types[i],
>                       character = as.character,
>                       numeric = as.numeric,
>                       factor = as.factor,
>                       Date = as.Date,
>                       logical = as.logical)
>         obj[,i] <- FUN(obj[,i])
>     }
>     obj
> }
>
> sample.melt1<-convert.type1(sample.melt,c("character","Date","character","character","logical","numeric","numeric","character","numeric"))
>  str(sample.melt1)
> #'data.frame':    715 obs. of  9 variables:
> # $ site    : chr  "D-1" "D-1" "D-1" "D-1" ...
> # $ sampdate: Date, format: "2007-12-12" "2007-12-12" ...
> # $ era     : chr  "Post" "Post" "Post" "Post" ...
> # $ param   : chr  "AgTot" "AlTot" "Alk" "AsTot" ...
> # $ ceneq1  : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
> # $ floor   : num  0 0.106 231 0.0113 0 100 0 1.43 0 0.0239 ...
> # $ ceiling : num  1.30e-04 1.06e-01 2.31e+02 1.13e-02 5.00e-03 1.00e+02 2.00e-04 1.43 6.00e-03 2.39e-02 ...
> # $ variable: chr  "quant" "quant" "quant" "quant" ...
> # $ value   : num  1.30e-04 1.06e-01 2.31e+02 1.13e-02 5.00e-03 1.00e+02 2.00e-04 1.43 6.00e-03 2.39e-02 ...
>
> sample.cast <- dcast(sample.melt1, site + sampdate + era + ceneq1 + floor + ceiling ~ param)
> head(sample.cast)
>   #site   sampdate  era ceneq1   floor ceiling AgDis AgTot AlDis Alk AlTot AsDis
> #1  D-1 2007-12-12 Post  FALSE 0.00132 0.00132    NA    NA    NA  NA    NA    NA
> #2  D-1 2007-12-12 Post  FALSE 0.01130 0.01130    NA    NA    NA  NA    NA    NA
> #3  D-1 2007-12-12 Post  FALSE 0.02390 0.02390    NA    NA    NA  NA    NA    NA
> #4  D-1 2007-12-12 Post  FALSE 0.02530 0.02530    NA    NA    NA  NA    NA    NA
> #5  D-1 2007-12-12 Post  FALSE 0.03480 0.03480    NA    NA    NA  NA    NA    NA
> #6  D-1 2007-12-12 Post  FALSE 0.10600 0.10600    NA    NA    NA  NA 0.106    NA
>   #---------------------------------------------
>   #---------------------------------------------
>   #SO4 SrDis TDS TlDis TlTot TSS Vdis ZnDis ZnTot
> #1  NA    NA  NA    NA    NA  NA   NA    NA    NA
> #2  NA    NA  NA    NA    NA  NA   NA    NA    NA
> #3  NA    NA  NA    NA    NA  NA   NA    NA    NA
> #4  NA    NA  NA    NA    NA  NA   NA    NA    NA
> #5  NA    NA  NA    NA    NA  NA   NA    NA    NA
> #6  NA    NA  NA    NA    NA  NA   NA    NA    NA
>
> A.K.
>
> ----- Original Message -----
> From: Rich Shepard <rshepard at appl-ecosys.com>
> To: R help <r-help at r-project.org>
> Cc:
> Sent: Wednesday, August 8, 2012 10:48 PM
> Subject: Re: [R] reshape2's dcast() Adds NAs to Data Frame
>
> On Wed, 8 Aug 2012, Jeff Newmiller wrote:
>
>> The explanation is that this is normal and consistent with behavior of
>> factors in general. If you don't want that, it is common to work with
>> character data instead of factors, only converting to factor when needed.
>> In most cases I invoke read.table with the as.is=TRUE argument and delay
>> converting to factors until I need them. Other people convert from factor
>> to character and back to factor to get rid of unwanted factor levels on an
>> as-needed basis.
>
> Jeff,
>
>   First thing tomorrow I will research the difference between characters and
> data; I assumed they were the same.
>
> Thanks,
>
> Rich
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------

