[R-sig-eco] na.exclude in vegan

Sat Feb 20 08:52:29 CET 2010

On 19/02/10 22:24 PM, "L Quinn" <lquinn at hotmail.com> wrote:

> I am attempting a cca in vegan with a dataset with missing values in both the
> community dataset and the environmental dataset. I tried entering commands for
> na.omit or na.exclude, but didn't have any luck. Perhaps I'm entering them in
> the wrong format?
>
I already answered this briefly in a private message. Here is a slightly
longer answer for everyone.

cca() in vegan does not allow missing values in the dependent (response)
data. This is documented under the 'na.action' item on the cca help page.

The reason for this behaviour is that nobody cared to implement NA handling
for dependent data. The cca() function was designed for community data, and
community data should have no missing values. It is not quite clear to me
how you should implement NA handling in community data: should you regard
the observation (row) as missing or the species (column) as missing?

Assuming that you want to regard the row as missing, you can work around
this restriction in cca() by using the argument 'subset'. With the example
from your message, you could write:

cca(ind ~ ., env, na.action = na.omit, subset = complete.cases(ind))

which will select only the complete rows (those without NA entries) of the
'ind' data. It is consistent to use 'na.action = na.omit' since 'subset'
works in the 'na.omit' fashion, but technically you can also use
'na.action = na.exclude'.
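
To see what the 'subset' selection does, here is a minimal base R sketch. The
data frame below is a made-up stand-in for 'ind', not the data from the
original message:

```r
# Hypothetical response table with NA in rows 2 and 4.
ind <- data.frame(a = c(1.2, NA, 3.1, 4.0, 5.3),
                  b = c(2.0, 3.5, 4.1, NA, 6.2))

# complete.cases() returns one logical value per row:
# TRUE when the row has no NA in any column.
keep <- complete.cases(ind)
print(keep)

# These are the rows that a 'subset = keep' selection would retain.
print(ind[keep, ])
```

The same logical vector is what 'subset = complete.cases(ind)' passes on in the
call above.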

Finally, a comment on the use of cca(): it is not really suitable for
anything other than community data, nor for cases where the response
variables are measured in different and arbitrary units. The infamous
double-standardization of CA can balance scarce and abundant species, but it
cannot adjust variables in different units. CA and CCA are based on row
profiles, that is, on the allocation of row totals among columns in
proportion to column totals. This is not meaningful if your columns are not
expressed in equal and comparable units. Therefore I recommend using rda
with scaling:

rda(ind ~ ., env, na.action = na.omit, subset = complete.cases(ind), scale = TRUE)
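
The point about units can be illustrated with a small base R sketch. The
numbers and the helper row_profiles() are invented for illustration and are not
part of vegan. The sketch only shows that row profiles depend on the unit
chosen for a column, whereas columns standardized to unit variance (which is
what 'scale = TRUE' asks for in rda) do not:

```r
# Hypothetical measurements: two variables in different units on three rows.
x <- cbind(length_cm = c(10, 20, 30), mass_g = c(500, 300, 100))

# Row profile: each row divided by its row total, so every row sums to 1.
row_profiles <- function(m) m / rowSums(m)

# The same observations with length re-expressed in millimetres.
x_mm <- x
x_mm[, "length_cm"] <- x[, "length_cm"] * 10
colnames(x_mm)[1] <- "length_mm"

p_cm <- row_profiles(x)
p_mm <- row_profiles(x_mm)

# The profiles differ although nothing was re-measured.
print(round(p_cm, 3))
print(round(p_mm, 3))

# Columns standardized to zero mean and unit variance are the same
# whichever unit was used for length.
same_scaled <- isTRUE(all.equal(unname(scale(x)), unname(scale(x_mm)),
                                check.attributes = FALSE))
print(same_scaled)
```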

As a minor detail, you may lose a large part of your data if you have NA in
both ind and env: only two observations were left in the example subset you
had in your message. I'm not sure that models fitted to such fragmented data
are meaningful.
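
Before fitting, it may be worth counting how many rows survive. A minimal
sketch in base R, with made-up tables standing in for 'ind' and 'env':

```r
# Hypothetical data: NA in rows 2 and 4 of 'ind', rows 3 and 5 of 'env'.
ind <- data.frame(a = c(1.2, NA, 3.1, 4.0, 5.3, 2.2),
                  b = c(2.0, 3.5, 4.1, NA, 6.2, 1.9))
env <- data.frame(pH = c(5.1, 6.0, NA, 5.5, 6.2, 5.8),
                  N  = c(1.2, 0.8, 1.0, 1.1, NA, 0.9))

# complete.cases() accepts several tables and flags the rows
# that are free of NA in all of them.
n_left <- c(ind_only = sum(complete.cases(ind)),
            env_only = sum(complete.cases(env)),
            both     = sum(complete.cases(ind, env)))
print(n_left)
print(nrow(ind))
```

Each table on its own loses only a third of the rows here, but requiring
complete rows in both leaves far fewer.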

Best wishes, Jari Oksanen