[R-sig-eco] Is stepacross a must?

Sat Jun 23 13:29:18 CEST 2018

Dear Alexandre,

You can check how much difference extending the dissimilarities makes to the overall compositional patterns of your data by comparing PCoA plots based on extended vs. original dissimilarities. If your compositional gradient is long, the original dissimilarities may give you an ordination where the main compositional gradient is forced into a circle, because no sites are allowed to be separated by more than 1 unit in the ordination space. In this case, it will be very hard to interpret the compositional gradient or to find out which explanatory variables it might be related to. Classifications may also end up grouping sites from different parts of the gradient with each other. Extended dissimilarities should give you an ordination where the main compositional gradient is straightened up and more or less parallel to the first PCoA axis, because sites are then allowed to be separated by more than 1 unit. This makes it more likely that any explanatory variables that correlate with the main compositional gradient actually get discovered in the analyses, and that classification results are more consistent with the compositional gradient as well.

As a general rule, I'd say that if the percentage of saturated dissimilarities is large (>20%), it makes sense to use extended dissimilarities instead of the original ones. To help further in deciding, make a histogram of the extended dissimilarities. The larger the maximum, and the larger the percentage of values exceeding unity, the more likely it is that you get more realistic results with the extended than with the original dissimilarities.

I hope this helps.
Hanna Tuomisto

> On 23 Jun 2018, at 13:11, Alexandre F. Souza <alexsouza.cb.ufrn.br using gmail.com> wrote:
> 
> ---------- Forwarded message ----------
> From: Alexandre F. Souza <alexsouza.cb.ufrn.br using gmail.com>
> Date: 2018-06-23 7:04 GMT-03:00
> Subject: Is stepacross a must?
> To: Lista de discussao R-sig-ecology <r-sig-ecology using r-project.org>
> 
> 
> Dear friends,
> 
> I am in deep doubt regarding whether or not to apply vegan'a stepacross
> function to my dissimilarity matrix. The matrix is a simpson dissimilarity
> matrix with 665 rows reflecting beta diversity between localities along the
> brazilian atlantic florest, with over 4000 species most of which are
> singletons. They were retained for biological reasons.
> 
> We are appying a community modelling approach with k-means in order to find
> the best number of clusters this dataset can be divided. The problem is
> that we are obtaining quite different results with or without applying the
> stepacross function. According to Tuomisto et al. 2012:
> 
> ..."the stepacross
> method or extended dissimilarities (Williamson 1978,
> De’ath 1999), originally developed to improve the performance
> of ordination. Th e aim is to fi nd the shortest path
> between sites that do not share any species by using intermediate
> sites as stepping-stones. The process replaces all saturated
> dissimilarity values (or any dissimilarities larger than
> a specifi ed threshold value) with new values that can exceed
> unity. The new values can be as large as is necessary to reflect the amount
> of compositional heterogeneity in the dataset,
> so using extended dissimilarities instead of the original
> ones in Mantel tests and multiple regression on dissimilarity
> matrices can be expected to improve the performance of
> these methods."
> 
> I reproduce the code below. Without stepacross we obtain dissimilarities by
> large dominated by 1, and as many as 25 groups.
> With stepacross they are more unimodal since the =1 distances were allowed
> to have higher values, but as few as 6 groups! Furthermore, the final
> number of groups varies more between runs with stepacross than without it.
> With stepacross it seems almost to alternate between 6 and 11 groups, with
> 6 being more frequent.
> 
> What do you thin it would be wiser to do? I am leaning towards the
> stepacross.... but I am not that sure. The help page of the function
> mentions that in cases with high beta-diversity datasets stepacross must be
> used.
> 
> Thanks for any thoughts,
> 
> Alexandre
> 
> simpdist = recluster.dist(sp, dist = "simpson")
> simp.step = stepacross(simpdist)
> dissimilarity_matrix<-as.matrix(simp.step)
> 
> # Perform an ordination via Principal Coordinate Analysis (PCoA) with
> correction for negative eigenvalues [add = T]
> ordination_output<-cmdscale(dissimilarity_matrix,
> k=(nrow(dissimilarity_matrix)-1), eig = F, add = T, x.ret = FALSE)
> pcoa_scores<-as.data.frame(ordination_output[[1]]) # Extract the PCoA
> scores for each site
> 
> # Get values of the evaluation criterion (Within-Groups SS) for k varying
> from 2 to 'number of sites minus 1'
> sum_of_squares <- foreach(i = 2:(ncol(dissimilarity_matrix)-1),
>                          .combine = 'cbind')  %dopar% { # cbind the
> results of foreach
>                            classification<-kmeans(pcoa_scores, i,
> iter.max=1000, nstart=100)#usar inter.max=1000 e nstart=100
>                            classification$tot.withinss} # the
> Within-Groups SS for k=i
> y<-t(as.matrix(sum_of_squares)) # convert Within-Group SS to a vector (y
> for piecewise regressions)
> x<-c(2:(ncol(dissimilarity_matrix)-1)) # number of clusters (x for
> piecewise regressions)
> 
> # Create an empty matrix to receive the values of 'max-k entered a priori'
> and the respective 'optimal k selected'
> initial_kmax<-c(4:ncol(dissimilarity_matrix)) # At least 4 points to
> produce two lines (L-method)
> cluster_results<-cbind(initial_kmax, c(NA)) # Dataframe to receive values
> of the breakpoints
> 
> # Perform multiple piecewise regression varying the max-k entered a priori
> from 4 to 'number of sites'
> for (j in 4:ncol(dissimilarity_matrix)){ # the maximum number of 'k'
> established a priori
>  y_selected<-y[1:j] # Limit the number of starting points (y axis)
>  x_selected<-x[1:j] # Limit the number of starting points (x axis)
>  find_the_knee <- foreach(i = 2:(j-1), # calculate the RSE for all
> possible breakpoints (i) when starting with k-max = j
>                           .combine = 'cbind')  %dopar% { #
> 
> piecewise_k<-lm(y_selected~x_selected*(x_selected<i)
> + x_selected*(x_selected>=i))
>                             summary(piecewise_k)[6]} # print the RSE for
> each piecewise regression
>  find_the_knee<-c(do.call("cbind",find_the_knee)) # convert the previous
> output (list) to vector
>  knee_found<-as.data.frame(cbind(c(2:(j-1)), find_the_knee)) # combine
> position of breakpoint and RSE value
>  names(knee_found)<-c("breakpoint", "RSE") # rename
>  knee_found<-na.omit(knee_found) # remove NaN values
>  k_number<-knee_found[which(knee_found$RSE==min(knee_found$RSE)),1] #
> select breakpoint that minimizes RSE
>  cluster_results[(j-3),2]<-k_number} # print the maximum number of initial
> groups (j) and the respective selected breakpoint (k)
> (Sys.time()-a) # How long it takes?
> 
> ############################################################
> ##############################
> # STEP 3 Inspect the variation in the values of optimal 'k' and identify
> the most frequent 'k'
> 
> # Identify the most frequent 'k' (math-mode)
> most_frequent_k<-as.data.frame(cluster_results)[,2]
> optimal_k<-names(sort(-table(most_frequent_k)))[1]
> optimal_k # what is the most frequent value?
> hist(most_frequent_k, n=50, ylab="Frequency of k", xlab="Value of the
> optimal k", main="", cex.axis=1.0, cex.lab=1.2, las=1, col="grey")
> 
> 
> 
> 
> -- 
> Dr. Alexandre F. Souza
> Professor Adjunto IV
> Universidade Federal do Rio Grande do Norte
> CB, Departamento de Ecologia
> Campus Universitário - Lagoa Nova
> 59072-970 - Natal, RN - Brasil
> lattes: lattes.cnpq.br/7844758818522706
> http://www.esferacientifica.com.br
> http://www.docente.ufrn.br/alexsouza
> orcid.org/0000-0001-7468-3631 <http://www.docente.ufrn.br/alexsouza>
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology using r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology