--- title: "Quick start for the sommer package" author: "Giovanny Covarrubias-Pazaran" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick start for the sommer package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The sommer package was developed to provide R users with a powerful and reliable multivariate mixed model solver. ThThe package is focused on two approaches: 1) p > n (more effects to estimate than observations) using the mmer() function, and 2) n > p (more observations than effects to estimate) using the mmec() function. The core algorithms are coded in C++ using the Armadillo library. This package allows the user to fit mixed models with the advantage of specifying the variance-covariance structure for the random effects, specifying heterogeneous variances, and obtaining other parameters such as BLUPs, BLUEs, residuals, fitted values, variances for fixed and random effects, etc. The purpose of this quick start guide is to show the flexibility of the package under certain common scenarios: **SECTION 1: Introduction** 1) Background on mixed models 2) Background on covariance structures **SECTION 2: Different models enabled in sommer** 1) Univariate homogeneous variance models 2) Univariate heterogeneous variance models 3) Univariate unstructured variance models 4) Multivariate homogeneous variance models 5) Details on special functions for variance models + the major vsr() and vsc() functions for special variance models and its auxiliars: + atr() and atc() specific levels heterogeneous variance structure + dsr() and dsc() diagonal covariance structure + usr() and usc() unstructured covariance + csr() and csc() customized covariance structure + isc() function for main effects 6) The specification of constraints in the variance-covariance structures + unsm() unstructured constraint + fixm() fixed constraint + fcm() constraints on fixed effects 7) Special functions for special models + Random regression models + Overlayed models + Spatial models + GWAS models + Customized random effects 8) Genomic selection (predicting mendelian sampling) + GBLUP + rrBLUP 9) Likelihood ratio tests 10) Final remarks **SECTION 3: The predict function** 1) Background on prediction 2) Predicting means ## SECTION 1: Introduction ### 1) Background on mixed models The core of the package is the `mmer` function which solves the mixed model equations. The functions are an interface to call the `NR` Direct-Inversion Newton-Raphson or Average Information algorithms (Tunnicliffe 1989; Gilmour et al. 1995; Lee et al. 2016). From version 2.0, sommer can handle multivariate models. Following Maier et al. (2015), the multivariate (and by extension the univariate) mixed model implemented has the form:
$y_1 = X_1\beta_1 + Z_1u_1 + \epsilon_1$ $y_2 = X_2\beta_2 + Z_2u_2 + \epsilon_2$ ... $y_i = X_i\beta_i + Z_iu_i + \epsilon_i$
where $y_i$ is a vector of trait phenotypes, $\beta_i$ is a vector of fixed effects, $u_i$ is a vector of random effects for individuals and $e_i$ are residuals for trait ‘i’ (i = 1, …, t). The random effects ($u_1$ ... $u_i$ and $e_i$) are assumed to be normally distributed with mean zero. X and Z are incidence matrices for fixed and random effects respectively. The distribution of the multivariate response and the phenotypic variance covariance (V) are:
$Y = X\beta + ZU + \epsilon_i$
Y ~ MVN($X\beta$, V)
$$\mathbf{Y} = \left[\begin{array} {r} y_1 \\ y_2 \\ ... \\ y_t \\ \end{array}\right] $$
$$\mathbf{X} = \left[\begin{array} {rrr} X_1 & 0 & 0 \\ \vdots & \ddots & \vdots\\ 0 & 0 & X_t \\ \end{array}\right] $$
$$\mathbf{V} = \left[\begin{array} {rrr} Z_1 K{\sigma^2_{g_{1}}} Z_1' + H{\sigma^2_{\epsilon_{1}}} & ... & Z_1 K{\sigma_{g_{1,t}}} Z_t' + H{\sigma_{\epsilon_{1,t}}} \\ \vdots & \ddots & \vdots\\ Z_1 K{\sigma_{g_{1,t}}} Z_t' + H{\sigma_{\epsilon_{1,t}}} & ... & Z_t K{\sigma^2_{g_{t}}} Z_t' + H{\sigma^2_{\epsilon_{t}}} \\ \end{array}\right] $$
where K is the relationship or covariance matrix for the kth random effect (u=1,…,k), and H=I is an identity matrix or a partial identity matrix for the residual term. The terms $\sigma^2_{g_{i}}$ and $\sigma^2_{\epsilon_{i}}$ denote the genetic (or any of the kth random terms) and residual variance of trait ‘i’, respectively and $\sigma_{g_{_{ij}}}$ and $\sigma_{\epsilon_{_{ij}}}$ the genetic (or any of the kth random terms) and residual covariance between traits ‘i’ and ‘j’ (i=1,…,t, and j=1,…,t). The algorithm implemented optimizes the log likelihood:
$logL = 1/2 * ln(|V|) + ln(X'|V|X) + Y'PY$
where || is the determinant of a matrix. And the REML estimates are updated using a Newton optimization algorithm of the form:
$\theta^{k+1} = \theta^{k} + (H^{k})^{-1}*\frac{dL}{d\sigma^2_i}|\theta^k$
Where $\theta$ is the vector of variance components for random effects and covariance components among traits, $H^{-1}$ is the inverse of the Hessian matrix of second derivatives for the kth cycle, $\frac{dL}{d\sigma^2_i}$ is the vector of first derivatives of the likelihood with respect to the variance-covariance components. The Eigen decomposition of the relationship matrix proposed by Lee and Van Der Werf (2016) was included in the Newton-Raphson algorithm to improve time efficiency. Additionally, the popular `vpredict()` function to estimate standard errors for linear combinations of variance components (i.e. heritabilities and genetic correlations) was added to the package as well. Please refer to the canonical papers listed in the Literature section to check how the algorithms work. We have tested widely the methods to make sure they provide the same solution when the likelihood behaves well but for complex problems they might lead to slightly different answers. If you have any concern please contact me at cova_ruber@live.com.mx. In the following section we will go in detail over several examples on how to use mixed models in univariate and multivariate case and their use in quantitative genetics.
### 2) Background on covariance structures One of the major strenghts of linear mixed models is the flexibility to specify variance-covariance structures at all levels. In general, variance structures of mixed models can be seen as tensor (kronecker) products of multiple variance-covariance stuctures. For example, a multi-response model (i.e. 2 traits) where "g" individuals (i.e. 100 individuals) are tested in "e" treatments (i.e. 3 environments), the variance-covariance for the random effect "individuals" can be seen as the following multiplicative model: T $\otimes$ G $\otimes$ A where: $$\mathbf{T} = \left[\begin{array} {rr} {\sigma^2_{g_{_{t1,t1}}}} & {\sigma_{g_{_{t1,t2}}}} \\ {\sigma_{g_{_{t2,t1}}}} & {\sigma^2_{g_{_{t2,t2}}}} \\ \end{array}\right] $$ is the covariance structure for individuals among traits. $$\mathbf{G} = \left[\begin{array} {rrr} {\sigma^2_{g_{_{e1,e1}}}} & {\sigma_{g_{_{e1,e2}}}} & {\sigma_{g_{_{e1,e3}}}} \\ {\sigma_{g_{_{e2,e1}}}} & {\sigma^2_{g_{_{e2,e2}}}} & {\sigma_{g_{_{e2,e3}}}} \\ {\sigma_{g_{_{e3,e1}}}} & {\sigma_{g_{_{e3,e2}}}} & {\sigma^2_{g_{_{e3,e3}}}} \\ \end{array}\right] $$ is the covariance structure for individuals among environments. and $A$ is a square matrix representing the covariance among the levels of the individuals (any known relationship matrix). The T and G covariance structures shown above are unknown matrices to be estimated whereas A is known. The T and G matrices shown above are called as unstructured (US) covariance matrices, although this type is just one example from several covariance structures that the linear mixed models enable. For example, other popular covariance structures are: Diagonal (DIAG) covariance structures $$\mathbf{\Sigma} = \left[\begin{array} {rrr} {\sigma^2_{g_{_{e1,e1}}}} & 0 & 0 \\ \vdots & \ddots & \vdots \\ 0 & 0 & {\sigma^2_{g_{_{ei,ei}}}} \\ \end{array}\right] $$ Compound simmetry (CS) covariance structures $$\mathbf{\Sigma} = \left[\begin{array} {rrrr} {\sigma^2_{g}} + {\sigma^2_{ge}} & {\sigma^2_{g}} & {\sigma^2_{g}} & {\sigma^2_{g}} \\ {\sigma^2_{g}} & {\sigma^2_{g}} + {\sigma^2_{ge}} & {\sigma^2_{g}} & {\sigma^2_{g}}\\ \vdots & \vdots & \ddots & \vdots\\ {\sigma^2_{g}} & {\sigma^2_{g}} & {\sigma^2_{g}} & {\sigma^2_{g}} + {\sigma^2_{ge}}\\ \end{array}\right] $$ First order autoregressive (AR1) covariance structures $$\mathbf{\Sigma} = \sigma^2 \left[\begin{array} {rrrr} 1 & {\rho} & {\rho^2} & {\rho^3} \\ {\rho} & 1 & {\rho} & {\rho^2}\\ {\rho^2} & {\rho} & 1 & {\rho} \\ {\rho^3} & {\rho^2} & {\rho} & 1 \\ \end{array}\right] $$ or the already mentioned Unstructured (US) covariance structures $$\mathbf{\Sigma} = \left[\begin{array} {rrr} {\sigma^2_{g_{_{e1,e1}}}} & {\sigma_{g_{_{e1,e2}}}} & {\sigma_{g_{_{e1,e3}}}} \\ \vdots & \ddots & \vdots \\ {\sigma_{g_{_{e3,e1}}}} & {\sigma_{g_{_{e3,e2}}}} & {\sigma^2_{g_{_{e3,e3}}}} \\ \end{array}\right] $$ among others. Sommer has the capabilities to fit some of these covariance structures in the mixed model machinery. ## SECTION 2: Different models enabled in sommer ### 1) Univariate homogeneous variance models This type of model refers to single response models where a variable of interest (i.e. genotypes) needs to be analyzed as interacting with a 2nd random effect (i.e. environments), but you assume that across environments the genotypes have the same variance component. This is the so-called compound symmetry (CS) model. ```{r} library(sommer) data(DT_example) DT <- DT_example ## solving for r records ans1r <- mmer(Yield~Env, random= ~ Name + Env:Name, rcov= ~ units, data=DT, verbose = FALSE) summary(ans1r)$varcomp ## solving for c coefficients ans1c <- mmec(Yield~Env, random= ~ Name + Env:Name, rcov= ~ units, data=DT, verbose = FALSE) summary(ans1c)$varcomp ``` ### 2) Univariate heterogeneous variance models Very often in multi-environment trials, the assumption that the genetic variance or the residual variance is the same across locations may be too naive. Because of that, specifying a general genetic component and a location specific genetic variance is the way to go. This requires a CS+DIAG model (also called heterogeneous CS model). ```{r} data(DT_example) DT <- DT_example ans2r <- mmer(Yield~Env, random= ~Name + vsr(dsr(Env),Name), rcov= ~ vsr(dsr(Env),units), data=DT, verbose = FALSE) summary(ans2r)$varcomp DT=DT[with(DT, order(Env)), ] ans2c <- mmec(Yield~Env, random= ~Name + vsc(dsc(Env),isc(Name)), rcov= ~ vsc(dsc(Env),isc(units)), data=DT, verbose = FALSE) summary(ans2c)$varcomp ``` As you can see the special function `atr` or `diag` can be used to indicate that there's a different variance for the genotypes in each environment. The same was done for the residual. The difference between `atr` and `diag` is that the `atr` function can be used to specify the levels or specific environments where the variance is different. ### 3) Unstructured variance models A more relaxed asumption than the CS+DIAG model is the unstructured model (US) which assumes that among the levels of certain factor (i.e. Environments) there's a covariance struture of a second random effect (i.e. Genotypes). This can be done in sommer using the `usr(.)` function: ```{r} data(DT_example) DT <- DT_example ans3r <- mmer(Yield~Env, random=~ vsr(usr(Env),Name), rcov=~vsr(dsr(Env),units), data=DT, verbose = FALSE) summary(ans3r)$varcomp DT=DT[with(DT, order(Env)), ] ans3c <- mmec(Yield~Env, random=~ vsc(usc(Env),isc(Name)), rcov=~vsc(dsc(Env),isc(units)), data=DT, verbose = FALSE) summary(ans3c)$varcomp ``` As can be seen the `usr(Env)` indicates that the genotypes (Name) can have a covariance structure among environments (Env). ### 4) Multivariate homogeneous variance models Currently there's a great push for multi-response models. This is motivated by the correlation that certain variables hide and that could benefit in the prediction perspective. In sommer to specify multivariate models the response requires the use of the `cbind()` function in the response, and the `usr(trait)`, `diag(trait)`, or `atr(trait)` functions in the random part of the model. ```{r} data(DT_example) DT <- DT_example DT$EnvName <- paste(DT$Env,DT$Name) DT$Yield <- as.vector(scale(DT$Yield)) DT$Weight <- as.vector(scale(DT$Weight)) ans4r <- mmer(cbind(Yield, Weight) ~ Env, random= ~ vsr(Name, Gtc=unsm(2)), rcov= ~ vsr(units, Gtc=diag(2)), data=DT, verbose = FALSE) summary(ans4r)$varcomp DT2 <- reshape(DT, idvar = c("Name","Env","Block"), varying = list(6:7), v.names = "y", direction = "long", timevar = "trait", times = colnames(DT)[6:7]) DT2$trait <- as.factor(DT2$trait) # ans4c <- mmec(y ~ Env:trait, # random= ~ vsc(usc(trait),isc(Name)), # rcov= ~ vsc(dsc(trait),isc(units)), returnParam = T, # data=DT2, verbose = T) # summary(ans4c)$varcomp ``` You may notice that we have added the `usr(trait)` behind the random effects. This is to indicate the structure that should be assumed in the multivariate model. The `diag(trait)` used behind a random effect (i.e. Name) indicates that for the traits modeled (Yield and Weight) there's no covariance component and should not be estimated, whereas `usr(trait)` assumes that for such a random effect, there's a covariance component to be estimated (i.e. covariance between Yield and Weight for the random effect Name). The same applies for the residual part (rcov). Any number of random effects can be specified with different structures. ### 5) Details on special functions for variance models #### the vsr() function for special variance models and its auxiliaries The sommer function `vsr()` allows constructing complex variance models that are passed to the `mmer()` function which constitutes one of the most important features of the sommer package. The specification of the `vsr()` function has the form: `random=~vsr(..., Gu, Gti, Gtc)` The idea is that the `vsr()` function reflects the special variance structure that each random effect could have in the matrix notation: $var(u) = T \bigotimes E \bigotimes ... \bigotimes A$ where the ... argument in the `vsr()` function is used to specify the kronecker products from all matrices that form the variance for the random effect , where the auxiliary functions `dsr()`, `usr()`, `csr()`, `atr()`, can be used to define such structured variance structures. The idea is that a variance model for a random effect x (i.e. individuals) might require a more flexible model than just: `random=~x` For example, if individuals are tested in different time-points and environments, we can assume a different variance and covariance components among the individuals in the different environment-timepoint combinations. An example of variance structure of the type: $var(u) = T \bigotimes E \bigotimes S \bigotimes A$ would be specified in the `vsr()` function as: `random=~vsr(usr(e),usr(s),x, Gu=A, Gtc=T)` where the `e` would be a column vector in a data frame for the environments, `s` a column vector in the dataframe for the time points, `x` is the vector in the datrame for the identifier of individuals, `A` is a known square variance covariance matrix among individuals (usually an identity matrix; default if not specified), and T is a square matrices with as many rows and columns as the number of traits that specifies the trait covariance structure. The auxiliary functions to build the variance models for the random effect are: + dsr() diagonal covariance structure + usr() unstructured covariance + atr() specific levels heterogeneous variance structure + csr() customized covariance structure #### the vsc() function for special variance models and its auxiliaries The sommer function `vsc()` allows constructing complex variance models that are passed to the `mmec()` function which constitutes one of the most important features of the sommer package. The specification of the `vsc()` function has the form: `random=~vsc(..., Gu)` where the different variance structures are input as before but the constraints are specified within the auxiliar function with the argument theta and thetaC. Please notice that all functions for c coefficients end in the letter c instead or r. #### dsr() and dsc() to specify a diagonal (DIAG) covariance structures A diagonal covariance structure looks like this: $$\mathbf{\Sigma} = \left[\begin{array} {rrr} {\sigma^2_{g_{_{e1,e1}}}} & 0 & 0 \\ \vdots & \ddots & \vdots \\ 0 & 0 & {\sigma^2_{g_{_{ei,ei}}}} \\ \end{array}\right] $$ Considering an example for one random effect (g; indicating i.e. individuals) evaluated in different treatment levels (e; indicating i.e. the different treatments) the model would look like: `random=~vsr(dsr(e),g)` for r records `random=~vsc(dsc(e),isc(g))` for c coefficients #### usr() and usc() to specify an unstructured (US) covariance An unstructured covariance looks like this: $$\mathbf{G} = \left[\begin{array} {rrr} {\sigma^2_{g_{_{e1,e1}}}} & {\sigma_{g_{_{e1,e2}}}} & {\sigma_{g_{_{e1,e3}}}} \\ {\sigma_{g_{_{e2,e1}}}} & {\sigma^2_{g_{_{e2,e2}}}} & {\sigma_{g_{_{e2,e3}}}} \\ {\sigma_{g_{_{e3,e1}}}} & {\sigma_{g_{_{e3,e2}}}} & {\sigma^2_{g_{_{e3,e3}}}} \\ \end{array}\right] $$ Considering same example for one random effect (g; indicating i.e. individuals) evaluated in different treatment levels (e; indicating i.e. the different treatments) the model would look like: `random=~vsr(usr(e),g)` for r records `random=~vsc(usc(e),isc(g))` for c coefficients #### atr() and atc() to specify a level-specific heterogeneous variance A diagonal covariance structure for specific levels of the second random effect looks like this: $$\mathbf{\Sigma} = \left[\begin{array} {rrr} {\sigma^2_{g_{_{e1,e1}}}} & 0 & 0 \\ \vdots & \ddots & \vdots \\ 0 & 0 & {\sigma^2_{g_{_{ei,ei}}}} \\ \end{array}\right] $$ Considering the same example for one random effect (g; indicating i.e. individuals) evaluated in different treatment levels (e; indicating i.e. the different treatments A,B,C) the model would look like: `random=~vsr(atr(e,c("A","B")),g)` `random=~vsc(atc(e,c("A","B")),isc(g))` where the variance component for g is only fitted at levels A and B. #### csr() and csc() to specify a level-specific variance-covariance structure A customized covariance structure for specific levels of the second random effect (variance and covariances) looks i.e. like this: $$\mathbf{\Sigma} = \left[\begin{array} {rrr} {\sigma^2_{g_{_{e1,e1}}}} & {\sigma_{g_{_{e1,e2}}}} & 0 \\ \vdots & \ddots & \vdots \\ 0 & 0 & {\sigma^2_{g_{_{ei,ei}}}} \\ \end{array}\right] $$ Considering same example for one random effect (g; indicating i.e. individuals) evaluated in different treatment levels (e; indicating i.e. the different treatments A,B,C) the model would look like: `random=~vsr(csr(e,c("A","B")),g)` `random=~vsc(csc(e,c("A","B")),isc(g))` where `mm` indicates which variance and covariance components are estimated for `g`. ### 6) The specification of multi-trait constraints in the variance components for the vsr function (Gtc argument) One of the major strengths of sommer is its extreme flexibility to specify variance-covariance structures in the multi-trait framework. Since sommer 3.7 this is easily achieved by the use of the `vsr()` function and it's argument `Gtc`. The `Gtc` argument expects a matrix of constraints for the variance-covariance components for the random effect filled with numbers according to the following rules: + 0: parameter not to be estimated + 1: estimated and constrained to be positive + 2: estimated and unconstrained + 3: not to be estimated but fixed value provided in `Gti` Some useful functions to specify quickly the contraint matrices are `unsm()` for unstructured, `fixm()` for fixed constraint, and `fcm()` for fixed effect constraints. Consider a multi-trait model with 4 traits (y1,...,y4) and 1 random effect (u) and 1 fixed effect (x) `fixed=cbind(y1,y2,y3,y4)~x` `random= ~vsr(u, Gtc=?)` The constraint for the 4 x 4 matrix of variance covariance components to be estimated can be an: a) unstructured (variance components have to be positive and covariances either positive or negative) `random= ~vsr(u, Gtc=unsm(4))` and `random= ~vsr(dsc(x,thetaC=unsm(4)),isc(u))` ```{r} unsm(4) ``` c) fixed (variance or covariance components indicated with a 3 are considered fixed and values are provided in the Gti argument) `random= ~vsr(u, Gtc=fixm(4), Gti=mm)` and `random= ~vsr(dsc(x,thetaC=fixm(4)),isc(u))` ```{r} fixm(4) ``` where mm is a 4 x 4 matrix with initial values for the variance components. d) constraints for fixed effects `fixed= cbind(y1,y2,y3,y4)~vsr(x, Gtc=fcm(c(1,0,1,0)))` ```{r} fcm(c(1,0,1,0)) ``` where 1's and 0's indicate the traits where the fixed effect will be estimated (1's) and where it won't (0's). ### 7) Special functions for special models #### Random regression models In order to fit random regression models the user can use the `leg()` function to fit Legendre polynomials. This can be combined with other special covariance structures such as `dsr() / dsc()`, `usr() / usc()`, etc. ```{r} library(orthopolynom) data(DT_legendre) DT <- DT_legendre mRR2r<-mmer(Y~ 1 + Xf , random=~ vsr(usr(leg(X,1)),SUBJECT) , rcov=~vsr(units) , data=DT, verbose = FALSE) summary(mRR2r)$varcomp mRR2c<-mmec(Y~ 1 + Xf , random=~ vsc(usc(leg(X,1)),isc(SUBJECT)) , rcov=~vsc(isc(units)) , data=DT, verbose = FALSE) summary(mRR2c)$varcomp ``` Here, a numeric covariate X is used to explain the trajectory of the SUBJECTs and combined with an unstructured covariance matrix. The details can be found in the theory. #### GWAS models Although genome-wide association studies can be conducted through a variety of approaches, the use of mixed models to find associations between markers and phenotypes still one of the most popular approaches. One of the most classical and popular approaches is to test marker by marker trough mixed modeling (1 model by marker) to obtain the marker effect and a statistic reflecting the level of association usually provided as the -log10 p-value. The second most popular approach is to assume that the genetic variance component is similar for all markers and therefore the variance components are only estimated once (1 model for all markers) and use the inverse of the phenotypic variance matrix (V.inverse) to test all markers in the generalized linear model $b=(XV-X)-XV-y$. This makes the GWAS much faster and efficient without major loses. Given the straightforward extension, sommer provides the `GWAS` function which can fit both type of approaches (be aware that these are two methods among many existant in the literature) in univariate and multivariate models, that way genetically correlated traits can be tested together to increase the power of detection. In summary the GWAS model implemented in sommer to obtain marker effect is a generalized linear model of the form: b = (X'V-X)X'V-y with X = ZMi where: - b is the marker effect (dimensions 1 x mt) - y is the response variable (univariate or multivariate) (dimensions 1 x nt) - V- is the inverse of the phenotypic variance matrix (dimensions nt x nt) - Z is the incidence matrix for the random effect selected (gTerm argument) to perform the GWAS (dimensions nt x ut) Mi is the ith column of the marker matrix (M argument) (dimensions u x m) for t traits, n observations, m markers and u levels of the random effect. Depending if P3D is TRUE or FALSE the V- matrix will be calculated once and used for all marker tests (P3D=TRUE) or estimated through REML for each marker (P3D=FALSE). Here we show a simple GWAS model for an univariate example. ```{r} data(DT_cpdata) DT <- DT_cpdata GT <- GT_cpdata MP <- MP_cpdata #### create the variance-covariance matrix A <- A.mat(GT) # additive relationship matrix #### look at the data and fit the model head(DT,3) head(MP,3) GT[1:3,1:4] mix1 <- GWAS(color~1, random=~vsr(id,Gu=A) + Rowf + Colf, rcov=~units, data=DT, nIters=3, M=GT, gTerm = "u:id", verbose = FALSE) ms <- as.data.frame(mix1$scores) ms$Locus <- rownames(ms) MP2 <- merge(MP,ms,by="Locus",all.x = TRUE); manhattan(MP2, pch=20,cex=.5, PVCN = "color") ``` Be aware that the marker matrix `M` has to be imputed (no missing data allowed) and make sure that the number of rows in the M matrix is equivalent to the levels of the gTerm specified (i.e. if the gTerm is "id" and has 300 levels or in other words 300 individuals, then M has dimensions 300 x m, where m is the number of markers). #### Overlayed models [the overlay() function] Another very useful function is the `overlay` function, which allows for the overlay of matrices of different random effects and estimate a single variance component for the overlayed terms. ```{r} data("DT_halfdiallel") DT <- DT_halfdiallel head(DT) DT$femalef <- as.factor(DT$female) DT$malef <- as.factor(DT$male) DT$genof <- as.factor(DT$geno) #### model using overlay modhr <- mmer(sugar~1, random=~vsr(overlay(femalef,malef)) + genof, data=DT,verbose = FALSE) modhc <- mmec(sugar~1, random=~vsc(isc(overlay(femalef,malef, sparse = TRUE))) + genof,data=DT,verbose = FALSE) ``` here the `femalef` and `malef` random effects are overlayed becoming a single random effect that has the same variance component. #### Spatial models (using the 2-dimensional spline) We will use the CPdata to show the use of 2-dimensional splines for accomodating spatial effects in field experiments. In early generation variety trials the availability of seed is low, which makes the use of an unreplicated design a neccesity more than anything else. Experimental designs such as augmented designs and partially-replicated (p-rep) designs have become more common these days. In order to do a good job modeling the spatial trends happening in the field, special covariance structures have been proposed to accomodate such spatial trends (i.e. autoregressive residuals; ar1). Unfortunately, some of these covariance structures make the modeling rather unstable. More recently other research groups have proposed the use of 2-dimensional splines to overcome such issues and have a more robust modeling of the spatial terms (Lee et al. 2013; Rodríguez-Álvarez et al. 2018). In this example we assume an unreplicated population where row and range information is available which allows usr to fit a 2 dimensional spline model. ```{r} data(DT_cpdata) DT <- DT_cpdata GT <- GT_cpdata MP <- MP_cpdata ### mimic two fields A <- A.mat(GT) mix <- mmer(Yield~1, random=~vsr(id, Gu=A) + vsr(Rowf) + vsr(Colf) + spl2Da(Row,Col), rcov=~vsr(units), nIters=3, data=DT,verbose = FALSE) summary(mix) ``` Notice that the job is done by the `spl2Da()` function that takes the `Row` and `Col` information to fit a spatial kernel. #### Customized random effects One of the most powerful features of sommer is the ability to provide any customized matrix and estimate any random effect. For example: ```{r} data(DT_cpdata) DT <- DT_cpdata GT <- GT_cpdata MP <- MP_cpdata #### look at the data and fit the model mix1 <- mmer(Yield~1, random=~vsr(list(GT)), rcov=~units, nIters=3, data=DT,verbose = FALSE) ``` the matrix GT is provided as a random effect by encapsulating the matrix in a list and provided in the `vsr()` function. ### 10) Genomic selection In this section I decided to show the way you can fit an rrBLUP and GBLUP model in sommer using some wheat example data from CIMMYT in the genomic selection framework. This is the case of prediction of specific individuals within a population. It basically uses a similar model of the form:
$y = X\beta + Zu + \epsilon$
and takes advantage of the variance covariance matrix for the genotype effect known as the additive relationship matrix (A) and calculated using the `A.mat` function to establish connections among all individuals and predict the BLUPs for individuals that were not measured. In case the interest is to get BLUPs for markers the random effect is the actual marker matrix and the relationship among markers can be specified as well but in this example is assumed to be diagonal. ```{r} data(DT_wheat) DT <- DT_wheat GT <- GT_wheat colnames(DT) <- paste0("X",1:ncol(DT)) DT <- as.data.frame(DT);DT$id <- as.factor(rownames(DT)) # select environment 1 rownames(GT) <- rownames(DT) K <- A.mat(GT) # additive relationship matrix colnames(K) <- rownames(K) <- rownames(DT) # GBLUP pedigree-based approach set.seed(12345) y.trn <- DT vv <- sample(rownames(DT),round(nrow(DT)/5)) y.trn[vv,"X1"] <- NA ## GBLUP ans <- mmer(X1~1, random=~vsr(id,Gu=K), rcov=~units, nIters=3, data=y.trn,verbose = FALSE) # kinship based ans$U$`u:id`$X1 <- as.data.frame(ans$U$`u:id`$X1) rownames(ans$U$`u:id`$X1) <- gsub("id","",rownames(ans$U$`u:id`$X1)) cor(ans$U$`u:id`$X1[vv,],DT[vv,"X1"], use="complete") ## rrBLUP ans2 <- mmer(X1~1, random=~vsr(list(GT), buildGu = FALSE), rcov=~units, getPEV = FALSE, nIters=3, data=y.trn,verbose = FALSE) # kinship based u <- GT %*% as.matrix(ans2$U$`u:GT`$X1) # BLUPs for individuals rownames(u) <- rownames(GT) cor(u[vv,],DT[vv,"X1"]) # same correlation # the same can be applied in multi-response models in GBLUP or rrBLUP # same can be achieved with the mmec function (see ?DT_wheat) ``` ### 11) Likelihood ratio tests The Likelihood ratio test (LRT) is a good way to investigate the significance of random effects or specific variance-covariance components. #### 11.1) Testing the significance of a variance component For example, imagine that a researcher would like to know if his model improves when adding the effect of a spatial kernel to capture the spatial trend in the field, his base model may look like this: ```{r} data(DT_cpdata) DT <- DT_cpdata GT <- GT_cpdata MP <- MP_cpdata ### mimic two fields A <- A.mat(GT) mix1 <- mmer(Yield~1, random=~vsr(id, Gu=A) + vsr(Rowf) + vsr(Colf), rcov=~vsr(units), nIters=3, data=DT, verbose = FALSE) ``` And the model with the spatial kernel is the following: ```{r} mix2 <- mmer(Yield~1, random=~vsr(id, Gu=A) + vsr(Rowf) + vsr(Colf) + spl2Da(Row,Col), rcov=~vsr(units), nIters=3, data=DT,verbose = FALSE) ``` Then to test if the second model brings value let usr fit the likelihood ratio test as follows: ```{r} lrt <- anova(mix1, mix2) ``` As can be seen the test turns out to not be very significant despite the increase in the likelihood. #### 11.2) Testing the significance of a covariance component Sometimes the researcher is more interested in knowing if a covariance structure is relevant or not. Assume we have two multi-trait models, 1) fitting no-covariance (independent) among traits, and 2) one fitting the genetic covariance among yield and color in the following population: ```{r} data(DT_example) DT <- DT_example DT$EnvName <- paste(DT$Env,DT$Name) modelBase <- mmer(cbind(Yield, Weight) ~ Env, random= ~ vsr(Name, Gtc=diag(2)), # here is diag() rcov= ~ vsr(units, Gtc=unsm(2)), nIters=3, data=DT,verbose = FALSE) modelCov <- mmer(cbind(Yield, Weight) ~ Env, random= ~ vsr(usr(Env),Name, Gtc=unsm(2)), # here is unsm() rcov= ~ vsr(dsr(Env),units, Gtc=unsm(2)), nIters=3, data=DT,verbose = FALSE) lrt <- anova(modelBase, modelCov) ``` As can be seen, in this case fitting the covariance among the genotypes improves the model fit considerably and the probablity of the Chi-square distribution is < 0.05. Then you the model with the covariance is the preferred model. ## SECTION 3: The predict function ### 1) Background on prediction In a linear mixed model where y is an nx1 vector of observations, the linear mixed model can be written as: $y = X\tau + Zu + e = W \beta + e$ where $\tau$ is a vector tx1 of fixed effects, X is an nxt design matrix which associates observations with the appropriate combinations of fixed effects, u is the qx1 vector of random effects, Z is the nxq matrix which associates observations with the appropriate random effects, and e is the nx1 vector of residual errors. For shorthand W and $\beta$ represent the combined design matrix and vector of efects, respectively. It is assumed: $$ \left[\begin{array} {r} u \\ e \\ \end{array}\right] = N(\left[\begin{array} {r} 0 \\ 0 \\ \end{array}\right] , \left[\begin{array} {rr} G{(\gamma)} & 0 \\ 0 & R{(\phi)} \\ \end{array}\right]) $$ where the covariance matrices G and R for random effects and residuals are functions of parameters $\gamma$ and $\phi$ respectively. The covariance matrix of data can be written as: $var(y) = \sigma^2 (ZGZ'+R)$ The variance parameters $\gamma$ and $\phi$ are usually estimated by maximum likelihood or REML (Patterson and Thompson, 1971). The BLUP of a linear combination for known D, $\gamma$, and $\phi$ is then: $$\mathbf{D\overline{\beta}} = \left[\begin{array} {rr} D_{\tau} & D_u \\ \end{array}\right] \left[\begin{array} {r} \overline{\tau} \\ \overline{u} \\ \end{array}\right] = {D \overline{\tau}} + {D_{u}\overline{u}} $$ where $\overline{\beta} = (\tau, u)$ is the solution of the mixed moel equations: $$ \left[\begin{array} {rr} X^TR^{-1}X & X^TR^{-1}Z \\ Z^TR^{-1}X & Z^TR^{-1}Z + G^{-1} \\ \end{array}\right] \left[\begin{array} {r} \tau \\ u \\ \end{array}\right] = \left[\begin{array} {r} X^TR^{-1}y \\ Z^TR^{-1}X \\ \end{array}\right] $$ These can also be written as $C \overline{\beta} = W^TR^{-1}y$. The $\overline{\tau}$ is the best linear unbiased estimator (BLUE) of $\tau$ and $\overline{u}$ is the best linear unbiased predictor (BLUP) of $u$, this with variance $var(\overline{\beta}-\beta)=C^{-1}$. Consideration of the values required to form the confidence intervals make it clear that is the prediction error variance (PEV or $var(\overline{\beta}-\beta)$), rather than the variance of the estimator $var(\overline{\beta})$ that is usually of interest. The prediction error variance for a linear combination $D\overline{\beta}$ is then $DC^{-1}D^T$. Since the variance parameters are unknown, we replace the unknown variance parameters by their REML estimates and use the empirical values. The `predict()` function of the sommer package then builts the $C^{-1}$ matrix from the mixed model equations and the $D$ matrix of linear combinations to obtain the PEV and SEs for the predictions. For the means it uses the $D$ matrix of linear combinations times the vector of required fixed and random effects $X\overline{\tau} + Z \overline{u}$ or $D\overline{\beta}$, where the matrix D is a linear combination of specific fixed and/or random effects from matrices $X$ and/or $Z$. ### 2) Predicting means The sommer package is equiped with a `predict()` function that can be used to calculate means and standard errors for fixed and random effects specified in the models fitted with the `mmer()` and `mmec` functions. Using the yatesoats dataset we will will fit some fixed and random effects. ```{r} library(sommer) data(DT_yatesoats) DT <- DT_yatesoats m3 <- mmer(fixed=Y ~ V + N + V:N, random = ~ B + B:MP, rcov=~units, data = DT, verbose=FALSE) summary(m3)$varcomp ``` Now, the model can be used together with the `classify` argument to obtain means for the `classify` argument. For example, the model includes in the fixed formula the terms "V" for variety, "N" for nitrogen treatments, and "V:N" for the interaction between variety and nitrogen. The `classify` argument can be used to specify the term for which the means are desired. In the following example the means for the nitrogen treatments are obtained as follows: ```{r} Dt <- m3$Dtable; Dt # first fixed effect just average Dt[1,"average"] = TRUE # second fixed effect include Dt[2,"include"] = TRUE # third fixed effect include and average Dt[3,"include"] = TRUE Dt[3,"average"] = TRUE Dt pp=predict(object=m3, Dtable=Dt, D="N") pp$pvals ``` ## Literature Covarrubias-Pazaran G. 2016. Genome assisted prediction of quantitative traits using the R package sommer. PLoS ONE 11(6):1-15. Covarrubias-Pazaran G. 2018. Software update: Moving the R package sommer to multivariate mixed models for genome-assisted prediction. doi: https://doi.org/10.1101/354639 Bernardo Rex. 2010. Breeding for quantitative traits in plants. Second edition. Stemma Press. 390 pp. Gilmour et al. 1995. Average Information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51(4):1440-1450. Henderson C.R. 1975. Best Linear Unbiased Estimation and Prediction under a Selection Model. Biometrics vol. 31(2):423-447. Kang et al. 2008. Efficient control of population structure in model organism association mapping. Genetics 178:1709-1723. Lee, D.-J., Durban, M., and Eilers, P.H.C. (2013). Efficient two-dimensional smoothing with P-spline ANOVA mixed models and nested bases. Computational Statistics and Data Analysis, 61, 22 - 37. Lee et al. 2015. MTG2: An efficient algorithm for multivariate linear mixed model analysis based on genomic information. Cold Spring Harbor. doi: http://dx.doi.org/10.1101/027201. Maier et al. 2015. Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am J Hum Genet; 96(2):283-294. Rodriguez-Alvarez, Maria Xose, et al. Correcting for spatial heterogeneity in plant breeding experiments with P-splines. Spatial Statistics 23 (2018): 52-71. Searle. 1993. Applying the EM algorithm to calculating ML and REML estimates of variance components. Paper invited for the 1993 American Statistical Association Meeting, San Francisco. Yu et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Genetics 38:203-208. Tunnicliffe W. 1989. On the use of marginal likelihood in time series model estimation. JRSS 51(1):15-27.