--- title: "Mixed Models and Smoothing" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: false number_sections: true bibliography: bibliography.bib link-citations: yes vignette: > %\VignetteIndexEntry{Mixed Models and Smoothing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.dim = c(7, 4) ) ``` # The LMMsolver package {.unnumbered} The aim of the `LMMsolver` package is to provide an efficient and flexible system to estimate variance components using restricted maximum likelihood or REML [@Patterson1971], for models where the mixed model equations are sparse. An important feature of the package is smoothing with P-splines [@Eilers1996]. The sparse mixed model P-splines formulation [@boer2023] is used, which makes the computations fast. A Linear Mixed Model (LMM) has the form $$ y = X \beta + Z u + e, \quad u \sim N(0,G), \quad e \sim N(0,R) \;, $$ where $y$ is a vector of observations, $\beta$ is a vector with the fixed effects, $u$ is a vector with the random effects, and $e$ a vector of random residuals. $X$ and $Z$ are design matrices. The `LMMsolver` package can fit models where the matrices $G^{-1}$ and $R^{-1}$ are a linear combination of precision matrices $Q_{G,i}$ and $Q_{R,i}$: $$ G^{-1} = \sum_{i} \psi_i Q_{G,i} \;, \quad R^{-1} = \sum_{i} \phi_i Q_{R,i} \;, $$ where the precision parameters $\psi_i$ and $\phi_i$ are estimated using REML. For most standard mixed models $1/{\psi_i}$ are the variance components and $1/{\phi_i}$ the residual variances. We use a formulation in terms of precision parameters to allow for non-standard mixed models using tensor product splines introduced in @Rodriguez-Alvarez2015. If the matrices $G^{-1}$ and $R^{-1}$ are sparse, the mixed model equations can be solved using efficient sparse matrix algebra implemented in the `spam` package [@Furrer2010]. To calculate the derivatives of the log-likelihood in an efficient way, the automatic differentiation of the Cholesky matrix [@Smith1995;@boer2023] was implemented in C++ using the `Rcpp` package [@Eddelbuettel2018]. # Introduction The purpose of this section is to give users an easy introduction, starting from simple linear regression. Based on simulations we will explain the main functions, the input and the output. First we load the `LMMsolver` and `ggplot2` packages: ```{r setup} library(LMMsolver) library(ggplot2) ``` ## Linear Regression We will start with a simple example where the true function is linear in variable $x$: ```{r linearFun} f1 <- function(x) { 0.6 + 0.7*x} ``` Using this function we simulate data and add normal distributed noise: ```{r linearFunSim} set.seed(2016) n <- 25 x <- seq(0, 1, length = n) sigma2e <- 0.04 y <- f1(x) + rnorm(n, sd = sqrt(sigma2e)) dat1 <- data.frame(x = x, y = y) ``` We can fit the data using the `LMMsolve` function: ```{r fit1} obj1 <- LMMsolve(fixed = y ~ x, data = dat1) ``` We can make predictions using the `predict()` function: ```{r predictlin} newdat <- data.frame(x = seq(0, 1, length = 300)) pred1 <- predict(obj1, newdata = newdat, se.fit = TRUE) # adding the true values for comparison pred1$y_true <- f1(pred1$x) ``` Note that for this linear model we could have used the standard `lm()` function, which will give the same result. The following plot gives the simulated data with the predictions, and pointwise standard-error bands. The true value is plotted as dashed red line. ```{r ggplotLinear} ggplot(data = dat1, aes(x = x, y = y)) + geom_point(col = "black", size = 1.5) + geom_line(data = pred1, aes(y=y_true), color = "red", linewidth = 1, linetype = "dashed") + geom_line(data = pred1, aes(y = ypred), color = "blue", linewidth = 1) + geom_ribbon(data = pred1, aes(x=x,ymin = ypred-2*se, ymax = ypred+2*se), alpha = 0.2, inherit.aes = FALSE) + theme_bw() ``` ## Fitting a non-linear function In this section we will use the following non-linear function for the simulations: ```{r nonlinearFun} f2 <- function(x) { 0.3 + 0.4*x + 0.2*sin(20*x) } ``` The simulated data is generated by the following code ```{r simDat2} set.seed(12) n <- 150 x <- seq(0, 1, length = n) sigma2e <- 0.04 y <- f2(x) + rnorm(n, sd = sqrt(sigma2e)) dat2 <- data.frame(x, y) ``` We can use the `spline` argument to fit the non-linear trend: ```{r nonlinearFit} obj2 <- LMMsolve(fixed = y ~ 1, spline = ~spl1D(x, nseg = 50), data = dat2) ``` where `spl1D(x, nseg = 50)` defines a mixed model P-splines with 50 segments. The model fit can be summarized in terms of effective dimensions: ```{r summary_nonlinear} summary(obj2) ``` The intercept and the slope `lin(x)` define the linear (or fixed) part of the model, the non-linear (or random) part is defined by `s(x)`, with effective dimension `r round(obj2$EDdf[3,2],2)`. Making predictions on the interval $[0,1]$ and plotting can be done in the same way as for the linear regression example: ```{r predict_non} newdat <- data.frame(x = seq(0, 1, length = 300)) pred2 <- predict(obj2, newdata = newdat, se.fit = TRUE) pred2$y_true <- f2(pred2$x) ggplot(data = dat2, aes(x = x, y = y)) + geom_point(col = "black", size = 1.5) + geom_line(data = pred2, aes(y = y_true), color = "red", linewidth = 1, linetype ="dashed") + geom_line(data = pred2, aes(y = ypred), color = "blue", linewidth = 1) + geom_ribbon(data= pred2, aes(x=x, ymin = ypred-2*se, ymax = ypred+2*se), alpha=0.2, inherit.aes = FALSE) + theme_bw() ``` ## Non-Gaussian data The `LMMsolver` package can also be used for non-gaussian data, using the `family` argument, with default `family = gaussian`. As an example we use count data using the Poisson distribution, defined by $$ Pr(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} \;, $$ with parameter $\lambda > 0$ and $k$ is the number of occurrences. More general, the value of the parameter $lambda$ can depend on another variable time $x$, for example time. Here we will assume that $x$ is defined on the interval $[0,1]$ and defined by: $$ \lambda(x) = 4 + 3x + 4 \sin(7 x) $$ Using this function we simulate the following data ```{r poissonSim} set.seed(1234) n <- 150 x <- seq(0, 1, length=n) fun_lambda <- function(x) { 4 + 3*x + 4*sin(7*x) } x <- seq(0, 1, length = n) y <- rpois(n = n, lambda = fun_lambda(x)) dat3 <- data.frame(x = x, y = y) ``` Now we fit the data with the argument `family = poisson()`: ```{r Poisson_LMMsolver} obj3 <- LMMsolve(fixed = y ~ 1, spline = ~spl1D(x, nseg = 50), family = poisson(), data = dat3) summary(obj3) ``` Making predictions and plotting the data is similar to the Gaussian data we showed before: ```{r predict_poisson} newdat <- data.frame(x = seq(0, 1, length = 300)) pred3 <- predict(obj3, newdata = newdat, se.fit = TRUE) pred3$y_true <- fun_lambda(pred3$x) ggplot(data = dat3, aes(x = x, y = y)) + geom_point(col = "black", size = 1.5) + geom_line(data = pred3, aes(y = y_true), color = "red", linewidth = 1, linetype ="dashed") + geom_line(data = pred3, aes(y = ypred), color = "blue", linewidth = 1) + geom_ribbon(data= pred3, aes(x=x, ymin = ypred-2*se, ymax = ypred+2*se), alpha=0.2, inherit.aes = FALSE) + theme_bw() ``` ## Smoothing combining two experiments In this section we will give a bit more complicated example, to show some further options of `LMMsolver`. Suppose there are two experiments, A and B, with the same true unobserved non-linear function `f2(x)` as defined before. The simulated data is given by the following code: ```{r twoExperiments} set.seed(1234) nA <- 50 nB <- 100 mu_A <- 0.10 mu_B <- -0.10 sigma2e_A <- 0.04 sigma2e_B <- 0.10 x1 <- runif(n = nA) x2 <- runif(n = nB) y1 <- f2(x1) + rnorm(nA, sd = sqrt(sigma2e_A)) + mu_A y2 <- f2(x2) + rnorm(nB, sd = sqrt(sigma2e_B)) + mu_B Experiment <- as.factor(c(rep("A", nA), rep("B", nB))) dat4 <- data.frame(x = c(x1, x2), y = c(y1,y2), Experiment = Experiment) ``` Before analyzing the data in further detail a boxplot gives some insight: ```{r boxplot} ggplot(dat4, aes(x = Experiment, y = y, fill = Experiment)) + geom_boxplot() + geom_point(position = position_jitterdodge(), alpha = 0.3) ``` Comparing the two experiments we can see that: 1. There is a clear difference in mean/median between the two experiments. This can be corrected for by adding the argument `random = ~Experiment`. 2. The variance in experiment A is smaller than in B. This implies that is important to allow for heterogeneous variances which can be modelled by defining `residual = ~Experiment`. The model in `LMMsolve()` is given by: ```{r twoExpSolve} obj4 <- LMMsolve(fixed= y ~ 1, spline = ~spl1D(x, nseg = 50), random = ~Experiment, residual = ~Experiment, data = dat4) ``` The table of effective dimensions is given by: ```{r summaryExp} summary(obj4) ``` And making the predictions: ```{r twoExpPredict} newdat <- data.frame(x=seq(0, 1, length = 300)) pred4 <- predict(obj4, newdata = newdat, se.fit = TRUE) pred4$y_true <- f2(pred4$x) ggplot(data = dat4, aes(x = x, y = y, colour = Experiment)) + geom_point(size = 1.5) + geom_line(data = pred4, aes(y = y_true), color="red", linewidth = 1, linetype = "dashed") + geom_line(data = pred4, aes(y = ypred), color = "blue", linewidth = 1) + geom_ribbon(data = pred4, aes(x = x,ymin = ypred-2*se, ymax = ypred+2*se), alpha = 0.2, inherit.aes = FALSE) + theme_bw() ``` The estimated random effects for Experiment can be obtained using the `coef()` function: ```{r coefExp} coef(obj4)$Experiment ``` The sum of the effects is equal to zero, as expected for a standard random term. # Smooth trends in two dimensions For two-dimensional mixed P-splines as defined in @boer2023 we will use two examples. The first example is US precipitation data. The second example models a data set for Sea Surface Temperature (SST) described in @cressie2022. ## US precipitation example As a first example we use the `USprecip` data set in the spam package [@Furrer2010], analysed in @Rodriguez-Alvarez2015. ```{r USprecip data} ## Get precipitation data from spam data(USprecip, package = "spam") ## Only use observed data USprecip <- as.data.frame(USprecip) USprecip <- USprecip[USprecip$infill == 1, ] ``` The two-dimensional P-spline can be defined with the `spl2D()` function, and with longitude and latitude as covariates. The number of segments chosen here is equal to the number of segments used in @Rodriguez-Alvarez2015. ```{r runobj3} obj5 <- LMMsolve(fixed = anomaly ~ 1, spline = ~spl2D(x1 = lon, x2 = lat, nseg = c(41, 41)), data = USprecip) ``` The summary function gives a table with the effective dimensions and the penalty parameters: ```{r ED_USprecip} summary(obj5) ``` A plot for the smooth trend can be obtained in a similar way as for the one-dimensional examples, using the `predict()` function. First we make predictions on a regular two-dimensional grid: ```{r pred_USprecip} lon_range <- range(USprecip$lon) lat_range <- range(USprecip$lat) newdat <- expand.grid(lon = seq(lon_range[1], lon_range[2], length = 200), lat = seq(lat_range[1], lat_range[2], length = 300)) plotDat5 <- predict(obj5, newdata = newdat) ``` For plotting the predictions for USA main land we use the `maps` and `sf` packages: ```{r Plot_USprecip} plotDat5 <- sf::st_as_sf(plotDat5, coords = c("lon", "lat")) usa <- sf::st_as_sf(maps::map("usa", regions = "main", plot = FALSE)) sf::st_crs(usa) <- sf::st_crs(plotDat5) intersection <- sf::st_intersects(plotDat5, usa) plotDat5 <- plotDat5[!is.na(as.numeric(intersection)), ] ggplot(usa) + geom_sf(color = NA) + geom_tile(data = plotDat5, mapping = aes(geometry = geometry, fill = ypred), linewidth = 0, stat = "sf_coordinates") + scale_fill_gradientn(colors = topo.colors(100))+ labs(title = "Precipitation (anomaly)", x = "Longitude", y = "Latitude") + coord_sf() + theme(panel.grid = element_blank()) ``` ## Sea Surface Temperatures The second example using two-dimensional P-splines is for Sea Surface Temperatures (SST) data [@cressie2022]. In their study they compare a wide range of software packages to analyse the SST data. For the comparison they focus on a region of the ocean known as the Brazil-Malvinas confluence zone, an energetic region of the ocean just off the coast of Argentina and Uruguay, where the warm Brazil current and the cold Malvinas current meet [@cressie2022]. They divided the data within this region into a training and a testing data set, each consisting of approximately 8,000 observations. ```{r SeaSurfaceTemp} data(SeaSurfaceTemp) head(SeaSurfaceTemp, 5) table(SeaSurfaceTemp$type) ``` First we convert SST from Kelvin to Celsius and split the data in the training and test set: ```{r convert_and_split} # convert from Kelvin to Celsius df <- SeaSurfaceTemp df$sst <- df$sst - 273.15 ### split in training and test set df_train <- df[df$type == "train", ] df_test <- df[df$type == "test", ] ``` The next plot shows the raw data, using the same color palette as in @cressie2022. ```{r SST_raw_data} nasa_palette <- c( "#03006d","#02008f","#0000b6","#0001ef","#0000f6","#0428f6","#0b53f7", "#0f81f3","#18b1f5","#1ff0f7","#27fada","#3efaa3","#5dfc7b","#85fd4e", "#aefc2a","#e9fc0d","#f6da0c","#f5a009","#f6780a","#f34a09","#f2210a", "#f50008","#d90009","#a80109","#730005" ) map_layer <- geom_map( data = map_data("world"), map = map_data("world"), aes(group = group, map_id = region), fill = "black", colour = "white", linewidth = 0.1 ) # Brazil-Malvinas confluence zone BM_box <- cbind(lon = c(-60, -48), lat = c(-50, -35)) ggplot() + scale_colour_gradientn(colours = nasa_palette, name = expression(degree*C)) + xlab("Longitude (deg)") + ylab("Latitude (deg)") + map_layer + xlim(BM_box[, "lon"]) + ylim(BM_box[, "lat"]) + theme_bw() + coord_fixed(expand = FALSE) + geom_point(data = df_train, aes(lon, lat, colour = sst), size=0.5) ``` For this complicated data we need more segments for `spl2D()` as in the previous example, because of the strong local changes in Sea Surface Temperatures in this region. ```{r SeaTempAnalysis} obj6 <- LMMsolve(fixed = sst ~ 1, spline = ~spl2D(lon, lat, nseg = c(70, 70), x1lim = BM_box[, "lon"], x2lim = BM_box[, "lat"]), data = df_train, tolerance = 1.0e-1) summary(obj6) ``` The predictions on a grid are shown in the next figure ```{r predictions_grid} lon_range <- BM_box[, "lon"] lat_range <- BM_box[, "lat"] newdat <- expand.grid(lon = seq(lon_range[1], lon_range[2], length = 200), lat = seq(lat_range[1], lat_range[2], length = 200)) pred_grid <- predict(obj6, newdata = newdat, se.fit=TRUE) pred_grid <- pred_grid[pred_grid$se<5, ] ## Plot predictions on a grid ggplot(pred_grid) + geom_tile(aes(x = lon, y = lat, fill = ypred)) + scale_fill_gradientn(colours = nasa_palette) + labs( fill = "pred.", x = "Longitude (deg)", y = "Latitude (deg)" ) + map_layer + theme_bw() + coord_fixed(expand = FALSE, xlim = BM_box[, "lon"], ylim = BM_box[, "lat"]) + scale_x_continuous(breaks = c(-58, -54, -50)) ``` The standard errors for the predictions are in the column `se` in the data frame `pred_grid` and can be plotted using the following code: ```{r seSST} ## Plot standard error ggplot(pred_grid) + geom_raster(aes(x = lon, y = lat, fill = se)) + scale_fill_distiller(palette = "BrBG", direction = -1) + labs( fill = "s.e.", x = "Longitude (deg)", y = "Latitude (deg)") + map_layer + theme_bw() + coord_fixed(expand = FALSE, xlim = c(-60, -48), ylim = c(-50, -35)) + scale_x_continuous(breaks = c(-58, -54, -50)) ``` Predictions for the test set are given by ```{r test_set} pred_test <- predict(obj6, newdata = df_test) ggplot(pred_test, aes(x = sst,y = ypred)) + geom_point() + xlab("observed SST (Celsius)") + ylab("predicted SST (Celsius)") + geom_abline(intercept=0,slope=1,col='red') + theme_bw() ``` Calculation of the root mean squared prediction error (RMSPE) for the test set: ```{r RMSE_test} Y <- (pred_test$sst - pred_test$ypred)^2 RMSE <- sqrt(mean(Y)) round(RMSE, 2) ``` The RMSPE is in the same range (0.44-0.46) as for the software packages used in @cressie2022. On a standard desktop the calculations using `LMMsolver` take less than 10 seconds, taking advantage of the sparse structure of the P-splines mixed model [@boer2023]. # Examples from Quantitative Genetics In this section we will show some examples from quantitative genetics, to illustrate some further options of the package. ## Oats field trial As a first example we will use an oats field trial from the `agridat` package. There were 24 varieties in 3 replicates, each consisting of 6 incomplete blocks of 4 plots. The plots were laid out in a single row. ```{r oatsdata} ## Load data. data(john.alpha, package = "agridat") head(john.alpha) ``` We will use the Linear Variance (LV) model, which is closely connected to the P-splines model [@Boer2020]. First we need to define the precision matrix for the LV model, see Appendix in @Boer2020 for details: ```{r define LVmodel} ## Add plot as factor. john.alpha$plotF <- as.factor(john.alpha$plot) ## Define the precision matrix, see eqn (A2) in Boer et al (2020). N <- nrow(john.alpha) cN <- c(1 / sqrt(N - 1), rep(0, N - 2), 1 / sqrt(N - 1)) D <- diff(diag(N), diff = 1) Delta <- 0.5 * crossprod(D) LVinv <- 0.5 * (2 * Delta + cN %*% t(cN)) ## Add LVinv to list, with name corresponding to random term. lGinv <- list(plotF = LVinv) ``` Given the precision matrix for the LV model we can define the model in LMMsolve using the `random` and `ginverse` arguments: ```{r modelLV} obj7 <- LMMsolve(fixed = yield ~ rep + gen, random = ~plotF, ginverse = lGinv, data = john.alpha) ``` The absolute deviance ($-2*logL$) and variances for the LV-model are ```{r summary_dev_VAR} round(deviance(obj7, relative = FALSE), 2) summary(obj7, which = "variances") ``` as reported in @Boer2020, Table 1. ## Model biomass as function of time. In this section we show an example of mixed model P-splines to fit biomass as function of time. As an example we use wheat data simulated with the crop growth model APSIM. This data set is included in the package. For details on this simulated data see @Bustos-Korts2019. ```{r APSIMdat} data(APSIMdat) head(APSIMdat) ``` The first column is the environment, Emerald in 1993, the second column the simulated genotype (g001), the third column is days after sowing (das), and the last column is the simulated biomass with medium measurement error. The model can be fitted with ```{r APSIMmodel} obj8 <- LMMsolve(fixed = biomass ~ 1, spline = ~spl1D(x = das, nseg = 50), data = APSIMdat) ``` The effective dimensions are: ```{r APSIMmodelSummary} summary(obj8) ``` The fitted smooth trend can be obtained as explained before: ```{r APSIMplot} das_range <- range(APSIMdat$das) newdat <- data.frame(das=seq(das_range[1], das_range[2], length = 300)) pred8 <- predict(obj8, newdata = newdat, se.fit = TRUE) ggplot(data = APSIMdat, aes(x = das, y = biomass)) + geom_point(size = 1.0) + geom_line(data = pred8, aes(y = ypred), color = "blue", linewidth = 1) + geom_ribbon(data = pred8, aes(x = das,ymin = ypred-2*se, ymax = ypred+2*se), alpha = 0.2, inherit.aes = FALSE) + labs(title = "APSIM biomass as function of time", x = "days after sowing", y = "biomass (kg)") + theme_bw() ``` The growth rate (first derivative) as function of time can be obtained using `deriv = 1` in function `obtainSmoothTrend`: ```{r APSIMDeriv} plotDatDt <- obtainSmoothTrend(obj8, grid = 1000, deriv = 1) ggplot(data = plotDatDt, aes(x = das, y = ypred)) + geom_line(color = "red", linewidth = 1) + labs(title = "APSIM growth rate as function of time", x = "days after sowing", y = "growth rate (kg/day)") + theme_bw() ``` ## QTL mapping with IBD probabilities. In QTL-mapping for multiparental populations the Identity-By-Descent (IBD) probabilities are used as genetic predictors in the mixed model [@Li2021]. The following simulated example is for illustration. It consists of three parents (A, B, and C), and two crosses AxB, and AxC. AxB is a population of 100 Doubled Haploids (DH), AxC of 80 DHs. The probabilities, pA, pB, and pC, are for a position on the genome close to a simulated QTL. This simulated data is included in the package. ```{r multipop} ## Load data for multiparental population. data(multipop) head(multipop) ``` The residual (genetic) variances for the two populations can be different. Therefore we need to allow for heterogeneous residual variances, which can be defined by using the `residual` argument in `LMMsolve`: ```{r residualARG} ## Fit null model. obj9 <- LMMsolve(fixed = pheno ~ cross, residual = ~cross, data = multipop) dev0 <- deviance(obj9, relative = FALSE) ``` The QTL-probabilities are defined by the columns pA, pB, pC, and can be included in the random part of the mixed model by using the `group` argument: ```{r groupOPTION} ## Fit alternative model - include QTL with probabilities defined in columns 3:5 lGrp <- list(QTL = 3:5) obj10 <- LMMsolve(fixed = pheno ~ cross, group = lGrp, random = ~grp(QTL), residual = ~cross, data = multipop) dev1 <- deviance(obj10, relative = FALSE) ``` The approximate $-log10(p)$ value is given by ```{r approxPvalue} ## Deviance difference between null and alternative model. dev <- dev0 - dev1 ## Calculate approximate p-value. minlog10p <- -log10(0.5 * pchisq(dev, 1, lower.tail = FALSE)) round(minlog10p, 2) ``` The estimated QTL effects of the parents A, B, and C are given by: ```{r QTLeffects} coef(obj10)$QTL ``` # References