[R-SIG-Mac] M3 not working with torch

Mon May 20 14:48:59 CEST 2024

Dear R-SIG-MAC

I bought a new MacBook Air with the M3 chip, which has 8 CPUs, 10 GPUs, and 16GB of integrated memory. My R `torch` apps are crashing. I have assembled an MWE that works on other Mac architectures, including MacBook Air M1 and MacMini. The OS is the same (Sonoma 14.5). The MWE follows:

```{r}
# ==== MWE

# Download the training samples
rds_file <- "https://raw.githubusercontent.com/e-sensing/sitsdata/master/inst/extdata/torch/train_samples.rds?raw=true"
dest_file <- paste0(tempdir(),"/train_samples.rds")
download.file(rds_file,
              destfile = dest_file,
              method = "curl")
train_samples <- readRDS(dest_file)

# Sample labels
labels <- c("Cerrado", "Forest", "Pasture", "Soy_Corn")

# Create numeric labels vector
code_labels <- seq_along(labels)
names(code_labels) <- labels

# Split the data into training and validation data sets
# Create partitions for different splits of the input data
frac <- 0.2
train_samples <- dplyr::group_by(train_samples, .data[["label"]])
test_samples <- train_samples |>
    dplyr::slice_sample(prop = frac) |>
    dplyr::ungroup()

# Remove the lines used for validation
sel <- !train_samples[["sample_id"]] %in% test_samples[["sample_id"]]
train_samples <- train_samples[sel, ]

# Shuffle the data
train_samples <- train_samples[sample(nrow(train_samples), nrow(train_samples)), ]
test_samples <- test_samples[sample(nrow(test_samples), nrow(test_samples)), ]

# Organize data for model training
train_x <- as.matrix(train_samples[, -2:0])
train_y <- unname(code_labels[train_samples[["label"]]])

# Create the test data
test_x <- as.matrix(test_samples[, -2:0])
test_y <- unname(code_labels[test_samples[["label"]]])

# Set torch seed
torch::torch_manual_seed(sample.int(10^5, 1))

# Avoid a global variable for 'self'
self <- NULL

# function to create a simple sequential NN module
.torch_linear_relu_dropout <- torch::nn_module(
    classname = "torch_linear_batch_norm_relu_dropout",
    initialize = function(input_dim,
                          output_dim,
                          dropout_rate) {
        self$block <- torch::nn_sequential(
            torch::nn_linear(input_dim, output_dim),
            torch::nn_relu(),
            torch::nn_dropout(dropout_rate)
        )
    },
    forward = function(x) {
        self$block(x)
    }
)

# Define the MLP architecture
mlp_model <- torch::nn_module(
    initialize = function(num_pred, layers, dropout_rates, y_dim) {
        tensors <- list()
        # input layer
        tensors[[1]] <- .torch_linear_relu_dropout(
            input_dim = num_pred,
            output_dim = 512,
            dropout_rate = 0.40
        )
        # output layer
        tensors[[length(tensors) + 1]] <-
            torch::nn_linear(layers[length(layers)], y_dim)
        # add softmax tensor
        tensors[[length(tensors) + 1]] <- torch::nn_softmax(dim = 2)
        # create a sequential module that calls the layers in the same
        # order.
        self$model <- torch::nn_sequential(!!!tensors)
    },
    forward = function(x) {
        self$model(x)
    }
)
# Train the model using luz

torch_model <- luz::setup(
    module = mlp_model,
    loss = torch::nn_cross_entropy_loss(),
    metrics = list(luz::luz_metric_accuracy()),
    optimizer = torch::optim_adamw,
)
torch_model <- luz::set_hparams(
    torch_model,
    num_pred = ncol(train_x),
    layers = 512,
    dropout_rates = 0.3,
    y_dim = length(code_labels)
)
torch_model <- luz::set_opt_hparams(
    torch_model,
    lr = 0.001,
    eps = 1e-08,
    weight_decay = 1.0e-06
)
torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
  verbose = TRUE
)

```
The error occurs in the `luz::fit` function. Inside RStudio, the code gets stuck, and then RStudio asks to restart R. When running R from the terminal, the output is:
```{r}
 *** caught bus error ***
address 0x16daa0000, cause 'invalid alignment'

 *** caught segfault ***
address 0x9, cause 'invalid permissions'
zsh: segmentation fault  R
```
The `sessionInfo()` output is as follows:

```{r}

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         zeallot_0.1.0    
 [5] rlang_1.1.3       processx_3.8.4    generics_0.1.3    torch_0.12.0.9000
 [9] coro_1.0.4        glue_1.7.0        bit_4.0.5         prettyunits_1.2.0
[13] luz_0.4.0         ps_1.7.6          hms_1.1.3         fansi_1.0.6      
[17] tibble_3.2.1      progress_1.2.3    lifecycle_1.0.4   compiler_4.4.0   
[21] dplyr_1.1.4       fs_1.6.4          Rcpp_1.0.12       pkgconfig_2.0.3  
[25] rstudioapi_0.16.0 R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4       
[29] pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    tools_4.4.0      
[33] bit64_4.0.5     
```

Any clues will be most appreciated. 

Thanks
Gilberto
============================
Prof Dr Gilberto Camara
Senior Researcher
National Institute for Space Research (INPE), Brazil
https://gilbertocamara.org/
=============================