ordinalClust is an R package to perform classification, clustering and co-clustering of ordinal data. Furthermore, it can handle different numbers of levels and missing values. The ordinal data is considered to follow a BOS distribution [@biernacki16], which is specific to this kind of data. The Latent Block Model is used for performing co-clustering [@jacques17].

```
set.seed(1)
```

```
library(ordinalClust)
```

The package contains real datasets created from [@Anota17]. They relate to quality of life questionnaires for patients affected by breast cancer.

**dataqol** is a dataframe with 121 lines such that each line represents a patient and the columns contain information about the patient:

- Id: patient Id
- q1-q28: responses to 28 questions with the number of levels equal to 4
- q29-q30: responses to 2 questions with the number of levels equal to 7

**dataqol.classif** is a dataframe with 40 lines such that a line represents a patient and the columns contain information about the patient:

- Id: patient Id
- q1-q28: responses to 28 questions with the number of levels equal to 4
- q29-q30: responses to 2 questions with the number of levels equal to 7
- death: if the patient survived (1) or not (2).
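As a quick check of the layout described above, both datasets can be loaded and inspected. This is a minimal sketch that relies only on the column names listed above and on base R functions.

```r
library(ordinalClust)

# load both datasets shipped with the package
data("dataqol")
data("dataqol.classif")

# one row per patient
dim(dataqol)
dim(dataqol.classif)

# first columns: patient Id followed by the first questions
head(dataqol[, 1:5])

# class balance of the survival label (1 = survived, 2 = died)
table(dataqol.classif$death)
```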

To simulate a sample of ordinal data following the BOS distribution, the function **pejSim** is used.

This snippet of code creates a sample of ordinal data with 7 levels that follows a BOS distribution parameterized by mu=5 and pi=0.5:

```
m=7
nr=10000
mu=5
pi=0.5
probaBOS=rep(0,m)
for (im in 1:m) probaBOS[im]=pejSim(im,m,mu,pi)
M <- sample(1:m,nr,prob = probaBOS, replace=TRUE)
```

To plot the resulting distribution, the **ggplot2** library can be used.
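A minimal sketch of such a plot, assuming **ggplot2** is installed. The block repeats the simulation so that it runs on its own. The data frame `df` and its column name `level` are illustrative choices, not part of the package.

```r
library(ordinalClust)
library(ggplot2)

set.seed(1)
m <- 7
nr <- 10000
mu <- 5
pi <- 0.5

# theoretical BOS probability of each level, as in the snippet above
probaBOS <- sapply(1:m, function(im) pejSim(im, m, mu, pi))
M <- sample(1:m, nr, prob = probaBOS, replace = TRUE)

# one row per simulated response; the factor keeps all m levels on the axis
df <- data.frame(level = factor(M, levels = 1:m))

p <- ggplot(df, aes(x = level)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  labs(x = "level", y = "count",
       title = "Simulated BOS sample (mu = 5, pi = 0.5)")
print(p)
```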

In this section, clustering is performed on the **dataqol** dataset. The purpose of clustering is to reveal structure among the rows of the matrix, that is, to group patients with similar response patterns.

```
set.seed(0)
library(ordinalClust)
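# loading the real dataset
data("dataqol")
# loading the ordinal data (questions q1-q28)
M <- as.matrix(dataqol[,2:29])
# defining the number of categories
m = 4
# defining the number of row clusters
krow = 3
# configuration for the inference
nbSEM=100
nbSEMburn=90
nbindmini=2
init = "randomBurnin"
percentRandomB = c(30)
# Clustering execution
object <- bosclust(x=M,kr=krow, m=m, nbSEM=nbSEM,
    nbSEMburn=nbSEMburn, nbindmini=nbindmini,
    percentRandomB=percentRandomB, init=init)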
```

```
plot(object)
```
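Beyond the plot, it is often useful to look at the size of each cluster. The sketch below is hedged: `cluster_sizes` is a hypothetical helper, and it assumes that the fitted object is an S4 object storing the row partition in a slot named `zr`. The `zr_topredict` slot used later for predictions suggests this name, but it is not confirmed here. A toy S4 class stands in for the fitted model so that the block runs without fitting anything.

```r
library(methods)

# hypothetical helper: tabulate the labels stored in one slot of an S4 object
cluster_sizes <- function(object, slot_name = "zr") {
  stopifnot(isS4(object), slot_name %in% slotNames(object))
  table(slot(object, slot_name))
}

# toy stand-in for a fitted model (assumption: partition stored in `zr`)
setClass("ToyClustering", representation(zr = "numeric"))
toy <- new("ToyClustering", zr = c(1, 1, 2, 3, 3, 3))

sizes <- cluster_sizes(toy)
print(sizes)
```

With the model fitted above, `cluster_sizes(object)` would give the number of patients per row cluster, provided the `zr` assumption holds. Calling `slotNames(object)` lists the slots that are actually available.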

In this example, co-clustering is performed on the **dataqol** dataset. The aim of co-clustering here is to detect structure in both the rows and the columns of the data simultaneously.

```
set.seed(0)
library(ordinalClust)
# loading the real dataset
data("dataqol")
# loading the ordinal data
M <- as.matrix(dataqol[,2:29])
# defining the number of categories:
m=4
# defining number of row and column clusters
krow = 3
kcol = 3
# configuration for the inference
nbSEM=100
nbSEMburn=90
nbindmini=2
init = "randomBurnin"
percentRandomB = c(30, 30)
# Co-clustering execution
object <- boscoclust(x = M,kr = krow, kc = kcol, m = m,
nbSEM = nbSEM, nbSEMburn = nbSEMburn,
nbindmini = nbindmini, init = init,
percentRandomB = percentRandomB)
```

This snippet of code shows how to visualize the resulting co-clustering, using the **plot** function:

```
plot(object)
```

In this section, the dataset **dataqol.classif** is used. It contains the responses to a questionnaire by 40 patients affected by breast cancer. Furthermore, a column labeled **death** indicates whether the patient died from the disease (2) or not (1). The aim of this section is to predict the classes of a validation dataset from a training dataset.

The classification function **bosclassif** provides two classification models. The first model (chosen with the option kc=0) is a multivariate BOS model that assumes the features are independent conditionally on the class of the observations. The second model is a parsimonious version of the first. Parsimony is introduced by grouping the features into clusters (as in co-clustering) and assuming that the features of a cluster share a common distribution. The number L of clusters of features is set with the option kc=L. In practice, L can be chosen by cross-validation, as shown in the following example:

```
set.seed(1)
library(ordinalClust)
# loading the real dataset
data("dataqol.classif")
# loading the ordinal data
M <- as.matrix(dataqol.classif[,2:29])
# creating the classes values
y <- as.vector(dataqol.classif$death)
# sampling datasets for training and to predict
nb.sample <- ceiling(nrow(M)*7/10)
sample.train <- sample(1:nrow(M), nb.sample, replace=FALSE)
M.train <- M[sample.train,]
M.validation <- M[-sample.train,]
nb.missing.validation <- length(which(M.validation==0))
y.train <- y[sample.train]
y.validation <- y[-sample.train]
# number of classes to predict
kr <- 2
# configuration for SEM algorithm
nbSEM=200
nbSEMburn=175
nbindmini=2
init="randomBurnin"
percentRandomB = c(50, 50)
# different kc to test with cross-validation
kcol <- c(0,1,2,3)
m <- 4
# matrix that contains the predictions for all different kc
preds <- matrix(0,nrow=length(kcol),ncol=nrow(M.validation))
for(kc in 1:length(kcol)){
res <- bosclassif(x=M.train, y=y.train,
kr=kr, kc=kcol[kc], m=m,
nbSEM=nbSEM, nbSEMburn=nbSEMburn,
nbindmini=nbindmini, init=init, percentRandomB=percentRandomB)
new.prediction <- predict(res, M.validation)
preds[kc,] <- new.prediction@zr_topredict
}
preds = as.data.frame(preds)
# label each row with the kc value it corresponds to
rownames(preds) <- paste0("kc=", kcol)
```

```
library(caret)
actual <- y.validation -1
specificities <- rep(0,length(kcol))
sensitivities <- rep(0,length(kcol))
for(i in 1:length(kcol)){
prediction <- unlist(as.vector(preds[i,])) -1
u <- union(prediction, actual)
conf_matrix<-table(factor(prediction, u),factor(actual, u))
sensitivities[i] <- recall(conf_matrix)
specificities[i] <- specificity(conf_matrix)
}
sensitivities
```

```
## [1] 1.0 0.5 1.0 1.0
```

```
specificities
```

```
## [1] 0.125 0.625 0.375 0.125
```
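To turn these two vectors into a single choice of kc, one possible criterion is the balanced accuracy, the mean of sensitivity and specificity. This criterion is an assumption of the sketch, not something prescribed by the package. The values below are copied from the output above.

```r
kcol <- c(0, 1, 2, 3)

# values reported above, one per tested kc
sensitivities <- c(1.0, 0.5, 1.0, 1.0)
specificities <- c(0.125, 0.625, 0.375, 0.125)

# balanced accuracy: mean of sensitivity and specificity
balanced <- (sensitivities + specificities) / 2
names(balanced) <- paste0("kc=", kcol)
balanced

# kc with the highest balanced accuracy on this validation split
best <- names(balanced)[which.max(balanced)]
best
```

On this split the criterion selects kc=2. The validation set holds only 12 of the 40 patients, so the ranking may change with another seed or another split.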

The package can deal with ordinal data with different numbers of levels. In this section, we show how to introduce these kinds of datasets in a co-clustering context.

In this example, co-clustering is performed on the **dataqol** dataset, using both the questions with 4 levels and the questions with 7 levels. The function **boscoclust** is called with the `idx_list` argument, **which might take a few minutes**.

```
set.seed(0)
library(ordinalClust)
# loading the real dataset
data("dataqol")
# loading the ordinal data
M <- as.matrix(dataqol[,2:31])
# defining different number of categories:
m=c(4,7)
# defining number of row and column clusters
krow = 3
kcol = c(3,1)
# configuration for the inference
nbSEM=50
nbSEMburn=40
nbindmini=2
init='random'
d.list <- c(1,29)
# Co-clustering execution
object <- boscoclust(x=M,kr=krow,kc=kcol,m=m, idx_list=d.list,
nbSEM=nbSEM,nbSEMburn=nbSEMburn,
nbindmini=nbindmini, init=init)
```
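The values `m = c(4,7)` and `d.list = c(1,29)` can be derived rather than typed by hand. The sketch below assumes, as the example suggests, that `idx_list` holds the index of the first column of each group of features sharing the same number of levels, and that such columns are contiguous in the matrix. The vector `levels.per.col` is a hypothetical input describing the 30 questions.

```r
# number of levels of each column of M: q1-q28 have 4 levels, q29-q30 have 7
levels.per.col <- c(rep(4, 28), rep(7, 2))

# run-length encoding groups contiguous columns with the same number of levels
runs <- rle(levels.per.col)

# number of levels of each group
m <- runs$values
# index of the first column of each group
d.list <- cumsum(c(1, head(runs$lengths, -1)))

m
d.list
```

The vector `kcol` then needs one entry per group, as in `kcol = c(3,1)` above.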