This vignette explores methods and statistical tests for comparing two independent or paired groups using the groupcompare package in R. It covers parametric and non-parametric tests, permutation methods, and bootstrap approaches. Using simulated, normally distributed data for two groups, it walks step by step through data preparation, visualization, and analysis. The centerpiece is the core function groupcompare, which compares group statistics and returns results that are then interpreted in detail. The vignette also shows how the tests can be adapted to other statistics and to specific analytical scenarios.
The latest version of the package can be installed from CRAN with the following command:
install.packages("groupcompare", dependencies=TRUE)
Once groupcompare is installed, you can load it into the R session with the following command:
library("groupcompare")
The dataset to be analyzed should be in long format, with the observed values in the first column and the group names in the second column. In the following code chunk, a dataset named ds1 is created using the ghdist function, which simulates data from a g-and-h (G&H) distribution. The generated dataset contains two groups named grp1 and grp2, each consisting of 25 observations simulated with a mean of 50 and a standard deviation of 2. Setting the skewness (g) and kurtosis (h) arguments to zero makes the simulated data follow a normal distribution, so both groups are drawn with the same population mean and variance. The generated dataset ds1 is in wide format; it is then converted to long format with the wide2long function to create the dataset ds2. As the output shows, in long format the first column contains the observed values and the second column contains the group names or labels. Groups with different characteristics can be created by changing the mean, variance, skewness, and kurtosis parameters.
set.seed(30) # For reproducibility
grp1 <- ghdist(25, 50, 2, g=0, h=0)
grp2 <- ghdist(25, 50, 2, g=0, h=0)
# Data in wide format: one column per group
ds1 <- data.frame(grp1=grp1, grp2=grp2)
head(ds1)
## grp1 grp2
## 1 47.42296 47.80720
## 2 49.30462 48.93156
## 3 48.95674 47.15759
## 4 52.54695 47.51452
## 5 53.64904 50.46387
## 6 46.97738 46.54960
# Data in long format
ds2 <- wide2long(ds1)
head(ds2)
## obs group
## 1 47.42296 grp1
## 2 49.30462 grp1
## 3 48.95674 grp1
## 4 52.54695 grp1
## 5 53.64904 grp1
## 6 46.97738 grp1
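The reshaping step above can also be sketched in base R. This is only an illustration of the long layout, not the code behind wide2long; rnorm stands in for ghdist here, and the object names wide and long are assumptions made for the example.

```r
# Base R sketch of the wide-to-long reshaping (not the package's wide2long code).
# rnorm() is a stand-in for ghdist(); the names 'wide' and 'long' are illustrative.
set.seed(30)
wide <- data.frame(grp1 = rnorm(25, mean = 50, sd = 2),
                   grp2 = rnorm(25, mean = 50, sd = 2))

# Stack the value columns and repeat each column name once per observation
long <- data.frame(obs   = c(wide$grp1, wide$grp2),
                   group = rep(names(wide), each = nrow(wide)))

head(long)
table(long$group)  # number of observations per group
```

The first column holds the observations and the second holds the group labels, which is the layout described above.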
Before running statistical tests, it is good practice to visualize the data to gain insight into its structure and distribution. When two groups are compared with a parametric test such as the t-test, plots give a preliminary indication of whether the assumptions of the test are satisfied. The bivarplot function in the following code chunk produces several plots for examining and comparing the group data.
bivarplot(ds2)
The groupcompare function, whose usage is illustrated in the following code chunk, compares the groups in the dataset using various statistical tests and returns the results. Its statistic argument is the name of the function that calculates and returns the differences between the group statistics being compared. In the example, this is calcstatdif, a function that calculates and returns the differences between group means, medians, interquartile ranges, and variances. Among the other arguments, cl sets the confidence level, and R sets the number of resamples for the permutation tests and the bootstrap. The qtest argument controls whether a quantile comparison test is also performed; in the example it is set to FALSE. The plots argument indicates whether to generate the visualization of the group values shown in the example above, while setting verbose to TRUE prints all steps and analysis results in detail during the run. To save the results of the analysis, set the out argument to TRUE, which writes them to a file named *dataframe_name*.txt.
results <- groupcompare(ds2, alternative="two.sided", cl=0.95, qtest=FALSE,
  R=1000, plots=FALSE, out=FALSE, verbose=TRUE)
## n min max mean se trmean med mad
## grp1 25 46.33202 53.64904 49.53190 0.4049744 49.47496 49.30462 1.844974
## grp2 25 44.13116 54.33940 49.48523 0.4794747 49.45577 49.36481 2.517676
## skew kurt winsmean hubermean range iqr sd
## grp1 0.26693039 -0.8987614 49.52091 49.44165 7.31702 2.373299 2.024872
## grp2 -0.01674843 -0.5040030 49.46296 49.44833 10.20824 3.255761 2.397374
## P10 P20 P30 P40 P50 P60 P70 P80
## grp1 47.05897 47.84736 48.51477 48.83583 49.30462 49.91318 50.47382 51.52705
## grp2 46.68574 47.44314 48.13850 48.85853 49.36481 50.38383 50.45924 51.27453
## P90
## grp1 52.23230
## grp2 52.31066
##
## Shapiro-Wilk normality test
##
## data: grp1
## W = 0.97285, p-value = 0.7176
##
##
## Shapiro-Wilk normality test
##
## data: grp2
## W = 0.98871, p-value = 0.9911
## p.value
## 0.4356048
##
## Two Sample t-test
##
## data: grp1 and grp2
## t = 0.074356, df = 48, p-value = 0.941
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.215237 1.308571
## sample estimates:
## mean of x mean of y
## 49.53190 49.48523
##
## Wilcoxon rank sum exact test
##
## data: grp1 and grp2
## W = 314, p-value = 0.9847
## alternative hypothesis: true location shift is not equal to 0
## MEAN MED IQR VAR SKEW KURT
## 0.954 1.000 0.408 0.402 0.601 0.606
## mean.bca.lower mean.bca.upper med.bca.lower med.bca.upper
## -1.065792 1.219093 -1.731947 1.563317
The descriptives component of the result object is a data frame containing common descriptive statistics for the two compared groups. The inference component of the result list contains a written interpretation of the appropriate group comparison test, including its statistic and significance value.
results$descriptives
## n min max mean se trmean med mad
## grp1 25 46.33202 53.64904 49.53190 0.4049744 49.47496 49.30462 1.844974
## grp2 25 44.13116 54.33940 49.48523 0.4794747 49.45577 49.36481 2.517676
## skew kurt winsmean hubermean range iqr sd
## grp1 0.26693039 -0.8987614 49.52091 49.44165 7.31702 2.373299 2.024872
## grp2 -0.01674843 -0.5040030 49.46296 49.44833 10.20824 3.255761 2.397374
results$meanstest
## t t.p
## 0.07435647 0.94103576
results$inference
## [1] "According to the t-test, the difference of the group means is not significantly different from zero, t(25,25)=0.074, p=0.941."
The meanstest component of the result contains the test statistic and the significance probability (p-value) of the group comparison test, evaluated at a Type I error rate of alpha = 1 - cl. In the output, test holds the result of the test used for inference. Specifically, if the assumptions of the parametric t-test are met, the t-test results are returned; if not, non-parametric test results are provided. If the distributions are not similar, results from bootstrap and permutation tests are presented.
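The choice between a parametric and a non-parametric test described above can be sketched with base R tests. The decision rules below (Shapiro-Wilk for normality, an F test for equal variances, both judged at alpha = 1 - cl) are assumptions made for illustration; the exact rules applied inside groupcompare are not spelled out here and may differ. The bootstrap and permutation branch is left out of this sketch.

```r
# Sketch of an assumption-driven test choice; the rules are illustrative
# assumptions, not the documented internals of groupcompare.
set.seed(1)
x <- rnorm(25, mean = 50, sd = 2)
y <- rnorm(25, mean = 50, sd = 2)
cl <- 0.95
alpha <- 1 - cl

# Normality of each group (Shapiro-Wilk) and equality of variances (F test)
normal_x  <- shapiro.test(x)$p.value > alpha
normal_y  <- shapiro.test(y)$p.value > alpha
equal_var <- var.test(x, y)$p.value > alpha

if (normal_x && normal_y) {
  # Student's t-test if variances look equal, Welch's t-test otherwise
  res <- t.test(x, y, var.equal = equal_var, conf.level = cl)
} else {
  # Rank-based alternative when normality is doubtful
  res <- wilcox.test(x, y)
}

res$method
res$p.value
```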
In the output, t and t.p are the t-statistic and the p-value of the t-test, respectively; W and W.p are the W-statistic and the p-value of the corresponding non-parametric test, respectively. In the results, per.mean and per.med are the p-values obtained from the permutation test for the mean and the median, respectively. Finally, mean.bca.lower and mean.bca.upper, along with med.bca.lower and med.bca.upper, are the lower and upper limits of the confidence intervals calculated by the BCa method at the cl confidence level for the differences in the mean and the median, respectively.
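To make the permutation p-values more concrete, the following base R sketch computes a two-sided permutation p-value for the difference in means. It illustrates the general technique only; the variable names are assumptions, and the resampling scheme inside groupcompare may differ in its details.

```r
# Two-sided permutation test for the difference in group means (base R sketch).
set.seed(2)
x <- rnorm(25, mean = 50, sd = 2)
y <- rnorm(25, mean = 50, sd = 2)

pooled   <- c(x, y)
n1       <- length(x)
observed <- mean(x) - mean(y)
R        <- 1000  # number of random relabelings

# Reassign group labels at random and recompute the mean difference each time
perm <- replicate(R, {
  idx <- sample(length(pooled), n1)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Share of relabelings at least as extreme as the observed difference;
# adding 1 to numerator and denominator keeps the p-value above zero
p_perm <- (sum(abs(perm) >= abs(observed)) + 1) / (R + 1)
p_perm
```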
With the groupcompare
function of the package, it is also possible to test whether the differences in statistics other than the mean and median are significant. To do this, an external R function that calculates the differences for the interested statistics should be coded and assigned to the statistic
argument of the function. Once this function code is loaded into the global environment in the R environment and its name is assigned to the statistic argument, group comparisons for the desired statistics can be performed.
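As a rough idea of what such a function might compute, the sketch below returns differences in the trimmed mean and the median absolute deviation between two groups. The function name statdif_custom, its single long-format data frame argument, and its named-vector return value are all hypothetical; the signature that groupcompare actually expects from a statistic function should be checked in the package documentation (for example, by inspecting calcstatdif) before adapting this.

```r
# Hypothetical helper: the name, argument layout, and return shape are
# assumptions for illustration, not the interface required by groupcompare.
statdif_custom <- function(data) {
  # Column 1 holds the observations, column 2 the group labels (long format)
  groups <- split(data[[1]], data[[2]])
  stopifnot(length(groups) == 2)
  a <- groups[[1]]
  b <- groups[[2]]
  c(TRMEAN = mean(a, trim = 0.1) - mean(b, trim = 0.1),  # 10% trimmed means
    MAD    = mad(a) - mad(b))                            # median absolute deviations
}

# Small toy dataset in long format
toy <- data.frame(obs   = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10),
                  group = rep(c("g1", "g2"), each = 5))
dif <- statdif_custom(toy)
dif
```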
The calcstatdif function in the package is itself such a function. It is used for the permutation and bootstrap tests that accompany the t-test and the non-parametric test for the mean and median. It also applies permutation and bootstrap tests to differences in the interquartile range (IQR) and variance (VAR) of the groups. To see these outputs, set the verbose argument of the groupcompare function to TRUE, as in the example call above. In the output, MEAN, MED, IQR, VAR, SKEW and KURT show the p-values from the permutation test and the bootstrap confidence intervals for the differences in the mean, median, interquartile range, variance, skewness and kurtosis, respectively. These results can also be saved to a file in the working directory by setting the out argument to TRUE.
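The idea of bootstrapping differences in spread can be sketched in base R as follows. Note that this sketch uses simple percentile intervals rather than the BCa intervals reported by the package, and the resampling scheme (independent resampling within each group) is an assumption made for illustration.

```r
# Percentile bootstrap intervals for differences in IQR and variance.
# This is a simpler stand-in for BCa intervals, not the package's method.
set.seed(3)
x  <- rnorm(25, mean = 50, sd = 2)
y  <- rnorm(25, mean = 50, sd = 2)
cl <- 0.95
R  <- 1000  # number of bootstrap resamples

# Resample each group with replacement and recompute both differences
boot_dif <- replicate(R, {
  xb <- sample(x, replace = TRUE)
  yb <- sample(y, replace = TRUE)
  c(IQR = IQR(xb) - IQR(yb), VAR = var(xb) - var(yb))
})

# Rows of 'ci' are the lower and upper limits; columns are the statistics
probs <- c((1 - cl) / 2, 1 - (1 - cl) / 2)
ci <- apply(boot_dif, 1, quantile, probs = probs)
ci
```

An interval that contains zero is consistent with no difference between the groups for that statistic at the chosen confidence level.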
To compare the groups in terms of their quantiles, set the qtest argument of the groupcompare function to TRUE and pass the quantiles to be compared as a percentile vector to the q argument.
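A quantile comparison of this kind can be sketched in base R with a permutation test on the quantile differences. This is not the procedure run by qtest, whose details are not described here; the variable names are illustrative, and whether the q argument expects proportions (as quantile does below) or percentages should be checked in the package documentation.

```r
# Permutation test on differences between selected quantiles (base R sketch).
set.seed(4)
x <- rnorm(25, mean = 50, sd = 2)
y <- rnorm(25, mean = 50, sd = 2)
q <- c(0.25, 0.50, 0.75)  # quantiles to compare, given as proportions

# Observed differences between the group quantiles
qdif <- quantile(x, probs = q) - quantile(y, probs = q)

pooled <- c(x, y)
n1     <- length(x)
R      <- 1000

# Each column of 'perm' holds the quantile differences for one random relabeling
perm <- replicate(R, {
  idx <- sample(length(pooled), n1)
  quantile(pooled[idx], probs = q) - quantile(pooled[-idx], probs = q)
})

# Two-sided permutation p-value for each quantile
p_q <- (rowSums(abs(perm) >= abs(qdif)) + 1) / (R + 1)
rbind(difference = qdif, p.value = p_q)
```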