[R] aggregate produces results in unexpected format

Rui Barradas
Wed Dec 11 23:17:10 CET 2024


At 20:31 on 11/12/2024, Sorkin, John wrote:
> I am trying to use the aggregate function to run a function, catsbyday2, that produces the mean, minimum, maximum, and number of observations of the values in a dataframe, inJan2Test, by levels of the dataframe variable MyDay. The output should be in the form of a dataframe.
> 
> #my code:
> # This function should process a data frame and return a data frame
> # containing the mean, minimum, maximum, and number of observations
> # in the data frame for each level of MyDay.
> catsbyday2 <- function(df){
>    # Create a matrix to hold the calculated values.
>    xx <- matrix(nrow=1,ncol=4)
>    # Give names to the columns.
>    colnames(xx) <- c("Mean","min","max","Nobs")
>    cat("This is the matrix that will hold the results\n",xx,"\n")
> 
>    # For each level of the indexing variable, MyDay, compute the
>    # mean, minimum, maximum, and number of observations in the
>    # dataframe passed to the function.
>    xx[,1] <- mean(df)
>    xx[,2] <- min(df)
>    xx[,3] <- max(df)
>    xx[,4] <- length(df)
>    cat("These are the dimensions of the matrix in the function",dim(xx),"\n")
>    print(xx)
>    return(xx)
> }
> 
> # Create data frame
> inJan2Test <- data.frame(MyDay=rep(c(1,2,3),4),AveragePM2_5=c(10,20,30,
>                                                                11,21,31,
>                                                                12,22,32,
>                                                                15,25,35))
> str(inJan2Test)
> cat("This is the data frame","\n")
> inJan2Test
> 
> xx <- aggregate(inJan2Test[,"AveragePM2_5"],list(inJan2Test[,"MyDay"]),catsbyday2,simplify=FALSE)
> xx
> class(xx)
> str(xx)
> names(xx)
> 
> # Create a data frame in the format that I expect aggregate would return
> examplar <- data.frame(mean=c(12,22,32),min=c(10,20,30),max=c(15,25,35),length=c(4,4,4))
> examplar
> str(examplar)
> 
> 
> While the output is correct (the mean, min, etc. are correctly calculated), the format of the output is not what I want.
> 
> (1) Although the returned object appears to be a data frame, it does not appear to be a "normal" data frame. (see the output of
> (2) The column names I define in the function are not part of the data frame that is created.
> (3) The returned values on each row are separated by commas. I would expect them to be separated by spaces.
> (4) When I run str() on the output it appears that the output dataframe contains a list.
>> str(xx)
> 'data.frame':	3 obs. of  2 variables:
>   $ Group.1: num  1 2 3
>   $ x      :List of 3
>    ..$ : num [1, 1:4] 12 10 15 4
>    .. ..- attr(*, "dimnames")=List of 2
>    .. .. ..$ : NULL
>    .. .. ..$ : chr [1:4] "Mean" "min" "max" "Nobs"
>    ..$ : num [1, 1:4] 22 20 25 4
>    .. ..- attr(*, "dimnames")=List of 2
>    .. .. ..$ : NULL
>    .. .. ..$ : chr [1:4] "Mean" "min" "max" "Nobs"
>    ..$ : num [1, 1:4] 32 30 35 4
>    .. ..- attr(*, "dimnames")=List of 2
>    .. .. ..$ : NULL
>    .. .. ..$ : chr [1:4] "Mean" "min" "max" "Nobs"
> 
> I want it to simply be a numeric dataframe:
> 
> mean  min  max  length
>   12   10   15       4
>   22   20   25       4
>   32   30   35       4
> 
> which should return the following str
> 
> examplar <- data.frame(mean=c(12,22,32),min=c(10,20,30),max=c(15,25,35),length=c(4,4,4))
> examplar
> str(examplar)
> 
> 'data.frame':	3 obs. of  4 variables:
>   $ mean  : num  12 22 32
>   $ min   : num  10 20 30
>   $ max   : num  15 25 35
>   $ length: num  4 4 4
> 
> John David Sorkin M.D., Ph.D.
> 
Hello,

The code can be made much simpler. The summary statistics function
becomes a one-liner that just computes and returns a named vector.
However, aggregate then stores the statistics in a matrix: the last
column of the result is a matrix column, and if you print the result,
agg, you will see the name AveragePM2_5 with the suffixes "mean",
"min", "max" and "nobs" appended.
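
To make the contrast concrete, here is a minimal sketch comparing the list column from the question with the matrix column described above. The helper names as_matrix and as_vector are illustrative only and do not appear in the thread; the data are copied from the original post.

```r
# Data from the original post.
inJan2Test <- data.frame(
  MyDay = rep(c(1, 2, 3), 4),
  AveragePM2_5 = c(10, 20, 30, 11, 21, 31, 12, 22, 32, 15, 25, 35)
)

# Approach from the question (condensed): return a 1 x 4 matrix per group
# and call aggregate with simplify = FALSE.
as_matrix <- function(x) {
  matrix(c(mean(x), min(x), max(x), length(x)), nrow = 1,
         dimnames = list(NULL, c("Mean", "min", "max", "Nobs")))
}
xx <- aggregate(inJan2Test[, "AveragePM2_5"], list(inJan2Test[, "MyDay"]),
                as_matrix, simplify = FALSE)
is.list(xx$x)                 # TRUE: one matrix per group, kept in a list

# Named-vector approach: return a named vector and keep the default simplify.
as_vector <- function(x) {
  c(mean = mean(x), min = min(x), max = max(x), nobs = length(x))
}
agg <- aggregate(AveragePM2_5 ~ MyDay, inJan2Test, FUN = as_vector)
is.matrix(agg$AveragePM2_5)   # TRUE: a single 3 x 4 matrix column
colnames(agg$AveragePM2_5)    # "mean" "min" "max" "nobs"
```

In both cases the result has only two columns; what differs is whether the second one is a list or a numeric matrix.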

You can solve this by extracting that matrix column from the result and
cbind-ing it with the remaining columns of the agg data.frame.




catsbyday2 <- function(x) {
   c(mean = mean(x), min = min(x), max = max(x), nobs = length(x))
}

agg <- aggregate(AveragePM2_5 ~ MyDay, inJan2Test, FUN = catsbyday2)

# The 2nd column is a matrix 3x4
str(agg)
#> 'data.frame':    3 obs. of  2 variables:
#>  $ MyDay       : num  1 2 3
#>  $ AveragePM2_5: num [1:3, 1:4] 12 22 32 10 20 30 15 25 35 4 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:4] "mean" "min" "max" "nobs"

# this solves it, the method cbind.data.frame is
# called since the 1st argument is a df
cbind(agg[-ncol(agg)], agg[[ncol(agg)]])
#>   MyDay mean min max nobs
#> 1     1   12  10  15    4
#> 2     2   22  20  25    4
#> 3     3   32  30  35    4
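
As a quick sanity check, the flattened result can be compared with the layout requested in the question. This is only a sketch: the variable name flat is hypothetical, and the data and function are repeated so the block runs on its own.

```r
# Data and summary function as used in this thread.
inJan2Test <- data.frame(
  MyDay = rep(c(1, 2, 3), 4),
  AveragePM2_5 = c(10, 20, 30, 11, 21, 31, 12, 22, 32, 15, 25, 35)
)
catsbyday2 <- function(x) {
  c(mean = mean(x), min = min(x), max = max(x), nobs = length(x))
}
agg <- aggregate(AveragePM2_5 ~ MyDay, inJan2Test, FUN = catsbyday2)

# 'flat' is a hypothetical name for the flattened data.frame.
flat <- cbind(agg[-ncol(agg)], agg[[ncol(agg)]])

names(flat)                # "MyDay" "mean" "min" "max" "nobs"
all(sapply(flat, is.numeric))   # TRUE: every column is a plain numeric vector
dim(flat)                  # 3 5

# Drop the grouping column to get the four-column layout from the question.
flat[-1]
```

Unlike agg, every column of flat is an ordinary numeric vector, so str(flat) lists five variables rather than a nested matrix.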


# a data.frame
agg[-ncol(agg)]
#>   MyDay
#> 1     1
#> 2     2
#> 3     3

# the matrix column
agg[[ncol(agg)]]
#>      mean min max nobs
#> [1,]   12  10  15    4
#> [2,]   22  20  25    4
#> [3,]   32  30  35    4
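
An alternative that is not part of the reply above, offered only as a sketch: passing the aggregate result through do.call(data.frame, ...) also splits the matrix column into ordinary columns, but it keeps the AveragePM2_5 prefix in the column names. The name flat2 is hypothetical.

```r
# Data and summary function as used in this thread.
inJan2Test <- data.frame(
  MyDay = rep(c(1, 2, 3), 4),
  AveragePM2_5 = c(10, 20, 30, 11, 21, 31, 12, 22, 32, 15, 25, 35)
)
catsbyday2 <- function(x) {
  c(mean = mean(x), min = min(x), max = max(x), nobs = length(x))
}
agg <- aggregate(AveragePM2_5 ~ MyDay, inJan2Test, FUN = catsbyday2)

# data.frame() expands a matrix argument into one column per matrix column,
# prefixing each with the argument name.
flat2 <- do.call(data.frame, agg)

names(flat2)
str(flat2)
```

The cbind approach gives the shorter names mean, min, max and nobs; the do.call variant may be preferable when several variables are aggregated at once and the prefix is needed to tell their statistics apart.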



Hope this helps,

Rui Barradas

