[R] Summary tables of large datasets including character and numerical variables

Duncan Murdoch murdoch.duncan at gmail.com
Tue Dec 27 03:56:02 CET 2011


On 11-12-26 5:44 AM, sparandekar wrote:
> Hello !
>
> I am attempting to switch from being a long time SAS user to R, and would
> really appreciate a bit of help ! The first thing I do in getting a large
> dataset (thousands of observations and hundreds of variables) is to run a SAS
> PROC CONTENTS VARNUM command - this provides me a table with the
> name of each variable, its type and length;  then I run a PROC MEANS - for
> numerical variables it gives me a table with the number of non-missing
> values, min, max, mean and std. dev.  My data usually has errors and this
> first step helps me to spot the errors and 'clean' the dataset.
>
> The 'summary' function in R and other functions in the Hmisc or psych
> packages do not work for me.
>
> How can I get a table from an R data.frame that has the following structure
> (header row and example).
>
> Rowname  Character/Integer  Length  Non-Missing  Minimum       Maximum       Mean           SD
> HHID     Integer            12      32,344       114455007701  514756007812  2.345 x 10^10  1.456 x 10^10
> Head     Character          38      24,566       -             -             -              -

Using the tables package, you can get something like that as follows,
assuming that "df" is your dataframe:

library(tables)

nonmissing <- function(x) sum(!is.na(x))

tabular(All(df, character=TRUE) ~ (typeof + length + nonmissing + min +
max + mean + sd))

It isn't perfect:  it will skip anything that isn't numeric or character 
(e.g. factors).  There are ways to work around that, but they aren't as 
simple as you might like.  You can also use

sapply(df, class)

to see the classes of all the columns.
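
If the tables package is not an option, a similar table can be sketched in
base R. The helper name describe_columns and the toy data frame below are
hypothetical, made up for illustration. "Length" is taken here to mean the
widest value when converted to text, which is only an approximation of the
storage length that SAS reports. Missing values are dropped before
computing the statistics, and non-numeric columns (including factors) get
NA for min, max, mean and sd.

```r
# Hypothetical helper (not from any package): one summary row per column.
describe_columns <- function(df) {
  rows <- lapply(names(df), function(nm) {
    x <- df[[nm]]
    is_num <- is.numeric(x)
    n_ok <- sum(!is.na(x))
    # Numeric statistics are computed only for numeric columns with data.
    stat <- function(f) if (is_num && n_ok > 0) f(x, na.rm = TRUE) else NA_real_
    # Assumption: "Length" is the widest non-missing value as text.
    width <- if (n_ok > 0) max(nchar(as.character(x[!is.na(x)]))) else 0L
    data.frame(
      Variable   = nm,
      Type       = class(x)[1],
      Length     = width,
      NonMissing = n_ok,
      Minimum    = stat(min),
      Maximum    = stat(max),
      Mean       = stat(mean),
      SD         = stat(sd),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data standing in for the real dataset.
df <- data.frame(
  HHID   = c(101, 102, NA, 104),
  Head   = c("Ann", NA, "Raj", "Li"),
  Region = factor(c("N", "S", "S", NA)),
  stringsAsFactors = FALSE
)

print(describe_columns(df))
```

The result is an ordinary data frame, so it can be sorted or filtered, for
example to list only the columns whose NonMissing count is below nrow(df).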

Duncan Murdoch


