[R] Sanity check in loading large dataframe

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Thu Aug 5 18:01:52 CEST 2021


Duncan answered part of your question. My feedback is to consider looking at
your data using other tools besides str(). 

Base R has functions that list the row or column names, count them, report
what types they are, and so forth.

Printing an entire large object is hard but printing many subsets can give
you a handle on it.

You may also want to use packages in the tidyverse such as dplyr and work
with tibbles as a mild variation on a data.frame.
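
As a rough sketch of what that looks like, assuming the dplyr package is
installed (the small data.frame here is made up as a stand-in for yours):

```r
# Toy stand-in for the real data; the column names are invented.
df <- data.frame(record_id = 1:3, dose = c(0.5, NA, 1.5))

if (requireNamespace("dplyr", quietly = TRUE)) {
  tb <- dplyr::as_tibble(df)  # a tibble prints only what fits on screen
  print(tb)
  dplyr::glimpse(tb)          # one line per column, similar in spirit to str()
}
```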

I am not sure what you are hoping to do with str() besides getting the
number of rows and columns but consider:
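
For the dimensions alone, something like the following should do. This is
only a sketch on a small made-up data.frame, with `df` standing in for
yours:

```r
# Toy stand-in for the real data; the column names are invented.
df <- data.frame(record_id = c(1L, 1L, 2L),
                 dose      = c(0.5, NA, 1.5),
                 drug      = c("a", "b", NA))

dim(df)   # number of rows and number of columns together
nrow(df)  # number of rows only
ncol(df)  # number of columns only
```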


To get names: 
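
A sketch on a toy data.frame (the second column name is borrowed from your
str() output, the values are invented):

```r
# Toy stand-in for the real data.
df <- data.frame(record_id = 1:3,
                 v1_medicamento___aceta = c(1L, NA, NA))

names(df)      # column names
colnames(df)   # the same thing for a data.frame
rownames(df)   # row names, returned as character strings

# With hundreds of columns, searching the names is often more useful
# than printing them all.
med_cols <- grep("medicamento", names(df), value = TRUE)
med_cols
```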

To get information about every column in your data.frame at once, apply-style
functions like this can be used:
	sapply(df, typeof)

The above will tell you, for each column, whether it is stored as an integer,
a double, a character vector and so on.
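
With 626 columns the raw result is long, so it may help to summarize it.
A minimal sketch, again on an invented data.frame:

```r
# Toy stand-in for the real data; the column names are invented.
df <- data.frame(record_id = 1:3,
                 dose      = c(0.5, NA, 1.5),
                 drug      = c("a", "b", NA),
                 stringsAsFactors = FALSE)

types <- sapply(df, typeof)

table(types)  # how many columns there are of each storage type

# Columns read as text; a numeric column showing up here can hint at a
# stray separator or other junk in the input file.
char_cols <- names(df)[types == "character"]
char_cols
```
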
To do more interesting things there are packages. The psych package, for
example, lets you get some metrics about each column:
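
A sketch of what I have in mind, assuming the psych package is installed;
the fallback to summary() is my own addition so the example still runs
without it, and the data are invented:

```r
# Toy stand-in for the real data; the column names are invented.
df <- data.frame(record_id = 1:4, dose = c(0.5, NA, 1.5, 2.5))

if (requireNamespace("psych", quietly = TRUE)) {
  stats <- psych::describe(df)  # n, mean, sd, median, min, max and more per column
} else {
  stats <- summary(df)          # base R fallback when psych is not installed
}
print(stats)
```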

And you can use various methods of subsetting to limit what you are looking
at and only show or print a manageable amount.
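
For instance, on a made-up data.frame with the same shape as yours (302 rows
by 626 columns; the values and the default V1, V2, ... names are invented):

```r
# Invented data with the same dimensions as the data.frame in the question.
df <- as.data.frame(matrix(seq_len(302 * 626), nrow = 302))

head(df[, 1:5])           # first six rows of the first five columns
df[1:3, c("V1", "V626")]  # pick rows by position and columns by name
str(df[, 600:626])        # str() on a slice small enough to read
```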

Your subject line asks about sanity checking, and that depends on what you
want to check. It can include making sure each column has the expected data
type, looking for NA values, or even removing outliers. Tools exist for much
of that, including the few I mention.

Your data may seem huge but I have worked on much larger ones. One
suggestion is to trim the data before working on it IF some of it is not
needed. Both base R and the tidyverse have a lot to offer for that.
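
A few checks of that kind, sketched in base R. The data.frame, the column
names and the `expected` types below are all invented for illustration;
you would substitute what you know about your own file:

```r
# Toy stand-in for the real data; "x" plays the role of a bad entry.
df <- data.frame(record_id = c(1L, 1L, 2L, 3L),
                 dose      = c(0.5, NA, 1.5, 250),
                 visit     = c("1", "2", "x", "4"),
                 stringsAsFactors = FALSE)

# 1. How many NA values does each column hold?
na_per_col <- colSums(is.na(df))
na_per_col[na_per_col > 0]

# 2. Do the columns have the types you expect? (hypothetical expectations)
expected   <- c(record_id = "integer", dose = "double", visit = "integer")
actual     <- sapply(df, typeof)
mismatched <- names(expected)[actual[names(expected)] != expected]
mismatched

# 3. Which entries stop a column from being read as numbers?
bad <- df$visit[is.na(suppressWarnings(as.integer(df$visit)))]
bad

# 4. Trim to the columns you actually need before going further.
small <- df[, c("record_id", "dose")]
```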

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Luigi Marongiu
Sent: Thursday, August 5, 2021 9:16 AM
To: r-help <r-help using r-project.org>
Subject: [R] Sanity check in loading large dataframe

I am using a large spreadsheet (over 600 variables).
I tried `str` to check the dimensions of the spreadsheet and I got
```
> (str(df))
'data.frame': 302 obs. of  626 variables:
 $ record_id                 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ v1_medicamento___aceta    : int  1 NA NA NA NA NA NA NA NA NA ...
  [list output truncated]
```
I understand that `[list output truncated]` means that there are more
variables than those allowed by str to be displayed as rows. Thus I
increased the row's output with:

> (str(df, list.len=1000))
'data.frame': 302 obs. of  626 variables:
 $ record_id                 : int  1 1 1 1 1 1 1 1 1 1 ...

Does `NULL` mean that some of the variables are not closed? (perhaps a
missing comma somewhere) Is there a way to check the sanity of the data and
avoid that some separator is not in the right place?
Thank you

Best regards,

