[R] Sanity check in loading large dataframe

Bert Gunter bgunter.4567 using gmail.com
Fri Aug 6 13:28:46 CEST 2021


... but remove the which() and use logical indexing ...  ;-)
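
A minimal sketch of what that might look like. The toy data frame and the threshold max_na are made up for illustration (max_na stands in for the nnn placeholder in the quoted code below); only the indexing idiom is the point.

```r
# Toy data frame standing in for the poster's `data` (hypothetical values).
data <- data.frame(
  id = 1:5,
  a  = c(1, NA, 3, NA, 5),
  b  = c(NA, NA, NA, NA, NA),
  c  = c(2, 4, 6, 8, 10)
)

# Hypothetical threshold standing in for `nnn`:
# keep columns with fewer than 3 missing values.
max_na <- 3

# Logical vector with one element per column; no which() needed.
keep <- colSums(is.na(data)) < max_na

# drop = FALSE keeps the result a data frame even if only one column survives.
cleaned.data <- data[, keep, drop = FALSE]

names(cleaned.data)
```

Besides being shorter, the logical form avoids a well-known trap with the negated variant: data[, -which(cond)] returns zero columns when no column satisfies cond, because -integer(0) is still integer(0), whereas data[, !cond] keeps everything as intended.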

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Aug 6, 2021 at 12:57 AM PIKAL Petr <petr.pikal using precheza.cz> wrote:

> Hi
>
> You already got an answer from Avi. I often use dim(data) to inspect how many
> rows/columns I have.
> After that I check whether any columns contain all or many NA values.
>
> colSums(is.na(data))
> keep <- which(colSums(is.na(data))<nnn)
> cleaned.data <- data[, keep]
>
> Cheers
> Petr
>
>
> > -----Original Message-----
> > From: R-help <r-help-bounces using r-project.org> On Behalf Of Luigi Marongiu
> > Sent: Friday, August 6, 2021 7:34 AM
> > To: Duncan Murdoch <murdoch.duncan using gmail.com>
> > Cc: r-help <r-help using r-project.org>
> > Subject: Re: [R] Sanity check in loading large dataframe
> >
> > Ok, so nothing to worry about. Yet, are there other checks I can implement?
> > Thank you
> >
> > On Thu, 5 Aug 2021, 15:40 Duncan Murdoch, <murdoch.duncan using gmail.com>
> > wrote:
> >
> > > On 05/08/2021 9:16 a.m., Luigi Marongiu wrote:
> > >  > Hello,
> > >  > I am using a large spreadsheet (over 600 variables).
> > >  > I tried `str` to check the dimensions of the spreadsheet and I got
> > >  > ```
> > >  >> (str(df))
> > >  > 'data.frame': 302 obs. of  626 variables:
> > >  >   $ record_id                 : int  1 1 1 1 1 1 1 1 1 1 ...
> > >  > ....
> > >  > $ v1_medicamento___aceta    : int  1 NA NA NA NA NA NA NA NA NA ...
> > >  >    [list output truncated]
> > >  > NULL
> > >  > ```
> > >  > I understand that `[list output truncated]` means that there are more
> > >  > variables than those allowed by str to be displayed as rows. Thus I
> > >  > increased the row's output with:
> > >  > ```
> > >  >
> > >  >> (str(df, list.len=1000))
> > >  > 'data.frame': 302 obs. of  626 variables:
> > >  >   $ record_id                 : int  1 1 1 1 1 1 1 1 1 1 ...
> > >  > ...
> > >  > NULL
> > >  > ```
> > >  >
> > >  > Does `NULL` mean that some of the variables are not closed? (perhaps a
> > >  > missing comma somewhere)
> > >  > Is there a way to check the sanity of the data and avoid that some
> > >  > separator is not in the right place?
> > >  > Thank you
> > >
> > > The NULL is the value returned by str().  Normally it is not printed,
> > > but when you wrap str in parens as (str(df, list.len=1000)), that
> > > forces the value to print.
> > >
> > > str() is unusual among R functions in that it prints to the console as it
> > > runs and returns only an invisible NULL.  Many other functions construct a
> > > value which is only displayed if you print it, but something like
> > >
> > > x <- str(df, list.len=1000)
> > >
> > > will print the same as if there was no assignment, and then assign
> > > NULL to x.
> > >
> > > Duncan Murdoch
> > >
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.