[R] Data cleaning & Data preparation, what do R users want?

Dominik Schneider dominik.schneider at colorado.edu
Thu Nov 30 10:11:02 CET 2017


I would agree that getting data into R from various sources is the biggest
pain point. Even if there is an API, the results are not always consistent
and you have to do lots of dimension checking to get it right. Or there
isn't an open API at all and you have to hack it by web scraping or
otherwise: http://enpiar.com/2017/08/11/one-hour-package/
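
As a rough illustration of the dimension checking meant here, a minimal
base R sketch follows. The `records` list is made-up data standing in for
a parsed API response, and `expected_fields` and `normalize_record()` are
hypothetical names chosen for this example; they do not belong to any
particular API or package.

```r
# Made-up stand-in for a parsed API response: a list of records.
# The second record is missing the "value" field.
records <- list(
  list(id = 1, name = "a", value = 10.5),
  list(id = 2, name = "b"),
  list(id = 3, name = "c", value = 7.2)
)

# Fields that downstream code is assumed to rely on.
expected_fields <- c("id", "name", "value")

# Pad missing (or non-scalar) fields with NA so that every record
# becomes a one-row data frame with the same columns.
normalize_record <- function(rec, fields) {
  out <- lapply(fields, function(f) {
    val <- rec[[f]]
    if (is.null(val) || length(val) != 1L) NA else val
  })
  names(out) <- fields
  as.data.frame(out, stringsAsFactors = FALSE)
}

rows <- lapply(records, normalize_record, fields = expected_fields)
result <- do.call(rbind, rows)

# Fail early if the shape is not what the rest of the script expects.
stopifnot(
  nrow(result) == length(records),
  identical(names(result), expected_fields)
)
```

Padding with NA keeps the gap visible in the data frame, rather than
silently dropping the incomplete record.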

On Thu, Nov 30, 2017 at 1:00 AM, Jim Lemon <drjimlemon at gmail.com> wrote:

> Hi again,
> Typo in the last email. Should read "about 40 standard deviations".
>
> Jim
>
> On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon <drjimlemon at gmail.com> wrote:
> > Hi Robert,
> > People want different levels of automation in the software they use.
> > What concerns many of us is the desire for the function
> > "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> > Such users typically want something that justifies its use by being
> > written by someone who seems to know what they're doing and lots of
> > other people use it. One advantage of many R functions is their
> > modular construction. This encourages users to at least consider the
> > steps that are taken rather than just accept what comes out of that
> > long tube.
> >
> > Take the contentious problem of outlier identification. If I just let
> > the black box peel off some values, I don't know what I have lost. On
> > the other hand, if I import data and examine it with a summary
> > function, I may find that one woman has a height of 5.2 meters. I can
> > range check by looking up the Guinness Book of Records. It's an
> > outlier. I can estimate the probability of such a height.  Hmm, about
> > 4 standard deviations above the mean. It's an outlier. I can attempt a
> > Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> > has been recorded as a metric value". It's not an outlier.
> >
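
The three checks described above can be sketched in a few lines of base
R. This is only an illustration: the `height_m` values are invented, the
`plausible` range is an assumed bound rather than a looked-up record, and
the feet-and-inches reading is just the hypothesis from the example.

```r
# Invented heights in metres; 5.2 stands in for the suspicious record.
height_m <- c(1.58, 1.63, 1.70, 1.55, 5.2, 1.66, 1.61)

# Step 1: a summary makes the maximum stand out.
summary(height_m)

# Step 2: range check against an assumed plausible range for adults.
plausible <- c(0.5, 2.8)
is_suspect <- height_m < plausible[1] | height_m > plausible[2]
suspect <- which(is_suspect)

# Step 3: how far is the suspect value from the remaining values,
# measured in their standard deviations?
ok <- height_m[!is_suspect]
z <- (height_m[is_suspect] - mean(ok)) / sd(ok)

# Step 4: test the hypothesis that 5.2 was really 5 feet 2 inches.
feet <- floor(height_m[is_suspect])
inches <- round((height_m[is_suspect] - feet) * 10)
as_metres <- feet * 0.3048 + inches * 0.0254
```

If `as_metres` falls inside the plausible range, the unit mix-up is a
reasonable explanation and the value can be corrected rather than dropped.
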
> > The more R gravitates toward "black box" functions, the more some
> > users are encouraged to let them do the work. You pays your money and
> > you takes your chances.
> >
> > Jim
> >
> >
> > On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins <iwritecode2 at gmail.com> wrote:
> >> R has a very wide audience, clinical research, astronomy, psychology, and
> >> so on and so on.
> >> I would consider data analysis work to be three stages: data preparation,
> >> statistical analysis, and producing the report.
> >> This regards the process of getting the data ready for analysis and
> >> reporting, sometimes called "data cleaning" or "data munging" or "data
> >> wrangling".
> >>
> >> So as regards tools for data preparation, speaking to the highly diverse
> >> audience mentioned, here is my question:
> >>
> >> What do you want?
> >> Or are you already quite happy with the range of tools that is currently
> >> before you?
> >>
> >> [BTW,  I posed the same question last week to the r-devel list, and was
> >> advised that r-help might be a more suitable audience by one of the
> >> moderators.]
> >>
> >> Robert Wilkins
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
>
