[R] Need fresh eyes to see what I'm missing

Tue Sep 14 18:41:38 CEST 2021

Rich,

I have to wonder about how your data was placed in the CSV file based on
what you report.

Functions like read.table() (which is called by read.csv()) ultimately make
guesses about how many columns to expect and what the contents are likely to
be. The column count is guessed from the first five lines, and each column
gets the most compatible type for its values, so a single entry that does
not parse as a number leaves the whole column as character. The fact that it
shows this:

'data.frame':	565675 obs. of  6 variables:
  $ year : chr  "2016" "2016" "2016" "2016" ...
  $ month: int  3 3 3 3 3 3 3 3 3 3 ...
  $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
  $ hour : chr  "12" "12" "12" "12" ...
  $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
  $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

is odd. It suggests that somewhere in the data, year holds an entry that is
not a plain integer like 2016, for example an unquoted word like `missing`.

Something similar seems to have happened with hour and fps but not the rest.
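
To illustrate, here is a minimal sketch using made-up inline data rather than
your file. The word "missing" is only my assumption about what a bad entry
might look like. It shows one such entry turning a whole column into chr, and
one way to locate the offending rows:

```r
# Hypothetical inline data; the "missing" entries are an assumption made
# only to reproduce the symptom, not something known about vel.dat.
csv_text <- "year,month,fps
2016,3,1.74
2016,3,1.75
missing,3,1.76
2016,3,missing"

demo <- read.csv(text = csv_text)
str(demo)  # year and fps come back as chr, month as int

# Row numbers of the entries that do not convert to a number.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
bad_year <- which(is.na(suppressWarnings(as.integer(demo$year))))
bad_fps  <- which(is.na(suppressWarnings(as.numeric(demo$fps))))

# Show the offending rows as they were read.
demo[union(bad_year, bad_fps), ]
```

Running the same which(is.na(...)) check on vel$year, vel$hour and vel$fps
should point at the rows to look at in the file.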

Nonetheless, you did convert back to what you wanted, BUT if a single
anomalous entry remains, then as.integer("missing") returns NA (with a
warning) and as.double("missing") does too. So it is wise to check for
unexpected NA values after converting. If the source cannot be changed, the
R program can filter such cases out of your data.frame in various ways.
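
One possible way to do that check and filter, sketched on a small made-up
data frame (vel_demo and its contents are hypothetical; only the column names
come from your str() output):

```r
# Hypothetical stand-in for the real data frame.
vel_demo <- data.frame(
  year = c("2016", "2016", "missing"),
  fps  = c("1.74", "missing", "1.76"),
  stringsAsFactors = FALSE
)

# Convert, deliberately silencing the "NAs introduced by coercion" warning.
vel_demo$year <- suppressWarnings(as.integer(vel_demo$year))
vel_demo$fps  <- suppressWarnings(as.double(vel_demo$fps))

# Count the NA values in each column after conversion.
na_counts <- colSums(is.na(vel_demo))
na_counts

# Keep only the rows in which every column converted cleanly.
vel_clean <- vel_demo[complete.cases(vel_demo), ]
```

Note that complete.cases() also drops rows that were legitimately NA in the
source, so it is worth looking at the counts before filtering.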

Your way of reading the CSV in was this:

vel <- read.csv('../data/water/vel.dat', header = TRUE, sep = ',',
stringsAsFactors = FALSE)

The options you added, header = TRUE and sep = ",", are already the defaults
for read.csv(), so they are harmless. Since R 4.0.0 the default is also not
to read strings in as factors. But the arguments you did not include may be
worth a look, given that your data may be a bit off.

Without the underlying file, we cannot easily diagnose what may be wrong
in it. Do you get any error messages when reading in the file? You can
specify additional arguments to read.csv() saying what quoting characters,
if any, are used, what strings should be recognized as NA, what type each
column should be assumed to be, what to do with blank lines, what a comment
looks like, and so on.
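
A sketch of such a call, again on made-up inline data. The "missing" marker
and the "#" comment character are assumptions on my part, so substitute
whatever your file actually uses:

```r
# Hypothetical inline data with a blank line, a comment line and two
# "missing" entries; none of this is known to be in vel.dat.
csv_text <- "year,month,day,hour,min,fps
2016,3,3,12,0,1.74

# a comment line
2016,3,3,missing,10,missing"

vel_demo <- read.csv(
  text = csv_text,
  quote = "\"",                     # quoting character used in the file
  na.strings = c("NA", "missing"),  # strings to read as NA
  colClasses = c("integer", "integer", "integer",
                 "integer", "integer", "numeric"),
  blank.lines.skip = TRUE,          # ignore empty lines
  comment.char = "#"                # treat text after # as a comment
)
str(vel_demo)
```

With colClasses given, an entry that is neither a number nor one of the
na.strings stops the read with an error instead of silently producing a chr
column, which helps pin down where the problem is.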

One thing I have sometimes had to do is open the original CSV file in Excel
and examine it in various ways, or even change it and save it again. That is
beyond the scope of this mailing list, so if needed, ask me in private. You
have been working on this kind of thing, but I assume often with tools
outside R and dplyr.
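
A rough version of that kind of inspection can also be done without leaving
R. This is only a sketch on a temporary made-up file standing in for yours:

```r
# Hypothetical file written to a temporary path, standing in for vel.dat.
tmp <- tempfile(fileext = ".csv")
writeLines(c("year,month,fps", "2016,3,1.74", "2016,3", "2016,3,1.76"), tmp)

# Number of comma-separated fields on each line.
n_fields <- count.fields(tmp, sep = ",")
table(n_fields)  # a well-formed file shows a single field count

# Raw text of the lines whose field count differs from the header's.
# The indices line up with readLines() only if the file has no blank lines.
raw_lines <- readLines(tmp)
odd <- which(n_fields != n_fields[1])
raw_lines[odd]
```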

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Rich Shepard
Sent: Tuesday, September 14, 2021 11:49 AM
To: R mailing list <r-help using r-project.org>
Subject: Re: [R] Need fresh eyes to see what I'm missing

On Tue, 14 Sep 2021, Bert Gunter wrote:

> Remove all your as.integer() and as.double() coercions. They are 
> unnecessary (unless you are preparing input for C code; also, all R 
> non-integers are double precision) and may be the source of your problems.

Bert,

When I remove coercions the script produces warnings like this:
1: In mean.default(fps, na.rm = TRUE) :
   argument is not numeric or logical: returning NA

and str(vel) displays this:
'data.frame':	565675 obs. of  6 variables:
  $ year : chr  "2016" "2016" "2016" "2016" ...
  $ month: int  3 3 3 3 3 3 3 3 3 3 ...
  $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
  $ hour : chr  "12" "12" "12" "12" ...
  $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
  $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

so month, day, and min are recognized as integers but year, hour, and fps
are seen as characters. I don't understand why.

Regards,

Rich
