[R] R interpreting numeric field as a boolean field

Bert Gunter bgunter.4567 at gmail.com
Tue Jan 30 20:19:03 CET 2024


Incidentally, "didn't work" is not very useful information. Please tell us
exactly what error message or apparently aberrant result you received.
Also, what do you get from:

sapply(your_dataframe, "class")
nrow(your_dataframe)

(as I suspect what you think it is, isn't).

Cheers,
Bert

On Tue, Jan 30, 2024 at 11:01 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:

> "I cannot change the data type from
> boolean to numeric. I tried doing dataset$my_field =
> as.numeric(dataset$my_field), I also tried to do dataset <-
> dataset[complete.cases(dataset), ], didn't work either. "
>
> Sorry, but all I can say is: huh?
>
> > dt <- data.frame(a = c(NA,NA, FALSE, TRUE), b = 1:4)
> > dt
>       a b
> 1    NA 1
> 2    NA 2
> 3 FALSE 3
> 4  TRUE 4
> > sapply(dt, class)
>         a         b
> "logical" "integer"
> > dt$a <- as.numeric(dt$a)
> > dt
>    a b
> 1 NA 1
> 2 NA 2
> 3  0 3
> 4  1 4
> > sapply(dt, class)
>         a         b
> "numeric" "integer"
>
> So either I'm missing something or you are. Happy to be corrected and
> chastised if the former.
>
> Cheers,
> Bert
>
>
> On Tue, Jan 30, 2024 at 10:41 AM Paul Bernal <paulbernal07 at gmail.com>
> wrote:
>
>> Dear friend Duncan,
>>
>> Thank you so much for your kind reply. Yes, that is exactly what is
>> happening: there are a lot of NA values at the start, so R assumes that
>> the field is of type logical (boolean). The challenge I am facing is
>> that I want to read into R an Excel file with many sheets (46 in this
>> case) and combine all 46 sheets into a single dataframe (since the
>> columns are exactly the same for all 46 sheets). The rio package does
>> this nicely; the problem is that, once I have the full dataframe
>> (roughly 2.98 million rows total), I cannot change the data type from
>> boolean to numeric. I tried doing dataset$my_field =
>> as.numeric(dataset$my_field), I also tried to do dataset <-
>> dataset[complete.cases(dataset), ], didn't work either.
>>
>> The only thing that worked for me was to take a single sheet and, via
>> the read_excel function, set the guess_max parameter to a sufficiently
>> large number (a number >= the total row count of the full merged
>> dataset). I want to automate the merging of the N Excel sheets so that
>> I don't have to do it manually, unless there is a way to accomplish
>> something similar to what rio's import_list function does while keeping
>> the field's numeric data type.
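>> A minimal sketch of how that automation might look (a hypothetical
>> helper, assuming the readxl package is installed and that guess_max is
>> set to at least the total row count of the merged data):

```r
# Hypothetical helper: read every sheet of a workbook and row-bind them.
# Assumes all sheets share exactly the same columns.  guess_max makes
# readxl scan that many rows before guessing each column's type, so a
# long run of leading NAs no longer causes a numeric column to be read
# as logical.
read_all_sheets <- function(path, guess_max = 3e6) {
  sheets <- readxl::excel_sheets(path)
  frames <- lapply(sheets, function(s) {
    readxl::read_excel(path, sheet = s, guess_max = guess_max)
  })
  do.call(rbind, frames)  # same columns in every sheet, so rbind works
}
```

>> Usage would then be figidat <- read_all_sheets(path), with no manual
>> per-sheet work.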
>>
>> Cheers,
>> Paul
>>
>> On Tue, Jan 30, 2024 at 12:23, Duncan Murdoch
>> (<murdoch.duncan at gmail.com>) wrote:
>>
>> > On 30/01/2024 11:10 a.m., Paul Bernal wrote:
>> > > Dear friends,
>> > >
>> > > Hope you are doing well. I am currently using R version 4.3.2, and
>> > > I have an .xlsx file that has 46 sheets in it. I combined all 46
>> > > sheets and read them as a single dataframe in R using the rio
>> > > package.
>> > >
>> > > I read a solution using the readxl package, as suggested in a
>> > > StackOverflow discussion, as follows:
>> > > df <- read_excel(path = filepath, sheet = sheet_name, guess_max = 100000)
>> > > Now, when you have so many sheets (46 in my case) in an Excel file,
>> > > the rio methodology is more practical.
>> > >
>> > > This is what I did:
>> > > path =
>> > > "C:/Users/myuser/Documents/DataScienceF/Forecast_and_Econometric_Analysis_FIGI (4).xlsx"
>> > > figidat = import_list(path, rbind = TRUE) # here figidat refers to my dataset
>> > >
>> > > Now, it successfully imports and merges all records; however, R
>> > > interprets some fields as boolean despite their being numeric.
>> > >
>> > > Here is the structure of the field that is causing me problems (I
>> > > apologize for the length):
>> > > structure(list(StoreCharges = c(NA, NA, NA, NA, NA, NA, NA, NA,
>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>> > ...
>> > > FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, NA, NA,
>> > > FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>> > > FALSE, FALSE, FALSE)), class = c("tbl_df", "tbl", "data.frame"
>> > > ), row.names = c(NA, -7033L))
>> > >
>> > > As you can see, when I do the dput, it gives me a bunch of TRUE and
>> > > FALSE values, when in reality I have records with value $0, records
>> > > with amounts > $0, and also a bunch of blank records.
>> > >
>> > > Any help will be greatly appreciated.
>> >
>> > I don't know how read_excel() determines column types, but some
>> > functions look only at the first n rows to guess the type.  It appears
>> > you have a lot of NA values at the start.  That is a logical value, so
>> > that might be what is going wrong.
>> >
>> > In read.table() and related functions, you can specify the column
>> > types explicitly.  It sounds as though that's what you should do, if
>> > read_excel() offers that as a possibility.
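>> > For what it's worth, read_excel() does offer this via its col_types
>> > argument; a minimal sketch (the column types listed here are
>> > hypothetical and must match the sheet's actual columns, in order):

```r
# Sketch only: force column types instead of letting readxl guess them
# from the first rows.  Assumes readxl is installed; the two-column
# layout ("text" id, then "numeric" StoreCharges) is hypothetical.
read_typed <- function(path, sheet) {
  readxl::read_excel(
    path,
    sheet = sheet,
    col_types = c("text", "numeric")
  )
}
```

>> > Alternatively, raising guess_max past the run of leading NAs also
>> > works, at the cost of scanning more rows before the types are chosen.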
>> >
>> > Duncan Murdoch
>> >
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
