[R] Filtering an Entire Dataset based on Several Conditions

Mon May 9 19:54:31 CEST 2022

Hello,

My code seems to work with your data, except that the first column is 
not to be scaled.

# file names
xlsfile <- file.path("~/dados", "trainFeatures42k.xls")
csvfile <- file.path("~/dados", "Normalized_Data.csv")
# read in the data files
df1 <- readxl::read_excel(xlsfile, col_names = FALSE)
df2 <- read.csv(csvfile)
# assign names to make all.equal happy
names(df1) <- sprintf("X%d", seq_len(ncol(df1)))
names(df2) <- sprintf("X%d", seq_len(ncol(df2)))

# the first column is not to be scaled
df1_norm <- scale(df1[-1])
# compare to the already scaled data from the Google Drive
# the data.frames are equal up to floating-point precision
identical(df2[-1], as.data.frame(df1_norm))
#[1] FALSE
all.equal(df2[-1], as.data.frame(df1_norm))
#[1] TRUE
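
As a side note on why identical() and all.equal() disagree above: identical() demands bit-for-bit equality, while all.equal() allows a small numeric tolerance. A likely cause here (an assumption, I have not checked the files digit by digit) is that the values lost a few trailing digits on their way through the CSV file. A minimal sketch with made-up numbers, not the real data:

```r
# two values that are mathematically equal but differ in the last bits
x <- 0.1 + 0.2
y <- 0.3
identical(x, y)            # exact comparison
isTRUE(all.equal(x, y))    # comparison within a tolerance

# the same thing happens column-wise in data.frames
# (a and b are hypothetical stand-ins for df1_norm and df2)
a <- data.frame(X1 = c(0.1 + 0.2, 1))
b <- data.frame(X1 = c(0.3, 1))
identical(a, b)
isTRUE(all.equal(a, b))
```

Wrapping all.equal() in isTRUE() is the usual idiom, because all.equal() returns a character description of the differences rather than FALSE when the objects differ.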

# see if all values in each row are between -3 and 3
i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)

# filter and return a data.frame
df1_clean <- as.data.frame(df1_norm[i,])
dim(df1_clean)
#[1] 32494    60

See if you get the same results.
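
Note that df1_clean above holds only the scaled columns, the first column is dropped. If you also want to carry the unscaled first column along, something like the following might do. It is only a sketch on made-up data: toy, id, a, b, keep and toy_clean are hypothetical names, not part of your files. It also uses abs() as a shorter way to write the same "between -3 and 3" test.

```r
# made-up data: an id column that must not be scaled, plus two features
set.seed(2022)
toy <- data.frame(id = 1:200, a = rnorm(200), b = rnorm(200))
toy$a[5] <- 10   # plant an obvious outlier in row 5

# scale everything except the first column
toy_norm <- scale(toy[-1])

# TRUE for the rows where every z-score is strictly between -3 and 3
keep <- rowSums(abs(toy_norm) < 3) == ncol(toy_norm)

# put the unscaled first column back next to the filtered z-scores
toy_clean <- data.frame(id = toy$id[keep], toy_norm[keep, , drop = FALSE])
dim(toy_clean)
```

The drop = FALSE is there so that the subset stays a matrix even if only one row survives the filter.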

Hope this helps,

Rui Barradas

At 17:44 on 09/05/2022, Paul Bernal wrote:
> Dear Rui,
> 
> I was trying to dput() the datasets I am working on, but since they are a 
> bit large (42,000 rows by 60 columns) I couldn't capture the whole 
> structure of the data to include it here, so I am attaching a couple of 
> files. One is the raw data (called trainFeatures42k), which is the data 
> I need to normalize, and the other is Normalized_Data, which is the 
> normalized data (or at least I think I managed to normalize it).
> 
> Normalized_Data.csv 
> <https://drive.google.com/file/d/143I1O710gAqWjzx48Gt1bwUbrG0mbpfa/view?usp=drive_web>
> trainFeatures42k.xls 
> <https://drive.google.com/file/d/1deMzGMkJyeVsnRzTKirmm4VqIBRzbvzV/view?usp=drive_web>
> 
> I have tried some of the code you and other friends from the community 
> have kindly shared, but have not been able to filter values > -3 and < 3.
> 
> Thank you all for your valuable help always.
> Best,
> Paul
> 
> On Mon, 9 May 2022 at 4:22, Rui Barradas (<ruipbarradas using sapo.pt 
> <mailto:ruipbarradas using sapo.pt>>) wrote:
> 
>     Hello,
> 
>     Something like this?
>     First normalize the data.
>     Then an apply() loop creates a logical matrix telling which numbers
>     are in the range -3 to 3.
>     If they are all TRUE then their sum by rows is equal to the number of
>     columns. This creates a logical index i.
>     Use that index i to subset the scaled data set.
> 
>     # test data set, remove the Species column (not numeric)
>     df1 <- iris[-5]
> 
>     df1_norm <- scale(df1)
>     i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)
> 
>     # returns a matrix
>     df1_norm[i, ]
> 
>     # returns a data.frame
>     as.data.frame(df1_norm[i,])
> 
> 
>     Hope this helps,
> 
>     Rui Barradas
> 
>     At 09:23 on 09/05/2022, Paul Bernal wrote:
>      > Dear friends,
>      >
>      > I have a dataframe in which every single (i,j) entry (i standing for
>      > the ith row, j for the jth column) has been normalized (converted to
>      > z-scores).
>      >
>      > Now I want to filter or subset the dataframe so that I end up with a
>      > dataframe containing only entries greater than -3 and less than 3.
>      >
>      > How could I accomplish this?
>      >
>      > Best,
>      > Paul