[R] Filtering an Entire Dataset based on Several Conditions
Rui Barradas
ruipbarradas using sapo.pt
Mon May 9 19:54:31 CEST 2022
Hello,
My code seems to work with your data, except that the first column is
not to be scaled.
# file names
xlsfile <- file.path("~/dados", "trainFeatures42k.xls")
csvfile <- file.path("~/dados", "Normalized_Data.csv")
# read in the data files
df1 <- readxl::read_excel(xlsfile, col_names = FALSE)
df2 <- read.csv(csvfile)
# assign names to make all.equal happy
names(df1) <- sprintf("X%d", seq_len(ncol(df1)))
names(df2) <- sprintf("X%d", seq_len(ncol(df2)))
# the first column is not to be scaled
df1_norm <- scale(df1[-1])
# compare to the already scaled data from the Google Drive
# the data.frames are equal up to floating-point precision
identical(df2[-1], as.data.frame(df1_norm))
#[1] FALSE
all.equal(df2[-1], as.data.frame(df1_norm))
#[1] TRUE
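As a side note, the gap between identical() and all.equal() can be reproduced on two plain numbers. This is only a minimal sketch; x, y and same are names made up for the illustration and are not part of your data.

```r
# two ways of writing "the same" number that differ in the last bits
x <- 0.1 + 0.2
y <- 0.3

identical(x, y)   # exact comparison of the stored values
all.equal(x, y)   # comparison up to a small numeric tolerance

# all.equal() returns a character description when the objects differ,
# so wrap it in isTRUE() whenever a single logical value is needed
same <- isTRUE(all.equal(x, y))
```

The same pattern, isTRUE(all.equal(...)), is what I would use if the comparison of the two data.frames had to go inside an if().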
# see if all values in each row are between -3 and 3
i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)
# filter and return a data.frame
df1_clean <- as.data.frame(df1_norm[i,])
dim(df1_clean)
#[1] 32494 60
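If it helps to check the logic on something small, here is a sketch on a made-up data set. Everything in it (toy, z, keep, keep2, toy_clean and the values themselves) is hypothetical and only stands in for your files. It also shows an equivalent test, abs(z) < 3, and one possible way of putting the unscaled first column back next to the filtered rows, assuming that is what you want in the end.

```r
# hypothetical toy data: an id column that must not be scaled,
# plus two numeric columns that are already z-scores
toy <- data.frame(
  id = c("a", "b", "c", "d"),
  z1 = c(0.5, -3.2, 1.0, 2.9),
  z2 = c(-1.0, 0.2, 3.5, -2.9)
)
z <- as.matrix(toy[-1])

# abs(z) < 3 is the same test as z > -3 & z < 3, applied to the whole matrix
keep <- rowSums(abs(z) < 3) == ncol(z)

# the apply() version used above gives the same index
keep2 <- rowSums(apply(z, 2, \(x) x > -3 & x < 3)) == ncol(z)

# put the untouched first column back next to the filtered values
toy_clean <- cbind(toy[keep, 1, drop = FALSE],
                   as.data.frame(z[keep, , drop = FALSE]))
```

Only rows "a" and "d" should survive here, since row "b" has -3.2 and row "c" has 3.5.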
See if you get the same results.
Hope this helps,
Rui Barradas
At 17:44 on 09/05/2022, Paul Bernal wrote:
> Dear Rui,
>
> I was trying to dput() the datasets I am working on, but since they are a
> bit large (42,000 rows by 60 columns) I couldn't retrieve the whole
> structure of the data to include it here, so I am attaching a couple of
> files. One is the raw data (called trainFeatures42k), which is the data
> I need to normalize, and the other is Normalized_Data, which is the
> normalized data (or at least I think I managed to normalize it).
>
> Normalized_Data.csv
> <https://drive.google.com/file/d/143I1O710gAqWjzx48Gt1bwUbrG0mbpfa/view?usp=drive_web>
> trainFeatures42k.xls
> <https://drive.google.com/file/d/1deMzGMkJyeVsnRzTKirmm4VqIBRzbvzV/view?usp=drive_web>
>
> I have tried some of the code you and other friends from the community
> have kindly shared, but have not been able to filter values > -3 and < 3.
>
> Thank you all for your valuable help always.
> Best,
> Paul
>
> On Mon, 9 May 2022 at 4:22, Rui Barradas (<ruipbarradas using sapo.pt>) wrote:
>
> Hello,
>
> Something like this?
> First normalize the data.
> Then an apply loop creates a logical matrix indicating which values are in
> the range -3 to 3.
> If they are all TRUE in a row, then that row's sum is equal to the number of
> columns. This creates a logical index i.
> Use that index i to subset the scaled data set.
>
> # test data set, remove the Species column (not numeric)
> df1 <- iris[-5]
>
> df1_norm <- scale(df1)
> i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)
>
> # returns a matrix
> df1_norm[i, ]
>
> # returns a data.frame
> as.data.frame(df1_norm[i,])
>
>
> Hope this helps,
>
> Rui Barradas
>
> At 09:23 on 09/05/2022, Paul Bernal wrote:
> > Dear friends,
> >
> > I have a dataframe in which every single (i,j) entry (i standing for the
> > ith row, j for the jth column) has been normalized (converted to z-scores).
> >
> > Now I want to filter or subset the dataframe so that I end up with a
> > dataframe containing only entries greater than -3 and less than 3.
> >
> > How could I accomplish this?
> >
> > Best,
> > Paul