[R] Row exclude

Val v@|kremk @end|ng |rom gm@||@com
Sun Jan 30 01:32:00 CET 2022

Hi  all,
Thank you so much for the useful help and many options that you gave me.
Sorry for the delayed response; I was away for a while.

On Sat, Jan 29, 2022 at 3:35 PM Avi Gross via R-help <r-help using r-project.org>
wrote:

> Rui has indeed improved my first attempt in several ways so my comments
> are now focused on another level. There is seemingly endless discussion
> here about what is base R. Questions as well as Answers that go beyond base
> R are often challenged and I understand why, even if I personally don't
> worry about it.
> As I see it, R has many levels, like many modern programming languages,
> and some are built-in by default, while others are add-ons of various kinds
> and some are now seen as more commonly used than others. Some here, and NOT
> ME, seem particularly annoyed by the concept of the tidyverse existing or
> the corporate nature of RSTUDIO. I say, the more the better as long as they
> are well-designed and robust and efficient enough.
> There are many ways you can use R in simple mode to the point where you do
> not even use vectors as intended but use loops to say add corresponding
> entries in two vectors one item at a time using an index, as you might do
> with earlier languages. That is perfectly valid in R, albeit not using the
> language as intended as A+B in R does that for you fairly trivially, albeit
> hiding a kind of loop being done behind the scenes. But if the two vectors
> are not the same length, it can lead to subtle errors if it recycles or
> broadcasts the shorter one as needed UNLESS that was intended.
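
A minimal sketch of the point about loops, vectorised addition and recycling. The vectors here are made up purely for illustration and are not from the thread.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Element-by-element loop, as one might write in an older language.
loop_sum <- numeric(length(a))
for (k in seq_along(a)) {
  loop_sum[k] <- a[k] + b[k]
}

# Vectorised form: same result, no explicit index.
vec_sum <- a + b
same <- identical(loop_sum, vec_sum)

# Length 2 divides length 4, so the shorter vector is recycled silently.
short <- c(100, 200)
recycled <- a + short

# Length 3 does not divide length 4: R still recycles, but issues a warning.
odd <- c(100, 200, 300)
warned <- tryCatch({ a + odd; FALSE }, warning = function(w) TRUE)
```

The silent case is the one that tends to hide mistakes, so checking length(a) == length(b) before adding is a cheap safeguard when recycling is not intended.
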
> Like many languages, R has additional modes of a sort. It is very loosely
> Object-Oriented and some solutions to problems may make use of that or
> other features not always found in other languages such as being able to
> attach attributes of arbitrary nature to things. But someone taking a
> beginner course in R, or just using it in simple ways, generally does not
> know or care and being given a possible solution like that may not be very
> helpful.
> R is fully a functional programming language and experienced users, like
> Rui clearly is, can make serious use of many paradigms like map/reduce to
> create what often are quite abstract solutions that can be tailored to do
> all kinds of things by simply changing the functions invoked or in this
> case also the data invoked. I was tempted to use a variant of his solution
> using the pmap() function that I am familiar with but it is not base R, but
> part of the "purrr" package which is in the not-appreciated-here package of
> packages called the tidyverse, LOL!
> pmap() can take an arbitrary data.frame and look at it one row at a time and
> apply a function that sees all the columns. That function can be written so
> it applies your logic to each column entry for that row that you wish and
> combines the calculations to return something like TRUE/FALSE. In this
> case, it could be code connecting use of a regular expression on each
> column entry combined by the usual logical connectives like AND and NOT
> (using R notation) to return a TRUE or FALSE that pmap then combines into a
> vector and you use that to index the data.frame to keep only valid rows.
> BUT, I reconsidered using it here as it is a tad advanced and not pure R.
> Nor do I claim it is better than what Rui and others could come up with. It
> is just not as simple as the case we are looking at.
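
For readers who want to stay in base R, here is a rough analogue of the row-at-a-time idea described above. It is only a sketch: the data, the column names and the helper row_ok() are hypothetical, and it uses mapply() rather than purrr's pmap().

```r
# Hypothetical data, all columns read as character.
dat <- data.frame(Name = c("Ann", "Bob7", "Cy"),
                  Age  = c("31", "25", "4o"),
                  stringsAsFactors = FALSE)

# Hypothetical per-row check: sees every column of one row and
# returns a single TRUE/FALSE.
row_ok <- function(Name, Age) {
  !grepl("[[:digit:]]", Name) && !grepl("[[:alpha:]]", Age)
}

# Pass each column as a named argument so row_ok() is called once per row.
keep <- do.call(mapply, c(list(FUN = row_ok), dat, list(USE.NAMES = FALSE)))

# Index the data.frame with the logical vector to keep only valid rows.
valid <- dat[keep, ]
```

Since grepl() is already vectorised, the row-wise call is not needed for this particular check; it only pays off when the per-row logic cannot easily be expressed column by column.
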
> R has another facet that needs to be used carefully that significantly
> alters some approaches as compared to a language like Python which has a
> much nicer object-oriented set of tools but does not have some of the
> delayed evaluation R supports and that sometimes get in the way as some
> people expect them to be evaluated sooner, or at all. I see strengths and
> weaknesses and try to use a language suited for my needs that also uses it
> mostly as intended.
> I also ask if we have met the needs of the person who asked this question.
> If they do not reply and merely REPOST the same question with a shorter
> subject-line, then I suggest we all wasted our time trying. Proper
> etiquette, I might think, is to reply to some work shown by others IN PUBLIC
> and especially to explain anything being asked by us and to let us know
> what worked for them or met their needs or show a portion of what code they
> finally implemented. Some of that may yet happen, but can anyone blame me
> for being a tad suspicious this time?
> I tend to be interested in deeper discussions and many are outside the
> scope of this forum. So I acknowledge that discussing alternate methods
> including more abstract ones using functional programming or other tricks,
> is a bit outside what is expected here.
> I want though to add one more idea. Can we agree that the user may have a
> more general concept to be considered here. That is the concept of having a
> data.frame where each column is purely numeric consisting of just 0 through
> 9 with perhaps no spaces, periods or commas or anything extraneous, OR
> purely alphabetic with no numerals allowed, and alphabetic in the same
> sense as Rui uses. Mind you, I do not see any reason for always using the
> current locale for something like names of people that may well be written
> with characters from another locale. I would think any string with all
> non-numeric characters might be allowed for the purpose.
> Can you write a function that accepts only pure cells that either have no
> numerals or no alphabeticals but not a mixture? The same function can
> initially be applied to all columns of a data.frame that only is supposed
> to contain columns of one kind or the other but not combinations. You might
> begin by reading in all data as character mode with perhaps extraneous
> white space stripped. You then apply the above function to identify any
> rows that contain a mixed alphanumeric item and eliminate all such rows.
> For consistency, you might examine the resulting data.frame and try to
> convert all columns to numeric. Any that fail conversion attempts are left
> as character but may possibly have anomalies like one or more alphabetic
> items mixed into an otherwise numeric set of entries. That might require
> another filter run per column to identify those and either remove more rows
> or replace the bad ones with NA or a default like 0 in what is then
> convertible to a numeric column.
> But my thought was that it is more complex to design something (as Rui
> did) that takes a list of intended column types, or a function that knows
> how to deal with each, as compared to an all-purpose function that just
> insists on purity at a local level and is a simpler program to write.
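
One possible reading of the "purity" idea above, sketched in base R. The function name is_mixed(), the sample data and the exact definition of "mixed" (at least one letter and at least one digit in the same cell) are assumptions made for illustration, not something specified in the thread.

```r
# Hypothetical helper: TRUE for cells containing both a letter and a digit.
is_mixed <- function(x) grepl("[[:alpha:]]", x) & grepl("[[:digit:]]", x)

# Hypothetical data, read as character, with stray white space in one cell.
raw <- data.frame(Name = c("Ann", "Bob", "Jack3", "Dee"),
                  Age  = c("31", "2x", "40", " 52 "),
                  stringsAsFactors = FALSE)

# Strip surrounding white space in every column.
raw[] <- lapply(raw, trimws)

# One logical column per data column; drop any row with a mixed cell.
mixed <- vapply(raw, is_mixed, logical(nrow(raw)))
clean <- raw[rowSums(mixed) == 0L, , drop = FALSE]

# Convert a column to numeric only if every entry converts cleanly.
clean[] <- lapply(clean, function(col) {
  num <- suppressWarnings(as.numeric(col))
  if (anyNA(num)) col else num
})
```

A column that is left as character after this step would be the place to run the extra per-column filter mentioned above.
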
> Avious
> -----Original Message-----
> From: Rui Barradas <ruipbarradas using sapo.pt>
> To: Avi Gross <avigross using verizon.net>; dcarlson using tamu.edu <dcarlson using tamu.edu>;
> bgunter.4567 using gmail.com <bgunter.4567 using gmail.com>
> Cc: r-help using r-project.org <r-help using r-project.org>
> Sent: Sat, Jan 29, 2022 1:33 pm
> Subject: Re: [R] Row exclude
> Hello,
> Thanks for the comments, a few others inline.
> Às 18:04 de 29/01/2022, Avi Gross escreveu:
> > There are many creative ways to solve problems and some may get you in
> > trouble if you present them in class while even in some work situations,
> > they may be hard for most to understand, let alone maintain and make
> > changes.
> >
> > This group is amorphous enough that we have people who want "help" who
> > are new to the language, but also people who know plenty and encounter a
> > new kind of problem, and of course people who want to make use of what
> > they see as free labor.
> >
> > Rui presented a very interesting idea and I like some aspects. But if
> > presented to most people, they might have to start looking up things.
> >
> > But I admit I liked some of the ideas he uses and am adding them to my
> > bag of tricks. Some were overkill for this particular requirement but
> > that also makes them more general and useful.
> >
> > First, was the use of locale-independent regular expressions like
> > [[:alpha:]] that match any combination of [:lower:] and [:upper:] and
> > thus are not restricted to ASCII characters. Since I do lots of my
> > activities in languages other than English and well might include names
> > with characters not normally found in English, or not even using an
> > overlapping  alphabet, I can easily encounter items in the Name column
> > that might not match [A-Za-z] but will match with [:alpha:].
> >
> > I don't know if using [:digit:] has benefits over [0-9] and I do note
> > there was no requirement to match more complex numbers than integers so
> > no need to allow periods or scientific notation and so on.
> Yes, I used locale-independent regular expressions. It's a habit I
> acquired a while ago. It took some time to stop using character ranges
> but once gone I'm more comfortable with the use of classes like
> [:alpha:] and [:digit:].
> [After all, my native language (Portuguese) has
> cedillas(ES)/cedilhas(PT) and accented letters].
> >
> > Then there is the use of mapply. The more general version of the problem
> > presented would include a data.frame with any number of columns, where a
> > subset of the columns might need to be checked for conditions that vary
> > across the columns but may include some broad categories of conditions
> > that might be re-used. If all the conditions are regular expression
> > matches you can build, then you can extend the list Rui used to have
> > more items and also include expressions that always match so that some
> > columns are effectively ignored:
> >
> >
> > regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]", "[.*]")
> >
> >
> > So this generalizes to N columns as long as you supply exactly N
> > patterns in the list, albeit mapply does recycle arguments if needed as
> > in the simplest case where you want all columns checked the same way.
> >
> > Rui then uses an anonymous function to pass to mapply() and that is a
> > newish feature added recently to R, I think. It was perhaps meant
> > specifically to be used with the new pipe symbol, but can be used
> > anywhere but perhaps not in older versions of R.
> >
> >
> > \(x, r) grepl(r, x)
> >
> No, the new anonymous function wasn't specifically meant to be used with
> the new pipe operator, it was meant to be a short-hand notation for
> anonymous functions and used interchangeably with the old notation.
> mapply(\(x, r), etc)
> mapply(function(x, r) etc)
> >
> > I note Rui also uses grepl() which returns a logical vector. I will show
> > my first attempt at the end where I used grep() to return index numbers
> > of matches instead. For this context, though, he made use of the fact
> > that mapply in this case returns a matrix of type logical:
> >
> > i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> >
> >> i
> >
> >        Name   Age Weight
> >
> > And since R treats TRUE as 1 and FALSE as 0, then summing the rows gives
> > you a small integer between 0 and the number of columns, inclusive, and
> > only rows with no TRUE in them are wanted for this purpose:
> And rowSums is a fast function.
> >
> >
> > dat1[rowSums(i) == 0L, ]
> >
> > All in all, nicely done, but not trivial to read without comments, LOL!
> >
> > And, yes, it could be made even more obscure as a one-liner.
> >
> > My first attempt was a bit more focused on the specific needs described.
> > I am not sure how the HTML destroyer in this mailing list might wreck
> > it, but I made it a two-statement version that is formatted on multiple
> > lines. An explanation first.
> >
> > I looked at using grep() on one column at a time to look for what should
> > NOT be there and ask it to invert the answer so it effectively tells me
> > which rows to keep. So it tests column 1 ($Name) to see if it has digits
> > in it and returns FALSE if it finds them which later means toss this
> > row. It returns TRUE if that entry, so far, makes the row valid. But
> > note since I am not using grepl() it does not return TRUE/FALSE at all.
> > Rather it returns index numbers of the ones that now inverted are TRUE.
> > What goes in is a vector of individual items from a column of the data.
> > What goes out is the indices of which ones I want to keep that can be
> > used to index the entire data.frame. Based on the sample data, it returns
> > 1:5 as row 6 has a digit in "Jack3".
> >
> >
> >    grep("[0-9]", dat1$Name, invert = TRUE)
> >
> >
> > Similarly, two other grep() statements test if the second and third
> > columns contain any characters in "[a-zA-Z]" and return a similar index
> > vector if they are OK.
> >
> > What I would then have are three numeric vectors, not a matrix. Each
> > contains a subset of all the indices:
> >
> >
> >> grep("[0-9]", dat1$Name, invert = TRUE)
> > [1] 1 2 3 4 5
> >> grep("[a-zA-Z]", dat1$Age, invert = TRUE)
> > [1] 1 2 3 5 6
> >> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
> > [1] 2 3 4 5 6
> >
> > This set of data was designed to toss out one of each column so they all
> > are of the same length but need not be. Like Rui, my condition for
> > deciding which rows to keep is that all three of the index vectors have
> > a particular entry. He summed them as logicals, but my choice has small
> > integers so the way I combine them to exclude any not in all three is to
> > use a sort of set intersect method. The one built-in to R only handles
> > two at a time so I nested two calls to intersect but in a more general
> > case, I would use some package (or build my own function) that handles
> > intersecting any number of such items.
> >
> > Here is the full code, minus the initialization.
> >
> >
> > rows.keep <-
> > intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
> >                      grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
> >            grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
> > result <- dat1[rows.keep,]
> >
> >
> Using the same idea, another two options, both with Reduce.
> The 1st uses Avi's grep() calls and regexes; the latter could be the character
> classes "[[:alpha:]]" and "[[:digit:]]", but this code is inspired by
> his. The results are put in a list and Reduce intersects the list
> members. Then subsetting is as usual.
> The 2nd uses the fact that Map is a wrapper for mapply that defaults to
> not simplifying its output. grep/invert will find the non-matches and
> Reduce intersects the result list, as above.
> From ?Map:
> Map is a simple wrapper to mapply which does not attempt to simplify the
> result, similar to Common Lisp's mapcar (with arguments being recycled,
> however). Future versions may allow some control of the result type.
> # 1st
> grep_list <- list(
>    grep("[0-9]", dat1$Name, invert = TRUE),
>    grep("[a-zA-Z]", dat1$Age, invert = TRUE),
>    grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
> )
> keep1 <- Reduce(intersect, grep_list)
> dat1[keep1,]
> # 2nd
> keep2 <- Map(\(x, r) grep(r, x, invert = TRUE), dat1, regex)
> keep2 <- Reduce(intersect, keep2)
> identical(keep1, keep2)
> #[1] TRUE
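
Since the original dat1 is not included in this message, here is a self-contained sketch for anyone who wants to run both approaches. The data frame is a hypothetical reconstruction: rows 2, 3 and 5 and the name "Jack3" come from the output shown in the thread, while the remaining values are invented so that rows 1, 4 and 6 each fail one check. It uses function(x, r) instead of \(x, r) so that it also runs on R versions before 4.1.

```r
# Hypothetical stand-in for dat1 (rows 1, 4 and parts of 6 are invented).
dat1 <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "Dave", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3y", "35", "40"),
  Weight = c("13X", "142", "120", "150", "160", "175"),
  stringsAsFactors = FALSE
)

# One "forbidden" pattern per column, in column order.
regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

# Approach 1: logical matrix of offending cells, keep rows with none.
i <- mapply(function(x, r) grepl(r, x), dat1, regex)
by_rowsums <- dat1[rowSums(i) == 0L, ]

# Approach 2: indices of clean entries per column, intersected across columns.
keep <- Reduce(intersect,
               Map(function(x, r) grep(r, x, invert = TRUE), dat1, regex))
by_intersect <- dat1[keep, ]
```

Both approaches should select the same rows; the rowSums version works on a logical matrix, while the Reduce version works on integer index vectors of possibly different lengths.
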
> Hope this helps,
> Rui Barradas
> >
> > -----Original Message-----
> > From: Rui Barradas <ruipbarradas using sapo.pt>
> > To: David Carlson <dcarlson using tamu.edu>; Bert Gunter <bgunter.4567 using gmail.com>
> > Cc: r-help using R-project.org (r-help using r-project.org) <r-help using r-project.org>
> > Sent: Sat, Jan 29, 2022 3:46 am
> > Subject: Re: [R] Row exclude
> >
> > Hello,
> >
> > Getting creative, here is another way with mapply.
> >
> >
> > regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")
> >
> > i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> > dat1[rowSums(i) == 0L, ]
> >
> > #  Name Age Weight
> > #2   Bob   25       142
> > #3 Carol   24       120
> > #5  Katy   35       160
> >
> >
> > Hope this helps,
> >
> > Rui Barradas
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
