[R] concatenating columns in data.frame
Micha Silver
t@v|b@r @end|ng |rom gm@||@com
Sat Jul 3 13:26:08 CEST 2021
Again thanks for carrying on this thread with your additional,
informative comments, as well as the welcome humor.
On 7/3/21 2:59 AM, Jeff Newmiller wrote:
> I am very agnostic about tidyverse/base R. However, the complexity of
> setting up NSE functions is often simply not needed, and I encounter
> so many people who simply disregard base R as being too outdated so
> that they never learn how simple solutions in R can be. The contrast
> between your solution and Bert's was... perhaps informative, but a
> nuclear bomb where an axe was sufficient.
>
> On Fri, 2 Jul 2021, Avi Gross via R-help wrote:
>
>> I know what you mean Jeff. Yes I am very familiar with base R
>> techniques. What I had hoped for was to do two things that some of
>> the other methods mentioned do that ended up bringing two data.frames
>> together as part of the solution.
>>
>> Much of what I used is now standard R. I was looking at the accessory
>> functions now commonly used in dplyr that let you dynamically select
>> which columns to work with like begins_with() to choose. Sadly, they
>> seem to work on a top-level but not easily within a call to something
>> like paste(...) where they are not evaluated in the way I want.
>>
>> But the odd method I tried can also be used in standard R with a bit
>> of work. You can create a function without using dplyr that takes
>> your df and uses it to concatenate and end with something like:
>>
>> df$new_col <- do_something(df, selected_cols)
>>
>> That too adds a column without the need to merge larger structures
>> explicitly..
>>
>> But your other point is a tad religious in a sense. I happen to
>> prefer learning a core language first then looking at enhancement
>> opportunities. But at some point, if teaching someone new who wants
>> to focus on getting a job done simply but not necessarily repeatedly
>> or in some ideal way, it is best to do things in a way that their
>> mind flows better.
>>
>> Many things in the tidyverse are redundant with base R or just "fix"
>> inconsistencies like making sure the first argument is always the
>> same. But many add substantially to doing things in a more
>> step-by-step manner.
>>
>> I do not worship the base language as it first came out or even as it
>> has evolved. I do like to know what choices I have and pick and
>> choose among them as needed. Of course a forum like this is more
>> about base R than otherwise and I acknowledge that. Still, the ":="
>> operator is now base R. There is a new pipeline operator "|>" in base
>> R. Some ideas, good or otherwise, do get in eventually.
>>
>> I started doing graphs using base R as in the plot() command. It was
>> adequate but I wanted better. So I learned about Lattice and various
>> packages and eventually ggplot. I can now do things I barely imagined
>> before and am still learning that there is much more I can do with
>> packages underneath much of the magic and also additional packages
>> layered above it, in some sense. So I do not approach that with an
>> either-or mentality either.
>>
>> Note I am not really talking about just R. I have similar issues with
>> other languages I program in such as Python. None of them were
>> created fully-formed and many had to add huge amounts to adapt to
>> additional wants and needs. Base R for me is often inadequate. But so
>> what?
>>
>> The task being asked for in this thread in isolation, indeed may not
>> be done any better using packages. However, if it is part of a larger
>> set of tasks that can be pipelined, it may well be and I personally
>> was wondering if there was a way in dplyr. There probably is a much
>> better way than I assembled if I only knew about it, and if not, they
>> may add this kind of indirection in a future release if deemed worthy
>> of doing. I have gone back to programs I did years ago with humungous
>> amounts of code using what I knew then and reducing it drastically
>> now that I can tell a function to select say all my column names that
>> end in .orig and apply a set of functions to them with output going
>> to the base name followed by .mean and .sd and so on. All that can
>> often be done in one or two lines of code where previously I had to
>> do 18 near repetitions of each part and then another and another.
>> That used a limited form of dynamism.
>>
>> Be that as it may I think the requester has enough info and we can
>> move on.
>>
>> -----Original Message-----
>> From: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
>> Sent: Friday, July 2, 2021 1:03 AM
>> To: Avi Gross <avigross using verizon.net>; Avi Gross via R-help
>> <r-help using r-project.org>; R-help using r-project.org
>> Subject: Re: [R] concatenating columns in data.frame
>>
>> I use parts of the tidyverse frequently, but this post is the best
>> argument I can imagine for learning base R techniques.
>>
>> On July 1, 2021 8:41:06 PM PDT, Avi Gross via R-help
>> <r-help using r-project.org> wrote:
>>> Micha,
>>>
>>> Others have provided ways in standard R so I will contribute a somewhat
>>> odd solution using the dplyr and related packages in the tidyverse
>>> including a sample data.frame/tibble I made. It requires newer versions
>>> of R and other packages as it uses some fairly esoteric features
>>> including "the big bang" and the new ":=" operator and more.
>>>
>>> You can use your own data with whatever columns you need, of course.
>>>
>>> The goal is to have umpteen columns in the data that you want to add an
>>> additional columns to an existing tibble that is the result of
>>> concatenating the rowwise contents of a dynamically supplied vector of
>>> column names in quotes. First we need something to work with so here is
>>> a sample:
>>>
>>> #--start
>>> # load required packages, or a bunch at once!
>>> library(tidyverse)
>>>
>>> # Pick how many rows you want. For a demo, 3 is plenty N <- 3
>>>
>>> # Make a sample tibble with N rows and the following 4 columns mydf <-
>>> tibble(alpha = 1:N,
>>> beta=letters[1:N],
>>> gamma = N:1,
>>> delta = month.abb[1:N])
>>>
>>> # show the original tibble
>>> print(mydf)
>>> #--end
>>>
>>> In flat text mode, here is the output:
>>>
>>>> print(mydf)
>>> # A tibble: 3 x 4
>>> alpha beta gamma delta
>>> <int> <chr> <int> <chr>
>>> 1 1 a 3 Jan
>>> 2 2 b 2 Feb
>>> 3 3 c 1 Mar
>>>
>>> Now I want to make a function that is used instead of the mutate verb.
>>> I made a weird one-liner that is a tad hard to explain so first let me
>>> mention the requirements.
>>>
>>> It will take a first argument that is a tibble and in a pipeline this
>>> would be passed invisibly.
>>> The second required argument is a vector or list containing the names
>>> of the columns as strings. A column can be re-used multiple times.
>>> The third optional argument is what to name the new column with a
>>> default if omitted.
>>> The fourth optional argument allows you to choose a different separator
>>> than "" if you wish.
>>>
>>> The function should be usable in a pipeline on both sides so it should
>>> also return the input tibble with an extra column to the output.
>>>
>>> Here is the function:
>>>
>>> my_mutate <- function(df, columns, colnew="concatenated", sep=""){
>>> df %>%
>>> mutate( "{colnew}" := paste(!!!rlang::syms(columns), sep = sep )) }
>>>
>>> Yes, the above can be done inline as a long one-liner:
>>>
>>> my_mutate <- function(df, columns, colnew="concatenated", sep="")
>>> mutate(df, "{colnew}" := paste(!!!rlang::syms(columns), sep = sep ))
>>>
>>> Here are examples of it running:
>>>
>>>
>>>> choices <- c("beta", "delta", "alpha", "delta") mydf %>%
>>>> my_mutate(choices, "me2")
>>> # A tibble: 3 x 5
>>> alpha beta gamma delta me2
>>> <int> <chr> <int> <chr> <chr>
>>> 1 1 a 3 Jan aJan1Jan
>>> 2 2 b 2 Feb bFeb2Feb
>>> 3 3 c 1 Mar cMar3Mar
>>>> mydf %>% my_mutate(choices, "me2",":")
>>> # A tibble: 3 x 5
>>> alpha beta gamma delta me2
>>> <int> <chr> <int> <chr> <chr>
>>> 1 1 a 3 Jan a:Jan:1:Jan
>>> 2 2 b 2 Feb b:Feb:2:Feb
>>> 3 3 c 1 Mar c:Mar:3:Mar
>>>> mydf %>% my_mutate(c("beta", "beta", "gamma", "gamma", "delta",
>>>> "alpha"))
>>> # A tibble: 3 x 5
>>> alpha beta gamma delta concatenated
>>> <int> <chr> <int> <chr> <chr>
>>> 1 1 a 3 Jan aa33Jan1
>>> 2 2 b 2 Feb bb22Feb2
>>> 3 3 c 1 Mar cc11Mar3
>>>> mydf %>% my_mutate(list("beta", "beta", "gamma", "gamma", "delta",
>>>> "alpha"))
>>> # A tibble: 3 x 5
>>> alpha beta gamma delta concatenated
>>> <int> <chr> <int> <chr> <chr>
>>> 1 1 a 3 Jan aa33Jan1
>>> 2 2 b 2 Feb bb22Feb2
>>> 3 3 c 1 Mar cc11Mar3
>>>> mydf %>% my_mutate(columns=list("alpha", "beta", "gamma", "delta",
>>>> "gamma", "beta", "alpha"),
>>> + sep="/*/",
>>> + colnew="NewRandomNAME"
>>> + )
>>> # A tibble: 3 x 5
>>> alpha beta gamma delta NewRandomNAME
>>> <int> <chr> <int> <chr> <chr>
>>> 1 1 a 3 Jan 1/*/a/*/3/*/Jan/*/3/*/a/*/1
>>> 2 2 b 2 Feb 2/*/b/*/2/*/Feb/*/2/*/b/*/2
>>> 3 3 c 1 Mar 3/*/c/*/1/*/Mar/*/1/*/c/*/3
>>>
>>> Does this meet your normal need? Just to show it works in a pipeline,
>>> here is a variant:
>>>
>>> mydf %>%
>>> tail(2) %>%
>>> my_mutate(c("beta", "beta"), "betabeta") %>%
>>> print() %>%
>>> my_mutate(list("alpha", "betabeta", "gamma"),
>>> "buildson",
>>> "&")
>>>
>>> The above only keeps the last two lines of the tibble, makes a double
>>> copy of "beta" under a new name, prints the intermediate result,
>>> continues to make another concatenation using the variable created
>>> earlier then prints the result:
>>>
>>> Here is the run:
>>>
>>>> mydf %>%
>>> + tail(2) %>%
>>> + my_mutate(c("beta", "beta"), "betabeta") %>%
>>> + print() %>%
>>> + my_mutate(list("alpha", "betabeta", "gamma"),
>>> + "buildson",
>>> + "&")
>>> # A tibble: 2 x 5
>>> alpha beta gamma delta betabeta
>>> <int> <chr> <int> <chr> <chr>
>>> 1 2 b 2 Feb bb
>>> 2 3 c 1 Mar cc
>>> # A tibble: 2 x 6
>>> alpha beta gamma delta betabeta buildson
>>> <int> <chr> <int> <chr> <chr> <chr>
>>> 1 2 b 2 Feb bb 2&bb&2
>>> 2 3 c 1 Mar cc 3&cc&1
>>>
>>> As to how the darn function works, that was a learning experience for
>>> me to build using features I have not had occasion to use. If anyone
>>> remains interested, read on.
>>>
>>> The following needs newish features:
>>>
>>> "{colnew}" := SOMETHING
>>>
>>> The colon-equals operator in newer R/dplyr can be sort of used in an
>>> odd way that allows the name of the variable to be in quotes and in
>>> brackets akin to the way glue() does it. The variable colnew is
>>> evaluated and substituted so the name used for the column is now
>>> dynamic.
>>>
>>> The function does a paste using this:
>>>
>>> !!!rlang::syms(columns)
>>>
>>> The problem is paste() wants multiple arguments and we have a single
>>> argument that is either a vector or another kind of vector called a
>>> list. The trick is to convert the vector into symbols then use "!!!" to
>>> convert something like 'c("alpha", "beta", "gamma")' into something
>>> more like ' "alpha", "beta", "gamma" ' so that paste sees them as
>>> multiple arguments to concatenate in vector fashion.
>>>
>>> And, the function is not polished but I am sure you can all see some of
>>> what is needed like checking the arguments for validity, including not
>>> having a name for the new column that clashes with existing column
>>> names, doing something sane if no columns to concatenate are offered
>>> and so on.
>>>
>>> Just showing a different approach. The base R methods are fine.
>>>
>>> - Avi
>>>
>>> -----Original Message-----
>>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Micha Silver
>>> Sent: Thursday, July 1, 2021 10:36 AM
>>> To: R-help using r-project.org
>>> Subject: [R] concatenating columns in data.frame
>>>
>>> I need to create a new data.frame column as a concatenation of existing
>>> character columns. But the number and name of the columns to
>>> concatenate needs to be passed in dynamically. The code below does what
>>> I want, but seems very clumsy. Any suggestions how to improve?
>>>
>>>
>>> df = data.frame("A"=sample(letters, 10), "B"=sample(letters, 10),
>>> "C"=sample(letters,10), "D"=sample(letters, 10))
>>>
>>> # Which columns to concat:
>>>
>>> use_columns = c("D", "B")
>>>
>>>
>>> UpdateCombo = function(df, use_columns) {
>>> use_df = df[, use_columns]
>>> combo_list = lapply(1:nrow(use_df), function(r) {
>>> r_combo = paste(use_df[r,], collapse="_")
>>> return(data.frame("Combo" = r_combo))
>>> })
>>> combo = do.call(rbind, combo_list)
>>>
>>> names(combo) = "Combo"
>>>
>>> return(combo)
>>>
>>> }
>>>
>>>
>>> combo_col = UpdateCombo(df, use_columns)
>>>
>>> df_combo = do.call(cbind, list(df, combo_col))
>>>
>>>
>>> Thanks
>>>
>>>
>>> --
>>> Micha Silver
>>> Ben Gurion Univ.
>>> Sde Boker, Remote Sensing Lab
>>> cell: +972-523-665918
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ---------------------------------------------------------------------------
>
> Jeff Newmiller The ..... ..... Go
> Live...
> DCN:<jdnewmil using dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
> Live: OO#.. Dead: OO#.. Playing
> Research Engineer (Solar/Batteries O.O#. #.O#. with
> /Software/Embedded Controllers) .OO#. .OO#.
> rocks...1k
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Micha Silver
Ben Gurion Univ.
Sde Boker, Remote Sensing Lab
cell: +972-523-665918
More information about the R-help
mailing list