[R] Split

Bert Gunter bgunter.4567 using gmail.com
Wed Sep 23 02:19:26 CEST 2020


Oh, if efficiency is a consideration, then my code is about 15 times as
fast as Rui's:
> F2 <- F1[rep(1:5,1e6),]  ## 5 million rows
##Rui's
> system.time({
+     F2$Y1 <- +grepl("_", F2$text)
+     tmp <- strsplit(as.character(F2$text), "_")
+     tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
+     tmp <- do.call(rbind, tmp)
+     colnames(tmp) <- c("X1", "X2")
+     F2 <- cbind(F2[-3], tmp)    # remove the original column
+ })
   user  system elapsed
 20.072   0.625  20.786

## my version
> system.time({
+     wh <- grep("_",F2$text, fixed = TRUE, invert = TRUE)
+     F2[wh,"text"] <- paste(F2[wh,"text"],".",sep = "_")
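+     ## F2, not F1: the timing below was recorded with F1$text (5 rows) here,
+     ## so it understates the cost of strsplit() on the 5 million rows
+     z <- unlist(strsplit(F2$text,"_"))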
+     F2 <- cbind(F2, matrix(z, ncol = 2, byrow = TRUE))
+     F2
+ })
   user  system elapsed
  1.256   0.019   1.281
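
A minimal sketch for checking that the two approaches agree and for re-timing them on your own machine. The helper names split_lapply() and split_paste() are hypothetical (they are not from the thread), and F1 is assumed to be the sample data from the original question. Each helper splits the column of the data frame it is given, so the timing covers the whole input. Timings depend on the machine, so none are shown.

```r
## Sample data from the original question
F1 <- read.table(text = "ID1 ID2 text
A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15", header = TRUE, stringsAsFactors = FALSE)

## lapply()-based split, wrapped in a hypothetical helper
split_lapply <- function(d) {
  tmp <- strsplit(as.character(d$text), "_", fixed = TRUE)
  ## pad entries that had no "_" with "."
  tmp <- lapply(tmp, function(x) if (length(x) == 1) c(x, ".") else x)
  do.call(rbind, tmp)
}

## paste-then-split, wrapped in a hypothetical helper
split_paste <- function(d) {
  txt <- as.character(d$text)
  ## indices of entries without "_"; append "_." to those
  wh <- grep("_", txt, fixed = TRUE, invert = TRUE)
  txt[wh] <- paste(txt[wh], ".", sep = "_")
  matrix(unlist(strsplit(txt, "_", fixed = TRUE)), ncol = 2, byrow = TRUE)
}

## Both helpers should return the same 5 x 2 character matrix
m1 <- split_lapply(F1)
m2 <- split_paste(F1)
stopifnot(identical(unname(m1), unname(m2)))

## Re-time on a larger copy (500,000 rows here; an arbitrary choice)
F2 <- F1[rep(1:5, 1e5), ]
print(system.time(split_lapply(F2)))
print(system.time(split_paste(F2)))
```

Note that split_paste() assumes every entry contains at most one "_"; with more, the byrow matrix would be misaligned.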

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Tue, Sep 22, 2020 at 5:04 PM Val <valkremk using gmail.com> wrote:

> Thank you all for the help!
>
> LMH, yes, I would like to see the alternative. I am using this for a
> large data set, and if the alternative is more efficient than this,
> I would be happy.
>
> On Tue, Sep 22, 2020 at 6:25 PM Bert Gunter <bgunter.4567 using gmail.com>
> wrote:
> >
> > To be clear, I think Rui's solution is perfectly fine and probably
> > better than what I offer below. But just for fun, I wanted to do it
> > without the lapply(). Here is one way. I think my comments suffice to
> > explain.
> >
> > > ## which are the  non "_" indices?
> > > wh <- grep("_",F1$text, fixed = TRUE, invert = TRUE)
> > > ## paste "_." to these
> > > F1[wh,"text"] <- paste(F1[wh,"text"],".",sep = "_")
> > > ## Now strsplit() and unlist() them to get a vector
> > > z <- unlist(strsplit(F1$text, "_"))
> > > ## now cbind() to the data frame
> > > F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
> > > F1
> >   ID1 ID2   text    1  2
> > 1  A1  B1 NONE_. NONE  .
> > 2  A1  B1  cf_12   cf 12
> > 3  A1  B1 NONE_. NONE  .
> > 4  A2  B2  X2_25   X2 25
> > 5  A2  B3  fd_15   fd 15
> > >## You can change the names of the 2 columns yourself
> >
> > Cheers,
> > Bert
> >
> >
> > On Tue, Sep 22, 2020 at 12:19 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> >>
> >> Hello,
> >>
> >> A base R solution with strsplit, like in your code.
> >>
> >> F1$Y1 <- +grepl("_", F1$text)
> >>
> >> tmp <- strsplit(as.character(F1$text), "_")
> >> tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
> >> tmp <- do.call(rbind, tmp)
> >> colnames(tmp) <- c("X1", "X2")
> >> F1 <- cbind(F1[-3], tmp)    # remove the original column
> >> rm(tmp)
> >>
> >> F1
> >> #  ID1 ID2 Y1   X1 X2
> >> #1  A1  B1  0 NONE  .
> >> #2  A1  B1  1   cf 12
> >> #3  A1  B1  0 NONE  .
> >> #4  A2  B2  1   X2 25
> >> #5  A2  B3  1   fd 15
> >>
> >>
> >> Note that cbind dispatches on F1, an object of class "data.frame".
> >> Therefore it's the method cbind.data.frame that is called and the result
> >> is also a df, though tmp is a "matrix".
> >>
> >>
> >> Hope this helps,
> >>
> >> Rui Barradas
> >>
> >>
> >> At 20:07 on 22/09/20, Rui Barradas wrote:
> >> > Hello,
> >> >
> >> > Something like this?
> >> >
> >> >
> >> > F1$Y1 <- +grepl("_", F1$text)
> >> > F1 <- F1[c(1, 2, 4, 3)]
> >> > F1 <- tidyr::separate(F1, text, into = c("X1", "X2"), sep = "_",
> >> >                       fill = "right")
> >> > F1
> >> >
> >> >
> >> > Hope this helps,
> >> >
> >> > Rui Barradas
> >> >
> >> > At 19:55 on 22/09/20, Val wrote:
> >> >> Hi all,
> >> >>
> >> >> I am trying to create new columns based on the string content of
> >> >> another column. First I want to identify the rows that contain a
> >> >> particular string. If a row contains it, I want to split the string
> >> >> and create two variables.
> >> >>
> >> >> Here is my sample of data.
> >> >> F1<-read.table(text="ID1  ID2  text
> >> >> A1 B1   NONE
> >> >> A1 B1   cf_12
> >> >> A1 B1   NONE
> >> >> A2 B2   X2_25
> >> >> A2 B3   fd_15  ",header=TRUE,stringsAsFactors=FALSE)
> >> >> If the variable "text" contains "_", I want to create an indicator
> >> >> variable as shown below
> >> >>
> >> >> F1$Y1 <- ifelse(grepl("_", F1$text),1,0)
> >> >>
> >> >>
> >> >> Then I want to split that string into two, before "_" and after "_",
> >> >> and create two variables as shown below
> >> >> x1 <- strsplit(as.character(F1$text), "_", fixed = TRUE)
> >> >>
> >> >> My problem is how to combine this with the original data frame. The
> >> >> desired output is shown below:
> >> >>
> >> >>
> >> >> ID1 ID2 Y1 X1   X2
> >> >> A1  B1  0  NONE .
> >> >> A1  B1  1  cf   12
> >> >> A1  B1  0  NONE .
> >> >> A2  B2  1  X2   25
> >> >> A2  B3  1  fd   15
> >> >>
> >> >> Any help?
> >> >> Thank you.
> >> >>
> >> >> ______________________________________________
> >> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> PLEASE do read the posting guide
> >> >> http://www.R-project.org/posting-guide.html
> >> >> and provide commented, minimal, self-contained, reproducible code.
> >> >>
>
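
The desired output from the original question can also be built without strsplit() at all. The following is a hedged sketch of a regex-based alternative using sub(); it is not one of the solutions posted in the thread. It assumes F1 is the sample data from the question, and the names has_sep and out are made up for illustration.

```r
## Sample data from the original question
F1 <- read.table(text = "ID1 ID2 text
A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15", header = TRUE, stringsAsFactors = FALSE)

## TRUE where the text contains a literal "_"
has_sep <- grepl("_", F1$text, fixed = TRUE)

out <- data.frame(
  F1[c("ID1", "ID2")],
  Y1 = as.integer(has_sep),
  ## everything before the first "_" (the whole string if there is none)
  X1 = sub("_.*$", "", F1$text),
  ## everything after the first "_", or "." when there is no "_"
  X2 = ifelse(has_sep, sub("^[^_]*_", "", F1$text), "."),
  stringsAsFactors = FALSE
)
print(out)
```

If an entry contains more than one "_", this sketch keeps everything after the first one in X2 rather than creating further columns.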