[R] splitting a string column into multiple columns faster

Sun Jun 9 01:08:40 CEST 2013

Hi Dimitri,
No problem.
I noticed that the read.table() approach gets much slower as the number of rows grows.  You could use data.table instead.
##1e6 rows
l1<- letters[1:10]
s1<-sapply(seq_along(l1),function(i) paste(rep(l1[i],3),collapse=""))
set.seed(24)
x2<-data.frame(x=paste(paste0(sample(s1,1e6,replace=TRUE),sample(1:15,1e6,replace=TRUE)),paste0(sample(s1,1e6,replace=TRUE),sample(1:15,1e6,replace=TRUE)),paste0(sample(s1,1e6,replace=TRUE),sample(1:15,1e6,replace=TRUE)),sep="_"),stringsAsFactors=FALSE)
system.time(resNew2<-data.frame(x=x2,read.table(text=gsub("[A-Za-z]","",x2[,1]),sep="_",header=FALSE),stringsAsFactors=FALSE))
#
#   user  system elapsed 
#363.383   0.036 364.153 

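library(data.table)
dt2<- data.table(x2)
system.time({
 # strip the letters, leaving e.g. "12_3_8"
 dt2[,xNew:= gsub("[A-Za-z]","",x)]
 # by=xNew splits once per distinct value of xNew rather than once per row;
 # V1, V2 and V3 are stored as character here
 dt2[,V1:=unlist(strsplit(xNew,split="_"))[[1]],by=xNew]
 dt2[,V2:=unlist(strsplit(xNew,split="_"))[[2]],by=xNew]
 dt2[,V3:=unlist(strsplit(xNew,split="_"))[[3]],by=xNew]
 # drop the helper column xNew (column 2)
 dt3<- subset(dt2,select=-2)
})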
# user  system elapsed 
#  3.076   0.004   3.085 
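
Part of the speed-up probably comes from by=xNew: the simulated data can have at most 15^3 = 3375 distinct digit patterns, so strsplit() runs once per pattern instead of once per row. A minimal base R sketch of the same idea, assuming every row has exactly three fields (the names u, parts, idx and res are only for illustration and are not used elsewhere in this thread):

```r
# Toy input; the column in the timing above has 1e6 rows
x <- c("ccc12_ccc3_ggg8", "ccc8_ccc1_fff11", "ccc12_ccc3_ggg8")

# Strip the letters, leaving strings such as "12_3_8"
xNew <- gsub("[A-Za-z]", "", x)

# Split only the distinct values, one row of the matrix per distinct value
u <- unique(xNew)
parts <- do.call(rbind, strsplit(u, "_", fixed = TRUE))
storage.mode(parts) <- "integer"   # character matrix -> integer matrix

# Map every original row back to the row of its distinct value
idx <- match(xNew, u)
res <- data.frame(x = x, parts[idx, , drop = FALSE], stringsAsFactors = FALSE)
names(res) <- c("x", "V1", "V2", "V3")
res
```

On real data with mostly unique strings this trick would gain little, since nearly every row would be split anyway.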
dim(resNew2)
#[1] 1000000       4
 dim(dt3)
#[1] 1000000       4

 head(resNew2)
#                 x V1 V2 V3
#1  ccc12_ccc3_ggg8 12  3  8
#2  ccc8_ccc1_fff11  8  1 11
#3 hhh15_ggg2_hhh13 15  2 13
#4   fff9_bbb3_ccc9  9  3  9
#5  ggg4_eee2_jjj14  4  2 14
#6  jjj7_ddd9_bbb15  7  9 15

 head(dt3)
#                  x V1 V2 V3
#1:  ccc12_ccc3_ggg8 12  3  8
#2:  ccc8_ccc1_fff11  8  1 11
#3: hhh15_ggg2_hhh13 15  2 13
#4:   fff9_bbb3_ccc9  9  3  9
#5:  ggg4_eee2_jjj14  4  2 14
#6:  jjj7_ddd9_bbb15  7  9 15
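
If every row is known to contain exactly three "_"-separated fields, another option is to split the whole vector once and reshape the pieces into a matrix. This is only a sketch in base R under that assumption; I have not timed it on the 1e6-row data, and the names digits, fields, m and res2 are made up for the example:

```r
x <- c("ccc12_ccc3_ggg8", "ccc8_ccc1_fff11", "hhh15_ggg2_hhh13")

# Strip the letters, then split every element on "_" in one vectorised call
digits <- gsub("[A-Za-z]", "", x)
fields <- strsplit(digits, "_", fixed = TRUE)

# Guard the assumption: the reshape below is wrong if any row
# has more or fewer than three fields
stopifnot(all(lengths(fields) == 3L))

# unlist() returns the pieces row by row, so fill the matrix by row
m <- matrix(as.integer(unlist(fields)), ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("V1", "V2", "V3")))
res2 <- data.frame(x = x, m, stringsAsFactors = FALSE)
res2
```

Unlike the data.table code, this gives integer columns directly, as read.table() does.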
A.K.

________________________________
From: Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com>
To: arun <smartpink111 at yahoo.com> 
Cc: R help <r-help at r-project.org> 
Sent: Saturday, June 8, 2013 5:59 PM
Subject: Re: [R] splitting a string column into multiple columns faster

Thanks again, guys!
Arun's method worked. I have over 270,000 rows and it took me 1 min.
Dimitri

On Sat, Jun 8, 2013 at 7:47 AM, Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com> wrote:

>Thank you so much, Jorge and Arun - I'll give it a try!
>Dimitri
>
>
>
>On Fri, Jun 7, 2013 at 11:27 PM, arun <smartpink111 at yahoo.com> wrote:
>
>>Hi,
>>Tried it on 1e5 row dataset:
>>
>>l1<- letters[1:10]
>>s1<-sapply(seq_along(l1),function(i) paste(rep(l1[i],3),collapse=""))
>>set.seed(24)
>>x1<-data.frame(x=paste(paste0(sample(s1,1e5,replace=TRUE),sample(1:15,1e5,replace=TRUE)),paste0(sample(s1,1e5,replace=TRUE),sample(1:15,1e5,replace=TRUE)),paste0(sample(s1,1e5,replace=TRUE),sample(1:15,1e5,replace=TRUE)),sep="_"),stringsAsFactors=FALSE)
>>system.time(resNew<-data.frame(x=x1,read.table(text=gsub("[A-Za-z]","",x1[,1]),sep="_",header=FALSE),stringsAsFactors=FALSE))
>>#   user  system elapsed
>>#  2.712   0.016   2.732
>>
>>head(resNew)
>>
>>#                  x V1 V2 V3
>>#1  ccc12_ggg2_jjj14 12  2 14
>>#2  ccc7_ddd15_aaa11  7 15 11
>>#3 hhh12_ddd14_fff12 12 14 12
>>#4  fff11_bbb15_aaa6 11 15  6
>>#5   ggg12_ccc9_ggg8 12  9  8
>>#6   jjj8_eee12_eee4  8 12  4
>>
>>
>>A.K.
>>
>>
>>----- Original Message -----
>>
>>From: arun <smartpink111 at yahoo.com>
>>To: Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com>
>>Cc: R help <r-help at r-project.org>
>>Sent: Friday, June 7, 2013 11:00 PM
>>Subject: Re: [R] splitting a string column into multiple columns faster
>>
>>Hi,
>>Maybe this helps:
>>
>>res<-data.frame(x=x,read.table(text=gsub("[A-Za-z]","",x[,1]),sep="_",header=FALSE),stringsAsFactors=FALSE)
>>res
>>#               x V1 V2 V3
>>#1 aaa1_bbb1_ccc3  1  1  3
>>#2 aaa2_bbb3_ccc2  2  3  2
>>#3 aaa3_bbb2_ccc1  3  2  1
>>A.K.
>>
>>----- Original Message -----
>>From: Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com>
>>To: r-help <r-help at r-project.org>
>>Cc:
>>Sent: Friday, June 7, 2013 9:24 PM
>>Subject: [R] splitting a string column into multiple columns faster
>>
>>Hello!
>>
>>I have a column in my data frame that I have to split: I have to distill
>>the numbers from the text. Below is my example and my solution.
>>
>>x<-data.frame(x=c("aaa1_bbb1_ccc3","aaa2_bbb3_ccc2","aaa3_bbb2_ccc1"))
>>x
>>library(stringr)
>>out<-as.data.frame(str_split_fixed(x$x,"aaa",2))
>>out2<-as.data.frame(str_split_fixed(out$V2,"_bbb",2))
>>out3<-as.data.frame(str_split_fixed(out2$V2,"_ccc",2))
>>result<-cbind(x,out2[1],out3)
>>result
>>My problem is:
>>str_split_fixed is relatively slow. In my real data frame I have over
>>80,000 rows, so it takes almost 30 seconds to run just one line (like
>>out<-... above).
>>And it's even slower overall because I have to do it step by step several times.
>>
>>Is there any way to do it by specifying all 3 delimiters at once
>>("aaa","_bbb","_ccc") and then split it in one swoop into a data frame with
>>several columns?
>>
>>Thanks a lot for any pointers!
>>
>>--
>>Dimitri Liakhovitski
>>