[R] splitting a string column into multiple columns faster
arun
smartpink111 at yahoo.com
Sun Jun 9 01:08:40 CEST 2013
Hi Dimitri,
No problem.
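I noticed that the read.table() approach gets much slower as the number of rows grows: it took about 364 seconds on 1e6 rows (below), versus under 3 seconds on 1e5 rows. You could use the data.table package instead.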
##1e6 rows
l1<- letters[1:10]
s1<-sapply(seq_along(l1),function(i) paste(rep(l1[i],3),collapse=""))
set.seed(24)
x2<-data.frame(x=paste(paste0(sample(s1,1e6,replace=TRUE),sample(1:15,1e6,replace=TRUE)),paste0(sample(s1,1e6,replace=TRUE),sample(1:15,1e6,replace=TRUE)),paste0(sample(s1,1e6,replace=TRUE),sample(1:15,1e6,replace=TRUE)),sep="_"),stringsAsFactors=FALSE)
system.time(resNew2<-data.frame(x=x2,read.table(text=gsub("[A-Za-z]","",x2[,1]),sep="_",header=FALSE),stringsAsFactors=FALSE))
#    user  system elapsed
# 363.383   0.036 364.153
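If staying in base R is preferable, one possible alternative to read.table(text=...) is a single vectorized strsplit() call whose result is folded into a matrix. This is only a sketch and was not timed in this thread. It assumes every row has exactly three "_"-separated fields, and `x_small`, `digits`, `parts` and `res_small` are illustrative names, not objects from the code above.

```r
# Small stand-in for x2: same "letters+digits" fields joined by "_"
x_small <- data.frame(x = c("ccc12_ccc3_ggg8", "ccc8_ccc1_fff11", "hhh15_ggg2_hhh13"),
                      stringsAsFactors = FALSE)

# Strip the letters, split all strings on "_" in one call, then fold the
# flat vector into a 3-column integer matrix, filling it row by row
digits <- gsub("[A-Za-z]", "", x_small$x)
parts  <- matrix(as.integer(unlist(strsplit(digits, "_", fixed = TRUE))),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("V1", "V2", "V3")))

# Bind the original strings and the extracted numbers together
res_small <- data.frame(x = x_small$x, parts, stringsAsFactors = FALSE)
res_small
```

If some rows could have fewer or more than three fields, the matrix() step would misalign the values, so the field count should be checked first (for example with lengths(strsplit(...))).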
library(data.table)
dt2<- data.table(x2)
system.time({
dt2[,xNew:= gsub("[A-Za-z]","",x),]
dt2[,V1:=unlist(strsplit(xNew,split="_"))[[1]],by=xNew]
dt2[,V2:=unlist(strsplit(xNew,split="_"))[[2]],by=xNew]
dt2[,V3:=unlist(strsplit(xNew,split="_"))[[3]],by=xNew]
dt3<- subset(dt2,select=-2)
})
# user system elapsed
# 3.076 0.004 3.085
dim(resNew2)
#[1] 1000000 4
dim(dt3)
#[1] 1000000 4
head(resNew2)
# x V1 V2 V3
#1 ccc12_ccc3_ggg8 12 3 8
#2 ccc8_ccc1_fff11 8 1 11
#3 hhh15_ggg2_hhh13 15 2 13
#4 fff9_bbb3_ccc9 9 3 9
#5 ggg4_eee2_jjj14 4 2 14
#6 jjj7_ddd9_bbb15 7 9 15
head(dt3)
# x V1 V2 V3
#1: ccc12_ccc3_ggg8 12 3 8
#2: ccc8_ccc1_fff11 8 1 11
#3: hhh15_ggg2_hhh13 15 2 13
#4: fff9_bbb3_ccc9 9 3 9
#5: ggg4_eee2_jjj14 4 2 14
#6: jjj7_ddd9_bbb15 7 9 15
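Note that the strsplit() version above stores V1 to V3 as character, whereas read.table() returns integers, and it calls strsplit() three times. More recent data.table releases (newer than the one available when this was written) provide tstrsplit(), which should allow all three columns to be created in one pass. A minimal sketch on a three-row stand-in, assuming such a data.table version is installed; `dt_small` is an illustrative name:

```r
library(data.table)

# Small stand-in for dt2
dt_small <- data.table(x = c("ccc12_ccc3_ggg8", "ccc8_ccc1_fff11", "hhh15_ggg2_hhh13"))

# tstrsplit() splits each string and returns the pieces transposed,
# i.e. one list element per new column; type.convert = TRUE turns the
# digit strings into integers
dt_small[, c("V1", "V2", "V3") := tstrsplit(gsub("[A-Za-z]", "", x), "_",
                                            fixed = TRUE, type.convert = TRUE)]
dt_small
```

This also avoids the temporary xNew column and the grouping by xNew, so no subset() step is needed afterwards.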
A.K.
________________________________
From: Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com>
To: arun <smartpink111 at yahoo.com>
Cc: R help <r-help at r-project.org>
Sent: Saturday, June 8, 2013 5:59 PM
Subject: Re: [R] splitting a string column into multiple columns faster
Thanks again, guys!
Arun's method worked. I have over 270,000 rows and it took me 1 min.
Dimitri
On Sat, Jun 8, 2013 at 7:47 AM, Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com> wrote:
>Thank you so much, Jorge and Arun - I'll give it a try!
>Dimitri
>
>
>
>On Fri, Jun 7, 2013 at 11:27 PM, arun <smartpink111 at yahoo.com> wrote:
>
>>Hi,
>>Tried it on 1e5 row dataset:
>>
>>l1<- letters[1:10]
>>s1<-sapply(seq_along(l1),function(i) paste(rep(l1[i],3),collapse=""))
>>set.seed(24)
>>x1<-data.frame(x=paste(paste0(sample(s1,1e5,replace=TRUE),sample(1:15,1e5,replace=TRUE)),paste0(sample(s1,1e5,replace=TRUE),sample(1:15,1e5,replace=TRUE)),paste0(sample(s1,1e5,replace=TRUE),sample(1:15,1e5,replace=TRUE)),sep="_"),stringsAsFactors=FALSE)
>>system.time(resNew<-data.frame(x=x1,read.table(text=gsub("[A-Za-z]","",x1[,1]),sep="_",header=FALSE),stringsAsFactors=FALSE))
>># user system elapsed
>># 2.712 0.016 2.732
>>
>>head(resNew)
>>
>># x V1 V2 V3
>>#1 ccc12_ggg2_jjj14 12 2 14
>>#2 ccc7_ddd15_aaa11 7 15 11
>>#3 hhh12_ddd14_fff12 12 14 12
>>#4 fff11_bbb15_aaa6 11 15 6
>>#5 ggg12_ccc9_ggg8 12 9 8
>>#6 jjj8_eee12_eee4 8 12 4
>>
>>
>>A.K.
>>
>>
>>----- Original Message -----
>>
>>From: arun <smartpink111 at yahoo.com>
>>To: Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com>
>>Cc: R help <r-help at r-project.org>
>>Sent: Friday, June 7, 2013 11:00 PM
>>Subject: Re: [R] splitting a string column into multiple columns faster
>>
>>Hi,
>>Maybe this helps:
>>
>>res<-data.frame(x=x,read.table(text=gsub("[A-Za-z]","",x[,1]),sep="_",header=FALSE),stringsAsFactors=FALSE)
>>res
>># x V1 V2 V3
>>#1 aaa1_bbb1_ccc3 1 1 3
>>#2 aaa2_bbb3_ccc2 2 3 2
>>#3 aaa3_bbb2_ccc1 3 2 1
>>A.K.
>>
>>----- Original Message -----
>>From: Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com>
>>To: r-help <r-help at r-project.org>
>>Cc:
>>Sent: Friday, June 7, 2013 9:24 PM
>>Subject: [R] splitting a string column into multiple columns faster
>>
>>Hello!
>>
>>I have a column in my data frame that I have to split: I have to distill
>>the numbers from the text. Below is my example and my solution.
>>
>>x<-data.frame(x=c("aaa1_bbb1_ccc3","aaa2_bbb3_ccc2","aaa3_bbb2_ccc1"))
>>x
>>library(stringr)
>>out<-as.data.frame(str_split_fixed(x$x,"aaa",2))
>>out2<-as.data.frame(str_split_fixed(out$V2,"_bbb",2))
>>out3<-as.data.frame(str_split_fixed(out2$V2,"_ccc",2))
>>result<-cbind(x,out2[1],out3)
>>result
>>My problem is:
>>str_split_fixed is relatively slow. In my real data frame I have over
>>80,000 rows, so it takes almost 30 seconds to run just one line (like
>>out<-... above).
>>And it's even slower overall because I have to do it step by step many times.
>>
>>Is there any way to do it by specifying all 3 delimiters at once
>>("aaa","_bbb","_ccc") and then splitting it in one swoop into a data frame with
>>several columns?
>>
>>Thanks a lot for any pointers!
>>
>>--
>>Dimitri Liakhovitski
>>
>>
>>______________________________________________
>>R-help at r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
>--
>
>Dimitri Liakhovitski
--
Dimitri Liakhovitski