[R] splitting strings efficiently

jim holtman jholtman at gmail.com
Sun Jan 8 20:37:56 CET 2012


Just a quick follow-up to the previous post, this time using 4M entries
(about 20 seconds seems like a reasonable time for the operation):

>  ip <- "123.456.789.321"  ## example data
>  df <- data.frame(ip = rep(ip, 4e6), stringsAsFactors=FALSE)
>  system.time(x <- strsplit(df$ip, '\\.'))
   user  system elapsed
  19.47    0.12   20.86
>  str(x)
List of 4000000
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
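
To get from that list to the four columns Andrew asked for, something along these lines might work. This is only a sketch: the column names IPA..IPD are taken from his loop, the three sample addresses are made up, and it assumes every address has exactly four components. fixed = TRUE matches the "." literally instead of as a regular expression, which may also save some time, though I have not timed it here.

```r
## Small stand-in for the 4M-row data; the addresses are invented and the
## column names IPA..IPD follow Andrew's original loop.
ip <- c("123.456.789.321", "10.20.30.40", "1.2.3.4")
df <- data.frame(ip = ip, stringsAsFactors = FALSE)

## One vectorized call; fixed = TRUE treats "." literally (no regex).
parts <- strsplit(df$ip, ".", fixed = TRUE)

## Assumption: every address splits into exactly four pieces.
stopifnot(all(lengths(parts) == 4L))

## Fill a 4-column matrix row by row, then copy the columns over.
m <- matrix(unlist(parts), ncol = 4, byrow = TRUE)
df$IPA <- m[, 1]
df$IPB <- m[, 2]
df$IPC <- m[, 3]
df$IPD <- m[, 4]
```

The components stay character; wrap them in as.integer() if numeric comparisons against IP ranges are needed.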




On Sun, Jan 8, 2012 at 8:11 AM, Enrico Schumann <enricoschumann at yahoo.de> wrote:
>
> Hi Andrew,
>
> you can use strsplit for a character vector; you do not have to call it for
> every element data$ComputerName[i].
>
> If I understand correctly, maybe something like this helps
>
>> ip <- "123.456.789.321"  ## example data
>> df <- data.frame(ip = rep(ip, 9), stringsAsFactors=FALSE)
>> df
>               ip
> 1 123.456.789.321
> 2 123.456.789.321
> 3 123.456.789.321
> 4 123.456.789.321
> 5 123.456.789.321
> 6 123.456.789.321
> 7 123.456.789.321
> 8 123.456.789.321
> 9 123.456.789.321
>
>>
>> res <- unlist(strsplit(df[["ip"]], "\\."))
>> ii <- seq(1, nrow(df)*4, by = 4)
>> res[ii]   ## A
> [1] "123" "123" "123" "123" "123" "123" "123"
> [8] "123" "123"
>> res[ii+1] ## B
> [1] "456" "456" "456" "456" "456" "456" "456"
> [8] "456" "456"
>> res[ii+2] ## C
> [1] "789" "789" "789" "789" "789" "789" "789"
> [8] "789" "789"
>> res[ii+3] ## D
> [1] "321" "321" "321" "321" "321" "321" "321"
> [8] "321" "321"
>
>
> Regards,
> Enrico
>
>
> On 08.01.2012 11:06, Andrew Roberts wrote:
>
>> Folks,
>>
>> I have a data frame with 4861469 rows that contains an ip address
>> xxx.xxx.xxx.xxx as one of the columns. I want to assign a site to each
>> row based on IP ranges. To do this I have a function to split the ip
>> address as character into class A,B,C and D components. It works but is
>> horribly inefficient in terms of speed. I can't quite see how one of the
>> l/s/m/t/apply functions could be brought to bear on the problem. Does
>> anyone have any thoughts?
>>
>> for(i in 1:4861469)
>>    {
>>    lst<-unlist(strsplit(data$ComputerName[i], "\\."))
>>    data$IPA[i]<-lst[[1]]
>>    data$IPB[i]<-lst[[2]]
>>    data$IPC[i]<-lst[[3]]
>>    data$IPD[i]<-lst[[4]]
>>    rm(lst)
>>    }
>>
>> Andrew
>>
>> Andrew Roberts
>> Children's Orthopaedic Surgeon
>> RJAH, Oswestry, UK
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Enrico Schumann
> Lucerne, Switzerland
> http://nmof.net/
>
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


