[R] splitting strings efficiently

Sun Jan 8 20:37:56 CET 2012

Just a quick follow-up to the previous post, this time using 4M entries.
A single vectorized strsplit call takes about 20 seconds, which seems
like a reasonable time for the operation:

>  ip <- "123.456.789.321"  ## example data
>  df <- data.frame(ip = rep(ip, 4e6), stringsAsFactors=FALSE)
>  system.time(x <- strsplit(df$ip, '\\.'))
   user  system elapsed
  19.47    0.12   20.86
>  str(x)
List of 4000000
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
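
To get from that list to the four columns the original post asks for, something along these lines might work. This is only a sketch: it assumes every address has exactly four dot-separated fields, it uses a small example data frame rather than the real data, and it borrows the column names IPA..IPD from the loop in the original question. The fixed = TRUE argument treats the dot as a literal character instead of a regular expression, which may also save some time, though I have not timed it on the 4M case.

```r
# Example data: the same address repeated (small n for illustration only)
ip <- "123.456.789.321"
df <- data.frame(ip = rep(ip, 9), stringsAsFactors = FALSE)

# One vectorised call splits every address at the literal dots
x <- strsplit(df$ip, ".", fixed = TRUE)

# Assumption: every address has exactly four fields; stop early if not
stopifnot(all(lengths(x) == 4L))

# unlist() yields A, B, C, D, A, B, C, D, ... so fill the matrix row by row
parts <- matrix(unlist(x), ncol = 4, byrow = TRUE)

# Attach the components as new columns (names taken from the original loop)
df$IPA <- parts[, 1]
df$IPB <- parts[, 2]
df$IPC <- parts[, 3]
df$IPD <- parts[, 4]
```

If the components are needed as numbers for range comparisons, as.integer() on each column would be the obvious next step.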

On Sun, Jan 8, 2012 at 8:11 AM, Enrico Schumann <enricoschumann at yahoo.de> wrote:
>
> Hi Andrew,
>
> you can use strsplit for a character vector; you do not have to call it for
> every element data$ComputerName[i].
>
> If I understand correctly, maybe something like this helps
>
>> ip <- "123.456.789.321"  ## example data
>> df <- data.frame(ip = rep(ip, 9), stringsAsFactors=FALSE)
>> df
>               ip
> 1 123.456.789.321
> 2 123.456.789.321
> 3 123.456.789.321
> 4 123.456.789.321
> 5 123.456.789.321
> 6 123.456.789.321
> 7 123.456.789.321
> 8 123.456.789.321
> 9 123.456.789.321
>
>>
>> res <- unlist(strsplit(df[["ip"]], "\\."))
>> ii <- seq(1, nrow(df)*4, by = 4)
>> res[ii]   ## A
> [1] "123" "123" "123" "123" "123" "123" "123"
> [8] "123" "123"
>> res[ii+1] ## B
> [1] "456" "456" "456" "456" "456" "456" "456"
> [8] "456" "456"
>> res[ii+2] ## C
> [1] "789" "789" "789" "789" "789" "789" "789"
> [8] "789" "789"
>> res[ii+3] ## D
> [1] "321" "321" "321" "321" "321" "321" "321"
> [8] "321" "321"
>
>
> Regards,
> Enrico
>
>
> On 08.01.2012 11:06, Andrew Roberts wrote:
>
>> Folks,
>>
>> I have a data frame with 4861469 rows that contains an ip address
>> xxx.xxx.xxx.xxx as one of the columns. I want to assign a site to each
>> row based on IP ranges. To do this I have a function to split the ip
>> address as character into class A,B,C and D components. It works but is
>> horribly inefficient in terms of speed. I can't quite see how one of the
>> l/s/m/t/apply functions could be brought to bear on the problem. Does
>> anyone have any thoughts?
>>
>> for(i in 1:4861469)
>>    {
>>    lst<-unlist(strsplit(data$ComputerName[i], "\\."))
>>    data$IPA[i]<-lst[[1]]
>>    data$IPB[i]<-lst[[2]]
>>    data$IPC[i]<-lst[[3]]
>>    data$IPD[i]<-lst[[4]]
>>    rm(lst)
>>    }
>>
>> Andrew
>>
>> Andrew Roberts
>> Children's Orthopaedic Surgeon
>> RJAH, Oswestry, UK
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Enrico Schumann
> Lucerne, Switzerland
> http://nmof.net/
>
>

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.