[R] splitting very long character string
Marc Schwartz
marc_schwartz at comcast.net
Wed Nov 1 17:05:16 CET 2006
On Wed, 2006-11-01 at 16:47 +0100, Arne.Muller at sanofi-aventis.com wrote:
> Hello,
>
> I've a very long character array (>500k characters) that I need to split
> by '\n' resulting in an array of about 60k numbers. The help on
> strsplit says to use perl=TRUE to get better performance, but still it
> takes several minutes to split this string.
>
> The massive string is the return value of a call to
> xmlElementsByTagName from the XML library and looks like this:
>
> ...
> 12345
> 564376
> 5674
> 6356656
> 5666
> ...
>
> I've to read about a hundred of these files and was wondering whether
> there's a more efficient way to turn this string into an array of
> numerics. Any ideas?
>
> thanks a lot for your help
> and kind regards,
>
> Arne
>
> Vec <- sample(c(0:9, "\n"), 500000, replace = TRUE)
> str(Vec)
chr [1:500000] "7" "0" "9" "6" "5" "3" "1" "9" ...
> table(Vec)
Vec
   \n     0     1     2     3     4     5     6     7     8     9
45432 45723 45641 45526 45460 45284 45378 45392 45374 45314 45476
> sink("Vec.txt")
> cat(Vec)
> sink()
First 10 lines of Vec.txt:
7 0 9 6 5 3 1 9 8 1 8 3 4 2
1 2 2
3 7 7 6 8 3 4 7 4
9 2 1 9 8 7 2 0 9 4 3
9 3 5 2 2 5 8 0 5 4 5 6 1 5 8 7 4 1 2 8 3 2 6 4 9 4 1 6 8 5 0 8 8 8 5 3 0 5 3 5 4 8 5 4 3
9
5 3 6 5 8 9 7 6 9
5 8
2 4 6
5
> system.time(Vec.Split <- scan("Vec.txt", sep = "\n"))
Read 41276 items
[1] 0.180 0.004 0.186 0.000 0.000
> str(Vec.Split)
num [1:41276] 7.10e+13 1.22e+02 3.78e+08 9.22e+10 9.35e+44 ...
> sprintf("%.0f", Vec.Split[1:10])
[1] "70965319818342"
[2] "122"
[3] "377683474"
[4] "92198720943"
[5] "935225805456158720742405574866620654670577664"
[6] "9"
[7] "536589769"
[8] "58"
[9] "246"
[10] "5"
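If the string is already in memory, the same scan() idea can probably be applied without going through a file. The following is only a minimal sketch: the variable name x and its short contents are made up here to stand in for the value returned by the XML call, and I have not timed it on a 500k-character string. It reads the string through textConnection() and, for comparison, splits it with strsplit() using fixed = TRUE, which treats '\n' as a literal string rather than a regular expression.

```r
# Hypothetical stand-in for the long string returned by the XML call
x <- "12345\n564376\n5674\n6356656\n5666"

# Option 1: read the string through a text connection, no temporary file
con <- textConnection(x)
nums.scan <- scan(con, sep = "\n", quiet = TRUE)
close(con)

# Option 2: split on the literal newline (no regex matching), then convert
nums.split <- as.numeric(strsplit(x, "\n", fixed = TRUE)[[1]])

str(nums.scan)
str(nums.split)
```

Both calls should give the same numeric vector, so system.time() on one of the real strings would show which is faster in practice.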
Does that help?
Marc Schwartz