[R] vectorize data string analysis

John McKown john.archie.mckown at gmail.com
Tue Mar 3 14:32:59 CET 2015


Have you looked at readChar()? You can use it to read your input file
in undelimited chunks of 220 bytes and, at the same time, split each
chunk into its fields. Look at the example in the help, ?readChar.
Unfortunately I can't really see exactly where the columns break, so
the field widths below are a guess. I have your data in a file (x.data
in my case), and I accessed it as follows in R:

> lengths<-c(9,9,66,40,30,15,2,5,6,4,8,26)
> sum(lengths)
[1] 220
> zz=file("x.data","rb")
> readChar(zz,lengths)
 [1] "31365EJ46"
 [2] " CI125483"
 [3] "  00002003473100OCT03000003103340610.1548980406.500030197040112180"
 [4] "MULTIPLE POOL                           "
 [5] "                              "
 [6] "               "
 [7] "  "
 [8] "00000"
 [9] "070147"
[10] "FNMS"
[11] " 06.500 "
[12] "CI125483070170096000000000"
> readChar(zz,lengths)
 [1] "31371KMA6"
 [2] " CL254253"
 [3] "  00001304570700OCT03000010156865640.7785600006.000030102030132357"
 [4] "MULTIPLE POOL                           "
 [5] "                              "
 [6] "               "
 [7] "  "
 [8] "00000"
 [9] "067230"
[10] "FNMS"
[11] " 06.000 "
[12] "CL254253067150333000000000"
> # a lot more of the above, and finally
> readChar(zz,lengths)
 [1] "31403GNG3"
 [2] " LB748391"
 [3] "  00000715661500OCT03000007007212290.9791238304.379090103080133358"
 [4] "DLJ MORTGAGE CAPITAL INC.               "
 [5] "ELEVEN MADISON AVENUE         "
 [6] "NEW YORK       "
 [7] "NY"
 [8] "10010"
 [9] "056530"
[10] "FNAR"
[11] " XX.XXX "
[12] "LB748391000000000000000000"
> readChar(zz,lengths)
[1] "\n"
> readChar(zz,lengths)
character(0)

Note that the next-to-last readChar() returned "\n" simply because my
editor puts a newline at the end of the file, so it likely won't be
there in your file. The final character(0) means the end of the file
has been reached.
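
Since readChar() accepts a whole vector of lengths, you should not need
to call it once per record. Here is a minimal sketch that reads every
record in a single call. The function name read_fixed_records is
something I made up for illustration, the widths are the same guessed
ones as above, and I have not timed it on your 518,000-record file.

```r
# Field widths from the session above; they sum to the 220-byte record.
lengths <- c(9, 9, 66, 40, 30, 15, 2, 5, 6, 4, 8, 26)

# Hypothetical helper: read all fixed-width records from `path` with one
# readChar() call and return a character matrix, one row per record.
read_fixed_records <- function(path, widths) {
  reclen <- sum(widths)
  # Whole records only; a trailing newline or partial record is ignored.
  nrec <- file.info(path)$size %/% reclen
  con <- file(path, "rb")
  on.exit(close(con))
  # Repeat the width pattern once per record so every field is read at once.
  fields <- readChar(con, rep(widths, nrec), useBytes = TRUE)
  matrix(fields, ncol = length(widths), byrow = TRUE)
}

# Example use (assumes x.data exists; the output file name is arbitrary):
# recs <- read_fixed_records("x.data", lengths)
# write.table(recs, "factor.txt", sep = "\t", quote = FALSE,
#             row.names = FALSE, col.names = FALSE)
```

The matrix has one column per field, so a single write.table() call
should give you a delimited file to load into MySQL. Fields keep their
padding blanks; trimws() (or sub()) can strip them if you need that.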

On Mon, Mar 2, 2015 at 9:01 PM, Glenn Schultz <glennmschultz at me.com> wrote:
> Hello All,
>
> I have to admit that I am not that good when it comes to vectorizing a
> function.  I need some insight.  Is the below a case where vectorization can
> be accomplished to improve speed?
>
> Below the function is a sample of the data - as you can see, it is not delimited.
> However, the record length is 220 characters.  So I wrote the following code
> to delimit the data set "/r".  The function works and I have a dataset that
> can then be inserted into a MySql data table.  However, the actual data set
> is 518,000 records so the number of characters is 518000 * 220.  It takes R
> hours to parse this using the function I have written.  Can this be
> vectorized or is this a loop deal?
>
> Best Regards,
> Glenn
>
> #' FNMA Factor
>   #'
>   #' This function parses the FNMA factor file for loading into
>   #' a database table. The FNMA factor file is non-delimited.
>   #' @param filepath A character vector specifying a data directory
>   #' @param length of the line A numeric value equal to the length of a line
>   #' @export
>   FNMAFactor <- function(filepath = character){
>   callpath <- paste(filepath,"mbsfact.txt", sep = "")
>   returnpath <- paste(filepath,"factor.txt", sep = "")
>   data <- readLines(con = callpath)
>   numchar <- nchar(data, type = "chars")
>   start <- c(seq(1, numchar, 220))
>   end <- c(seq(220, numchar, 220))
>   for(i in 1 : length(start)){
>   write(str_sub(data, start[i], end[i]), file = returnpath, append = TRUE)}
>   }
>
>
>
> 31365EJ46 CI125483
> 00002003473100OCT03000003103340610.1548980406.500030197040112180MULTIPLE
> POOL
> 00000070147FNMS 06.500 CI12548307017009600000000031371KMA6 CL254253
> 00001304570700OCT03000010156865640.7785600006.000030102030132357MULTIPLE
> POOL
> 00000067230FNMS 06.000 CL25425306715033300000000031371RE44 CL259455
> 00000983651400OCT03000003447615880.3504916406.500050102050132357MULTIPLE
> POOL
> 00000070200FNMS 06.500 CL25945507045034000000000031376KBB1 CL357434
> 00002505145900OCT03000025021294240.9987958905.000090103090133359MULTIPLE
> POOL
> 00000055000FNMS 05.000 CL35743405500035800000000031385XE52 WS555556
> 00003651248300OCT03000033344198060.9132273504.575050103050133356MEGA POOL
> ** NOT AN ACTIVE SERVICER **                   00000052440FNAR 04.595
> WS55555600000000000000000031385XLL9 WS555731
> 00013439369600OCT03000129242191330.9616685505.360080103040133352MEGA POOL
> ** NOT AN ACTIVE SERVICER **                   00000075160FNAR 05.368
> WS55573100000000000000000031390XG87 CI659123
> 00000208856500OCT03000001136251660.5440346206.000080102080117179WASHINGTON
> MUTUAL BANK, FA              19850 PLUMMER STREET          CHATSWORTH
> CA91311069210FNMS 06.000 CI65912306909016500000000031403BTR4 CL744060
> 00000770371700OCT03000007694084860.9987496805.000090103080133356MULTIPLE
> POOL
> 00000053920FNMS 05.000 CL74406000000000000000000031403GND0 LB748388
> 00000952312900OCT03000009512089400.9988407604.525090103080133358DLJ MORTGAGE
> CAPITAL INC.               ELEVEN MADISON AVENUE         NEW YORK
> NY10010058430FNAR XX.XXX LB74838800000000000000000031403GNG3 LB748391
> 00000715661500OCT03000007007212290.9791238304.379090103080133358DLJ MORTGAGE
> CAPITAL INC.               ELEVEN MADISON AVENUE         NEW YORK
> NY10010056530FNAR XX.XXX LB748391000000000000000000
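
If you would rather keep the readLines() approach from your function,
the loop itself can go, because substring() is vectorized over its
start and end positions. A minimal sketch, assuming the whole file
comes back as one long string; split_records is a name I made up. I
suspect, but have not measured, that most of the time in your loop goes
to appending to the output file once per record.

```r
# Hypothetical helper: cut one long string into pieces of `reclen`
# characters. A shorter final piece is kept as-is.
split_records <- function(txt, reclen = 220L) {
  n <- nchar(txt, type = "chars")
  if (n == 0L) return(character(0))
  starts <- seq.int(1L, n, by = reclen)
  # substring() takes vectors of start and end positions, so no loop.
  substring(txt, starts, pmin(starts + reclen - 1L, n))
}

# Example use (the file names are the ones from your function):
# data <- readLines("mbsfact.txt")
# writeLines(split_records(data[1]), "factor.txt")
```

writeLines() then writes all the records in one call instead of
reopening the output file for every record.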



-- 
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown


