[R] Parsing

Wed Jul 9 16:57:27 CEST 2008

Thanks so much, Jim! It works without a glitch!
My only problem is that the text files to be parsed are quite big, up to
several thousand rows (my apologies for the incomplete information in
my former post), so loops are not my first choice. I'll take a look at
'lapply' using your code as a model. Thanks again!

Sincerely,
Paolo
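
[Note: the 'lapply' idea mentioned above could look roughly like the
sketch below. It is only a sketch under stated assumptions, not code from
the thread: the function name parse_records and the keys argument are
hypothetical, records are assumed to be separated by lines that are
exactly "//", and each wanted line is assumed to be "key value"
separated by whitespace. The rows are collected in a list and bound
once, instead of growing a data.frame inside a loop.]

```r
# Hypothetical helper (not from the thread): parse "//"-delimited records.
parse_records <- function(lines, keys = c("x", "y", "id1", "id2", "z", "w")) {
  is_end <- lines == "//"
  # Record index of each line: number of "//" delimiters seen so far.
  rec <- cumsum(is_end)
  # One character vector per record, delimiter lines dropped.
  blocks <- split(lines[!is_end], rec[!is_end])
  rows <- lapply(blocks, function(b) {
    parts <- strsplit(b, "\\s+")
    key <- vapply(parts, `[`, character(1), 1)  # first token of each line
    val <- vapply(parts, `[`, character(1), 2)  # second token of each line
    val[match(keys, key)]                       # NA where a key is absent
  })
  out <- do.call(rbind, rows)                   # bind all rows in one step
  dimnames(out) <- list(NULL, keys)
  out
}

# Usage with a small made-up input:
x <- c("x  x_string", "y  y_string", "id1  id1_string", "stuff  stuff  stuff", "//",
       "x  x_string1", "w  w_string1", "stuff  stuff  stuff", "//")
res <- parse_records(x)
```

[The result is a character matrix with one row per record. A record that
contains no lines at all (two consecutive "//") would be skipped by this
sketch.]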

jim holtman wrote:
> This should do what you want. It uses loops; you can work at
> replacing those with 'lapply' and such. It all depends on whether it
> is going to take you more time to rewrite the code than to process a
> set of data (you never did say how large the data was). This also
> "grows" a data.frame, but you have not indicated how efficient it has
> to be. So this could be used as a model.
>
>> x <- readLines(textConnection("x      x_string
> + y      y_string
> + id1    id1_string
> + id2    id2_string
> + z      z_string
> + w      w_string
> + stuff  stuff  stuff
> + stuff  stuff  stuff
> + stuff  stuff  stuff
> + //
> + x      x_string1
> + y      y_string1
> + z      z_string1
> + w      w_string1
> + stuff  stuff  stuff
> + stuff  stuff  stuff
> + stuff  stuff  stuff
> + //
> + x      x_string2
> + y      y_string2
> + id1    id1_string1
> + id2    id2_string1
> + z      z_string2
> + w      w_string2
> + stuff  stuff  stuff
> + stuff  stuff  stuff
> + stuff  stuff  stuff
> + //"))
>   
>> # I assume that each group is delimited by "//"
>> # initialize data.frame with desired values
>> .keys <- data.frame(x=NA, y=NA, id1=NA, id2=NA, w=NA)
>> .out <- .keys  # for the first pass
>> .save <- NULL
>> for (i in seq_along(x)){
> +     if (x[i] == "//"){  # output the current data
> +         .save <- rbind(.save, .out)
> +         .out <- .keys    # setup for the next pass
> +     } else {
> +         .split <- strsplit(x[i], "\\s+")
> +         if (.split[[1]][1] %in% names(.out)){
> +             .out[[.split[[1]][1]]] <- .split[[1]][2]
> +         }
> +     }
> + }
>> .save
>           x         y         id1         id2         w
> 1  x_string  y_string  id1_string  id2_string  w_string
> 2 x_string1 y_string1        <NA>        <NA> w_string1
> 3 x_string2 y_string2 id1_string1 id2_string1 w_string2
>
>
> On Wed, Jul 9, 2008 at 5:33 AM, Paolo Sonego <paolo.sonego a gmail.com> wrote:
>   
>> Dear R users,
>>
>> I have a big text file formatted like this:
>>
>> x      x_string
>> y      y_string
>> id1    id1_string
>> id2    id2_string
>> z      z_string
>> w      w_string
>> stuff  stuff  stuff
>> stuff  stuff  stuff
>> stuff  stuff  stuff
>> //
>> x      x_string1
>> y      y_string1
>> z      z_string1
>> w      w_string1
>> stuff  stuff  stuff
>> stuff  stuff  stuff
>> stuff  stuff  stuff
>> //
>> x      x_string2
>> y      y_string2
>> id1    id1_string1
>> id2    id2_string1
>> z      z_string2
>> w      w_string2
>> stuff  stuff  stuff
>> stuff  stuff  stuff
>> stuff  stuff  stuff
>> //
>> ...
>> ...
>>
>>
>> I'd like to parse this file, retrieve the x, y, id1, id2, z, w fields, and
>> save them into a matrix object:
>>
>> x         y         id1         id2         z         w
>> x_string  y_string  id1_string  id2_string  z_string  w_string
>> x_string1 y_string1 NA          NA          z_string1 w_string1
>> x_string2 y_string2 id1_string1 id2_string1 z_string2 w_string2
>> ...
>> ...
>>
>> The id1 and id2 fields are not always present within a section (the
>> interval between x and the last stuff line), and I'd like to insert an NA
>> when they are absent (see above) so that all columns have the same length:
>> length(x)==length(y)==length(id1)==... .
>>
>> Without the id1, id2 fields the task is easily solved by importing the text
>> file with readLines and retrieving the single fields with grep:
>>
>> input = readLines("file.txt")
>> x = grep("^x\\s", input, value = TRUE)
>> id1 = grep("^id1\\s", input, value = TRUE)
>> ...
>>
>> I'd like to accomplish this task entirely in R (no SQL, no perl script),
>>  possibly without using loops.
>>
>> Any suggestions are quite welcome!
>>
>> Regards,
>> Paolo
>>
>> ______________________________________________
>> R-help a r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
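
[Note: for the original request (no explicit loops, NA for missing
fields), one possible approach is to number the records with cumsum and
fill a pre-allocated matrix by matrix indexing. This is a sketch under
assumptions, not a solution posted in the thread: the function name
parse_vectorized and the keys argument are hypothetical, a record ends
at a line that is exactly "//", and every wanted line has the form
"key value" with a non-empty value.]

```r
# Hypothetical helper (not from the thread): loop-free parse into a matrix.
parse_vectorized <- function(lines, keys = c("x", "y", "id1", "id2", "z", "w")) {
  is_end <- lines == "//"
  rec <- cumsum(is_end) + 1L                      # record number of every line
  key <- sub("\\s.*$", "", lines)                 # first whitespace-delimited token
  val <- sub("^\\S+\\s+(\\S+).*$", "\\1", lines)  # second token (assumes one exists)
  col <- match(key, keys)                         # NA for "stuff" and "//" lines
  keep <- !is.na(col)
  # Number of rows: count of delimiters, or more if the last record is unterminated.
  n_rec <- max(sum(is_end), rec[keep])
  out <- matrix(NA_character_, nrow = n_rec, ncol = length(keys),
                dimnames = list(NULL, keys))
  # Two-column index matrix: (row, column) for every kept line.
  out[cbind(rec[keep], col[keep])] <- val[keep]
  out
}

# Usage with a small made-up input:
x <- c("x  x_string", "y  y_string", "id1  id1_string", "stuff  stuff  stuff", "//",
       "x  x_string1", "w  w_string1", "stuff  stuff  stuff", "//")
res <- parse_vectorized(x)
```

[Fields missing from a record stay NA because the matrix is initialized
with NA_character_, so every column has one entry per record.]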