[R] Incremental ReadLines

Mike Marchywka marchywka at hotmail.com
Thu Apr 14 12:19:18 CEST 2011






----------------------------------------
> Date: Wed, 13 Apr 2011 10:57:58 -0700
> From: frederiklang at gmail.com
> To: r-help at r-project.org
> Subject: Re: [R] Incremental ReadLines
>
> Hi there,
>
> I am having a similar problem with reading in a large text file with around
> 550.000 observations with each 10 to 100 lines of description. I am trying
> to parse it in R but I have troubles with the size of the file. It seems
> like it is slowing down dramatically at some point. I would be happy for any

This probably happens when you run out of physical memory; you can
check by watching memory use in Task Manager. A "readline()"-style
approach doesn't fit R very well, since you want to hand R blocks of
data so that the inner loops, implemented largely in native code, can
operate efficiently. What you want is a data structure that uses disk
more effectively and hides those details from you and your algorithm.
This works best if the algorithm cooperates with the data structure to
avoid lots of disk thrashing. You could imagine a "read" that does
nothing until each item is needed, but people often want the whole
file validated before processing, and lots of details come up with
exception handling as you get fancy here. Note, of course, that your
parse output could be stored in a hash or something representing a DOM,
and this could get arbitrarily large. Since such a structure is
designed for random access, it may cause lots of thrashing if it is
partially on disk. Anything you can do to make access patterns more
regular, for example sorting your data, would help.
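
As a rough illustration of handing R blocks of lines rather than one
line at a time, something like the following might be a starting point.
It is only a sketch: read_ids() and chunk_size are names made up here,
the "Id:" pattern is taken from the quoted code below, and it assumes
each matching line contains a single "Id:" followed by a number.

```r
# Sketch only: read the file in blocks so grep() and sub() work on
# whole vectors instead of being called once per line.
# read_ids() and chunk_size are hypothetical names, not from the post.
read_ids <- function(path, chunk_size = 10000L) {
  con <- file(path, "rt")
  on.exit(close(con))

  pieces <- list()  # one numeric vector of ids per block
  k <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break

    # Keep only the lines containing the literal text "Id:"
    hits <- grep("Id:", lines, fixed = TRUE, value = TRUE)

    # Drop everything up to and including the first "Id:" and convert
    k <- k + 1L
    pieces[[k]] <- as.numeric(sub("^.*?Id:", "", hits, perl = TRUE))
  }

  # Build the data frame once, at the end, instead of growing it per row
  data.frame(Id = as.numeric(unlist(pieces)))
}
```

With that in place, write.csv(read_ids("filename.txt"), "data.csv")
would stand in for the loop. Only one block of lines is held at a time,
but the collected ids still have to fit in memory.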


> suggestions. Here is my code, which works fine when I am doing a subsample
> of my dataset.
>
> #Defining datasource
> file <- "filename.txt"
>
> #Creating placeholder for data and assigning column names
> data <- data.frame(Id=NA)
>
> #Starting by case = 0
> case <- 0
>
> #Opening a connection to data
> input <- file(file, "rt")
>
> #Going through cases
> repeat {
>   line <- readLines(input, n=1)
>   if (length(line)==0) break
>   if (length(grep("Id:",line)) != 0) {
>     case <- case + 1 ; data[case,] <-NA
>     split_line <- strsplit(line,"Id:")
>     data[case,1] <- as.numeric(split_line[[1]][2])
>   }
> }
>
> #Closing connection
> close(input)
>
> #Saving dataframe
> write.csv(data,'data.csv')
>
>
> Kind regards,
>
>
> Frederik
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
> Sent from the R help mailing list archive at Nabble.com.
>