[R] Exceptional slowness with read.csv

Dave Dixon <ddixon using swcp.com>
Mon Apr 8 22:23:41 CEST 2024


Good suggestion - I'll look into data.table.
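
For anyone following the thread, here is a minimal sketch of what the fread equivalent of the skip/nrows call quoted below might look like. The temporary file, its two columns, and the variable n_good are hypothetical stand-ins for the real 76-field file and its 2459465 known-good records. It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical stand-in for the real file: a header plus 10 data rows.
file_name <- tempfile(fileext = ".csv")
writeLines(c("id,label", sprintf("%d,row%d", 1:10, 1:10)), file_name)

# Read only the header line to recover the column names.
col_names <- names(fread(file_name, nrows = 0))

# Skip the header plus the first n_good records, then read the next 3.
# header = FALSE because the first line read is data, not column names.
n_good <- 6
chunk <- fread(file_name, skip = n_good + 1, nrows = 3,
               header = FALSE, col.names = col_names)
print(chunk)
```

Whether fread copes any better with the unmatched quotes in the real file is not established anywhere in this thread, so this only illustrates the call, not a fix.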

On 4/8/24 12:14, CALUM POLWART wrote:
> data.table's fread is also fast. Not sure about error handling. But I 
> can merge 300 csvs with a total of 0.5m lines and 50 columns in a 
> couple of minutes versus a lifetime with read.csv or readr::read_csv
>
>
>
> On Mon, 8 Apr 2024, 16:19 Stevie Pederson, 
> <stephen.pederson.au using gmail.com> wrote:
>
>     Hi Dave,
>
>     That's rather frustrating. I've found vroom (from the package vroom)
>     to be helpful with large files like this.
>
>     Does the following give you any better luck?
>
>     vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
>     Of course, when you know you've got errors & the files are big like
>     that it can take a bit of work resolving things. The command line
>     tools awk & sed might even be a good plan for finding lines that have
>     errors & figuring out a fix, but I certainly don't envy you.
>
>     All the best
>
>     Stevie
>
>     On Tue, 9 Apr 2024 at 00:36, Dave Dixon <ddixon using swcp.com> wrote:
>
>     > Greetings,
>     >
>     > I have a csv file of 76 fields and about 4 million records. I know
>     > that some of the records have errors - unmatched quotes,
>     > specifically. Reading the file with readLines and parsing the lines
>     > with read.csv(text = ...) is really slow. I know that the first
>     > 2459465 records are good. So I try this:
>     >
>     >  > startTime <- Sys.time()
>     >  > first_records <- read.csv(file_name, nrows = 2459465)
>     >  > endTime <- Sys.time()
>     >  > cat("elapsed time = ", endTime - startTime, "\n")
>     >
>     > elapsed time =   24.12598
>     >
>     >  > startTime <- Sys.time()
>     >  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>     >  > endTime <- Sys.time()
>     >  > cat("elapsed time = ", endTime - startTime, "\n")
>     >
>     > This appears to never finish. I have been waiting over 20 minutes.
>     >
>     > So why would (skip = 2459465, nrows = 5) take orders of magnitude
>     > longer than (nrows = 2459465)?
>     >
>     > Thanks!
>     >
>     > -dave
>     >
>     > PS: readLines(n=2459470) takes 10.42731 seconds.
>     >
>     > ______________________________________________
>     > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>     > https://stat.ethz.ch/mailman/listinfo/r-help
>     > PLEASE do read the posting guide
>     > http://www.R-project.org/posting-guide.html
>     <http://www.R-project.org/posting-guide.html>
>     > and provide commented, minimal, self-contained, reproducible code.
>     >
>
>
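
The original question notes that readLines gets through the file quickly and that the suspected problem is unmatched quotes. As a rough sketch of how those two observations could be combined to locate suspect lines in base R, the following reads past the known-good lines on an open connection and flags any line with an odd number of double quotes. The temporary file, n_good, and the chunk size of 1000 are hypothetical values chosen for illustration.

```r
# Hypothetical stand-in for the real file: line 4 has an unmatched quote.
file_name <- tempfile(fileext = ".csv")
writeLines(c('id,label',
             '1,"ok"',
             '2,"also ok"',
             '3,"broken',
             '4,"fine"'), file_name)

n_good <- 2   # lines already known to be good (header + first record here)

con <- file(file_name, open = "r")
invisible(readLines(con, n = n_good))   # discard the known-good lines
rest <- readLines(con, n = 1000)        # next chunk; stops early at end of file
close(con)

# Count double quotes per line; an odd count means one is unmatched.
n_quotes <- lengths(regmatches(rest, gregexpr('"', rest, fixed = TRUE)))
bad_lines <- n_good + which(n_quotes %% 2 == 1)
print(bad_lines)
```

This is a heuristic only: a quoted field that legitimately contains a line break would also show an odd count on each of the lines it spans, so flagged lines still need to be inspected by hand.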