[R] Exceptional slowness with read.csv
Dave Dixon
ddixon at swcp.com
Mon Apr 8 22:22:20 CEST 2024
Thanks, yeah, I think scan is more promising. I'll check it out.
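
For reference, here is a minimal sketch of the kind of check scan allows. It is only an illustration under stated assumptions, not what was run on the real file: demo_file, n_good and the sample records are made up, and the real file would use its own path and skip count. The idea is to read the lines as raw text with quote processing turned off (quote = ""), so an unmatched quote cannot pull the following lines into one huge field, and then to count the double quotes on each line.

```r
# Hypothetical demo file standing in for the real CSV: a header, two clean
# records, one record with an unmatched double quote, and one more clean record.
demo_file <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,value",
  "1,alpha,10",
  "2,beta,20",
  "3,\"gamma,30",
  "4,delta,40"
), demo_file)

# Number of leading lines assumed to be good (header + 2 clean records here).
n_good <- 3

# Read up to 5 raw lines after the good part. sep = "\n" makes each line one
# element; quote = "" disables quote handling, so a stray quote is just text.
raw_lines <- scan(demo_file, what = character(), sep = "\n", quote = "",
                  skip = n_good, nlines = 5, quiet = TRUE)
print(raw_lines)

# Count double quotes per line; an odd count flags a likely unmatched quote.
n_quotes <- lengths(regmatches(raw_lines,
                               gregexpr("\"", raw_lines, fixed = TRUE)))
suspect <- raw_lines[n_quotes %% 2 == 1]
print(suspect)
```

An odd quote count is only a heuristic: a field that legitimately spans several lines inside quotes would also be flagged, so any line reported this way still needs a look by eye.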
On 4/8/24 11:49, Bert Gunter wrote:
> No idea, but have you tried using ?scan to read those next 5 rows? It
> might give you a better idea of the pathologies that are causing
> problems. For example, an unmatched quote might result in some huge
> number of characters trying to be read into a single element of a
> character variable. As your previous respondent said, resolving such
> problems can be a challenge.
>
> Cheers,
> Bert
>
> On Mon, Apr 8, 2024 at 8:06 AM Dave Dixon <ddixon at swcp.com> wrote:
>
> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with
> read.csv(text = ...) is really slow. I know that the first 2459465
> records are good. So I try this:
>
> > startTime <- Sys.time()
> > first_records <- read.csv(file_name, nrows = 2459465)
> > endTime <- Sys.time()
> > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time = 24.12598
>
> > startTime <- Sys.time()
> > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> > endTime <- Sys.time()
> > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> than (nrows = 2459465)?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.