[R] Exceptional slowness with read.csv
Rui Barradas
ru|pb@rr@d@@ @end|ng |rom @@po@pt
Wed Apr 10 15:46:00 CEST 2024
At 06:47 on 08/04/2024, Dave Dixon wrote:
> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with read.csv(text
> = ...) is really slow. I know that the first 2459465 records are good.
> So I try this:
>
> > startTime <- Sys.time()
> > first_records <- read.csv(file_name, nrows = 2459465)
> > endTime <- Sys.time()
> > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time = 24.12598
>
> > startTime <- Sys.time()
> > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> > endTime <- Sys.time()
> > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>
Hello,
Can the following function be of help?
After reading the data with argument quote = "", so that quotes are not
interpreted, call a function that applies gregexpr to the character
columns and transforms the output into a two-column data.frame with columns
Col - the column processed;
Unbalanced - the rows with unbalanced double quotes.
I am assuming the quotes are double quotes. It shouldn't be difficult to
adapt it to other cases: single quotes, or both.
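unbalanced_dquotes <- function(x) {
  # indices of the character columns
  char_cols <- sapply(x, is.character) |> which()
  lapply(char_cols, \(i) {
    y <- x[[i]]
    # gregexpr returns -1 when there is no match, so count only
    # positive positions; an odd count means unbalanced quotes
    Unbalanced <- gregexpr('"', y, fixed = TRUE) |>
      vapply(\(m) sum(m > 0L), integer(1)) |>
      {\(n) (n %% 2L) == 1L}() |>
      which()
    # rep() keeps this valid when no row of the column is flagged
    data.frame(Col = rep(i, length(Unbalanced)), Unbalanced = Unbalanced)
  }) |>
    do.call(rbind, args = _)
}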
# read the data disregarding quoted strings
# (fl is the path to the csv file)
df1 <- read.csv(fl, quote = "")
# determine which strings have unbalanced quotes and
# where
unbalanced_dquotes(df1)
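As a rough sketch of how the adaptation to other quote characters might look, here is a self-contained variant run on a small made-up file. The names unbalanced_quotes, quote_char and Row, and the toy data, are only assumptions for illustration and are not part of the function above. It reports the column by name rather than by index.

```r
# Hypothetical generalisation: the quote character is an argument.
unbalanced_quotes <- function(x, quote_char = '"') {
  char_cols <- which(vapply(x, is.character, logical(1)))
  res <- lapply(char_cols, function(i) {
    # positions of quote_char in each string; -1 means "no match"
    m <- gregexpr(quote_char, x[[i]], fixed = TRUE)
    # number of occurrences per string (NA strings stay NA)
    n <- vapply(m, function(p) sum(p > 0L), integer(1))
    # an odd count means at least one unmatched quote
    rows <- which(n %% 2L == 1L)
    data.frame(Col = rep(names(x)[i], length(rows)), Row = rows)
  })
  do.call(rbind, unname(res))
}

# toy file: record 2 has an unmatched double quote,
# record 3 has a single apostrophe
tmp <- tempfile(fileext = ".csv")
writeLines(c('a,b', '1,"ok"', '2,"broken', "3,it's"), tmp)

toy <- read.csv(tmp, quote = "")
dq <- unbalanced_quotes(toy)       # double quotes
sq <- unbalanced_quotes(toy, "'")  # single quotes
print(dq)
print(sq)
```

On this toy file the double-quote check should flag row 2 of column b and the single-quote check row 3. The row numbers refer to data rows, so with a header line the corresponding line in the file is one higher.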
Hope this helps,
Rui Barradas