[R] Exceptional slowness with read.csv
Dave Dixon
dd|xon @end|ng |rom @wcp@com
Wed Apr 10 18:19:52 CEST 2024
That's basically what I did (rough sketch below):
1. Get the text lines using readLines.
2. Use tryCatch to parse each line with read.csv(text = ...).
3. In the error handler, use gregexpr to find any quotes not adjacent to
a comma: gregexpr("[^,]\"[^,]", ...).
4. Escape each quote found by doubling it (using str_sub from stringr).
5. Re-parse the patched line with read.csv(text = ...).
6. Write out the parsed fields as I go along with write.table(...,
append = TRUE), so I'm not keeping too much in memory.
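In outline it looked something like this - a simplified sketch rather
than my exact script, with placeholder file names and record-by-record
output (the real thing buffers more before writing):

library(stringr)

in_file  <- "raw_data.csv"       # placeholder file names
out_file <- "repaired_data.csv"

lines  <- readLines(in_file)
header <- lines[1]

for (i in seq_along(lines)[-1]) {
  # try to parse the single record against the header
  rec <- tryCatch(
    read.csv(text = paste(header, lines[i], sep = "\n")),
    error   = function(e) NULL,
    warning = function(w) NULL
  )
  if (is.null(rec)) {
    # find quotes not adjacent to a comma; the quote itself sits one
    # character past the match start
    m <- gregexpr('[^,]"[^,]', lines[i])[[1]]
    if (m[1] != -1L) {
      # double each stray quote, right to left so positions stay valid
      for (pos in rev(m + 1L)) {
        str_sub(lines[i], pos, pos) <- '""'
      }
    }
    rec <- read.csv(text = paste(header, lines[i], sep = "\n"))
  }
  # append the repaired record so little is kept in memory
  write.table(rec, out_file, sep = ",", row.names = FALSE,
              col.names = (i == 2), append = (i > 2))
}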
I went directly to tryCatch because there were 3.5 million records, and
I only expected a few to have errors.
I found only 6 bad records, but the repair had to be done to make the
data file usable with read.csv(), for the benefit of other researchers
using these data.
On 4/10/24 07:46, Rui Barradas wrote:
> At 06:47 on 08/04/2024, Dave Dixon wrote:
>> Greetings,
>>
>> I have a csv file of 76 fields and about 4 million records. I know
>> that some of the records have errors - unmatched quotes,
>> specifically. Reading the file with readLines and parsing the lines
>> with read.csv(text = ...) is really slow. I know that the first
>> 2459465 records are good. So I try this:
>>
>> > startTime <- Sys.time()
>> > first_records <- read.csv(file_name, nrows = 2459465)
>> > endTime <- Sys.time()
>> > cat("elapsed time = ", endTime - startTime, "\n")
>>
>> elapsed time = 24.12598
>>
>> > startTime <- Sys.time()
>> > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>> > endTime <- Sys.time()
>> > cat("elapsed time = ", endTime - startTime, "\n")
>>
>> This appears to never finish. I have been waiting over 20 minutes.
>>
>> So why would (skip = 2459465, nrows = 5) take orders of magnitude
>> longer than (nrows = 2459465) ?
>>
>> Thanks!
>>
>> -dave
>>
>> PS: readLines(n=2459470) takes 10.42731 seconds.
>>
> Hello,
>
> Can the following function be of help?
> After reading the data with quote = "" (so quoted strings are not
> treated specially), call a function that applies gregexpr to the
> character columns and turns the result into a two-column data.frame
> with columns
>
> Col - the column processed;
> Unbalanced - the rows with unbalanced double quotes.
>
> I am assuming the quotes are double quotes. It shouldn't be difficult
> to adapt it to other cases: single quotes, or both.
>
>
>
>
> unbalanced_dquotes <- function(x) {
>   # only character columns can contain quotes
>   char_cols <- sapply(x, is.character) |> which()
>   lapply(char_cols, \(i) {
>     y <- x[[i]]
>     # count the double quotes in each string (0 when there are none)
>     n_quotes <- regmatches(y, gregexpr('"', y)) |> lengths()
>     # rows with an odd number of quotes have an unbalanced quote
>     Unbalanced <- which(n_quotes %% 2L == 1L)
>     data.frame(Col = i, Unbalanced = Unbalanced)
>   }) |>
>     do.call(rbind, args = _)
> }
>
> # read the data, disregarding quoted strings
> df1 <- read.csv(fl, quote = "")
> # determine which strings have unbalanced quotes, and where
> unbalanced_dquotes(df1)
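> For example, on a small made-up data frame (hypothetical data, just to
> show the shape of the result):
>
> df_test <- data.frame(
>   a = c('say "hi"', 'unmatched " quote', 'plain'),
>   b = c('ok', 'also "fine"', 'another " stray')
> )
> unbalanced_dquotes(df_test)
> #   Col Unbalanced
> # a   1          2
> # b   2          3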
>
>
> Hope this helps,
>
> Rui Barradas
>
>