[R] incomplete reading of a large csv file
Uwe Ligges
ligges at statistik.tu-dortmund.de
Fri Feb 21 20:16:45 CET 2020
On 21.02.2020 20:10, Christopher W. Ryan wrote:
> sessionInfo at end of message.
>
> I have data that I was given as an Excel .xlsx file. It contains 96266
> lines and 24 columns. I opened it in OpenOffice.org and saved it in .csv
> format, using the pipe character as a field separator. This produced a
> file with 96266 lines.
>
> When I read it into R thusly:
>
>> skip0.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",
> sep = "|", header = TRUE, comment.char = "", skip = 0)
>
> the resulting skip0.dd dataframe has only 58208 lines:
>
>> dim(skip0.dd)
> [1] 58208 24
>
>
> I've tried a variety of things to troubleshoot. Using head() and tail(),
> the expected first and last lines (comparing to the .csv file) do indeed
> exist in skip0.dd. Several arbitrary lines from the "middle" of the csv
> file are also present in the skip0.dd dataframe.
>
> I tried reading only the first column, which is integer, but still it
> appears that not all lines are read in:
>
>> classes <- c(NA, rep("NULL", 23))
>> skip01.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",
> sep = "|", header = TRUE, comment.char = "", skip = 0, colClasses = classes)
>> dim(skip01.dd)
> [1] 58208 1
>
> Skipping the first 50000 lines nominally should give me a dataframe of
> 46266 lines, or at least one of 50000 fewer lines than skip0.dd (i.e.
> 8208 lines), but it does neither:
>
>> skip50000.dd <-
> read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", sep = "|",
> header = TRUE, comment.char = "", skip = 50000)
>> dim(skip50000.dd)
> [1] 22170 24
>
> Any thoughts on what might be going wrong? Some funky characters from
> Excel or OpenOffice.org lurking in the .csv file?
Quotes are a typical problem. What happens if you try with the argument quote = ""?
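
To illustrate the suggestion above, here is a minimal sketch using made-up data (the file contents, column names, and the object names `broken` and `fixed` are assumptions for illustration, not taken from the original file). It shows how a stray double quote inside a field can make read.csv() silently merge several physical lines into one record, and how quote = "" avoids that:

```r
# Write a small pipe-separated file in which two text fields contain an
# unbalanced double quote (hypothetical data, not the poster's file).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id|street|note",
  "1|12 Main St|ok",
  "2|5\" water main|leak",   # a lone double quote starts a quoted string
  "3|14 Court St|ok",
  "4|7 Elm St|ok",
  "5|3\" pipe|burst",        # the next double quote ends it
  "6|9 Oak Ave|ok"
), tmp)

# Number of data lines physically present in the file (header excluded).
n_lines <- length(readLines(tmp)) - 1

# Default quoting for read.csv(): the double quote is a quoting character,
# so the text between the two stray quotes is read as part of one field.
broken <- read.csv(tmp, sep = "|", header = TRUE, comment.char = "")

# Quoting disabled: every physical line becomes one row.
fixed <- read.csv(tmp, sep = "|", header = TRUE, comment.char = "",
                  quote = "")

c(lines_in_file = n_lines, default_quote = nrow(broken),
  no_quote = nrow(fixed))

# With quoting disabled, every line should have the same number of fields.
table(count.fields(tmp, sep = "|", quote = ""))
```

If the row count with quote = "" matches the number of lines in the file, stray quote characters are the likely cause. Note that disabling quoting is only safe when no field contains the separator itself or an embedded line break.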
>
> Perhaps I'd have more success with one of the packages that enables
> reading directly from an .xlsx file.
>
> Thanks.
>
> --Chris Ryan
> SUNY Upstate Medical University Binghamton Clinical Campus
> Broome County Health Department
> Binghamton University
>
>
> ####################################
>> sessionInfo()
> R version 3.5.3 (2019-03-11)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 17763)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] dplyr_0.8.3 stringr_1.4.0 Hmisc_4.2-0 ggplot2_3.2.1
> [5] Formula_1.2-3 survival_2.44-1.1 lattice_0.20-38
>
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.1 pillar_1.4.0 compiler_3.5.3
> [4] RColorBrewer_1.1-2 tools_3.5.3 base64enc_0.1-3
> [7] digest_0.6.18 zeallot_0.1.0 rpart_4.1-13
> [10] checkmate_1.9.3 tibble_2.1.1 gtable_0.3.0
> [13] htmlTable_1.13.1 pkgconfig_2.0.2 rlang_0.4.0
> [16] Matrix_1.2-15 rstudioapi_0.10 xfun_0.7
> [19] gridExtra_2.3 knitr_1.23 withr_2.1.2
> [22] cluster_2.0.7-1 htmlwidgets_1.3 vctrs_0.2.0
> [25] grid_3.5.3 nnet_7.3-12 tidyselect_0.2.5
> [28] data.table_1.12.2 glue_1.3.1 R6_2.4.0
> [31] foreign_0.8-71 latticeExtra_0.6-28 purrr_0.3.2
> [34] magrittr_1.5 htmltools_0.3.6 backports_1.1.4
> [37] scales_1.0.0 splines_3.5.3 assertthat_0.2.1
> [40] colorspace_1.4-1 stringi_1.4.3 acepack_1.4.1
> [43] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4
>