[R] reading in csv files, some of which have column names and some of which don't
peter dalgaard
pdalgd using gmail.com
Tue Aug 13 23:27:45 CEST 2019
Yes. Also, the original poster said that the files had the same column structure, so there may be stronger heuristics to see whether the first line is a header line. E.g., assuming that the first column is called "ID" (and doesn't have ID as a possible value) use
first <- readLines(file, 1)
if (grepl("^ID", first)) {
    ...   # first line is a header
} else {
    ...   # first line is data
}
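
To make the idea concrete, here is a minimal sketch of how that check could be wrapped into a reader that returns the same column names for every file. It is an illustration, not tested against the real files: read_maybe_header() and col_names are hypothetical names, "ID" as the first column name is an assumption, and the pattern assumes at least two columns (so the name is followed by a comma) and a name without regex metacharacters.

```r
# Hypothetical helper: decide per file whether the first line is a header.
# Assumes the first column name (default: first element of col_names)
# can never occur as a data value in that column.
read_maybe_header <- function(filename, col_names, first_name = col_names[1]) {
  first <- readLines(filename, n = 1)
  # Allow for an optionally quoted name, followed by the field separator.
  has_header <- grepl(paste0("^\"?", first_name, "\"?,"), first)
  dat <- read.csv(filename, header = has_header, colClasses = "character")
  # Files without a header come back as V1, V2, ...; give all files the same names.
  names(dat) <- col_names
  dat
}

# Combine a vector of file paths into one data frame (base R, no dplyr needed).
read_all <- function(files, col_names) {
  do.call(rbind, lapply(files, read_maybe_header, col_names = col_names))
}
```

With the file list from the original post this would be called as read_all(files.to.read, col_names = ...), where the vector of column names has to be supplied by hand, since the files without a header cannot provide it.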
-pd
> On 13 Aug 2019, at 20:39, Sarah Goslee <sarah.goslee using gmail.com> wrote:
>
> Like Bert, I can't see an easy approach for datasets that have
> character rather than numeric data. But here's a simple approach for
> distinguishing files that have possible character headers but numeric
> data.
>
>
>
> readheader <- function(filename) {
>
>     possibleheader <- read.table(filename, nrows=1, sep=",", header=FALSE)
>
>     if(all(is.numeric(possibleheader[,1]))) {
>         # no header
>         infile <- read.table(filename, sep=",", header=FALSE)
>     } else {
>         # has header
>         infile <- read.table(filename, sep=",", header=TRUE)
>     }
>
>     infile
> }
>
>
>
> #### file noheader.csv ####
>
> 1,1,1
> 2,2,2
> 3,3,3
>
>
> #### file hasheader.csv ####
>
> a,b,c
> 1,1,1
> 2,2,2
> 3,3,3
>
> ########################
>
>> readheader("noheader.csv")
> V1 V2 V3
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
>> readheader("hasheader.csv")
> a b c
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
>
> Sarah
>
> On Tue, Aug 13, 2019 at 2:00 PM Christopher W Ryan <cryan using binghamton.edu> wrote:
>>
>> Alas, we spend so much time and energy on data wrangling . . . .
>>
>> I'm given a collection of csv files to work with---"found data". They arose
>> via saving Excel files to csv format. They all have the same column
>> structure, except that some were saved with column names and some were not.
>>
>> I have a code snippet that I've used before to traverse a directory and
>> read into R all the csv files of a certain filename pattern within it, and
>> combine them all into a single dataframe:
>>
>> library(dplyr)
>> ## specify the csv files that I will want to access
>> files.to.read <- list.files(path = "H:/EH",
>>     pattern = "WICLeadLabOrdersDone.+", all.files = FALSE,
>>     full.names = TRUE, recursive = FALSE, ignore.case = FALSE,
>>     include.dirs = FALSE, no.. = FALSE)
>>
>> ## function to read csv files back in
>> read.csv.files <- function(filename) {
>> bb <- read.csv(filename, colClasses = "character", header = TRUE)
>> bb
>> }
>>
>> ## now read the csv files, as all character
>> b <- lapply(files.to.read, read.csv.files)
>>
>> ddd <- bind_rows(b)
>>
>> But this assumes that all files have column names in their first row. In
>> this case, some don't. Any advice how to handle it so that those with
>> column names and those without are read in and combined properly? The only
>> thing I've come up with so far is:
>>
>> ## function to read csv files back in
>> ## Unfortunately, some of the csv files are saved with column headers, and
>> ## some are saved without them.
>> ## This presents a problem when defining the function to read them:
>> ## header = TRUE or header = FALSE?
>> ## The best solution I can think of as of 13 August 2019 is to use
>> ## header = FALSE and skip the first row of every file. This will
>> ## sacrifice one record from each csv of about 80 files
>> read.csv.files <- function(filename) {
>>     bb <- read.csv(filename, colClasses = "character", header = FALSE,
>>                    skip = 1)
>>     bb
>> }
>>
>> This sacrifices about 80 out of about 1600 records. For my purposes in this
>> instance, this may be acceptable, but of course I'd rather not.
>>
>> Thanks.
>>
>> --Chris Ryan
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Sarah Goslee (she/her)
> http://www.numberwright.com
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk Priv: PDalgd using gmail.com