[R] reading in csv files, some of which have column names and some of which don't

peter dalgaard pdalgd using gmail.com
Tue Aug 13 23:27:45 CEST 2019


Yes. Also, the original poster said that the files all have the same column structure, so there may be a stronger heuristic for deciding whether the first line is a header line. E.g., assuming that the first column is called "ID" (and "ID" cannot occur as a value in that column), use

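first <- readLines(file, 1)
if (grepl("^ID", first)) {
  ...  # first line is a header: read with header = TRUE
} else {
  ...  # first line is data: read with header = FALSE
}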
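
A minimal sketch of how that check might slot into the original poster's reading function, not a tested solution for the actual files. The helper name read_maybe_header, the column names ID/x/y, and the temporary demo files are all hypothetical, and the sketch assumes every file is non-empty and has the same number of columns. It uses base rbind to stay dependency-free; dplyr::bind_rows should work equally well.

```r
# Hypothetical helper: read one csv whose first line may or may not be a header.
# col_names: the column names shared by all files (assumed known).
# id_name:   the assumed name of the first column, which must never be a data value.
read_maybe_header <- function(filename, col_names, id_name = "ID") {
  first <- readLines(filename, n = 1)
  # Header present if the line starts with the (optionally quoted) name and a comma
  has_header <- grepl(paste0("^\"?", id_name, "\"?,"), first)
  dat <- read.csv(filename, colClasses = "character", header = has_header)
  stopifnot(ncol(dat) == length(col_names))
  names(dat) <- col_names  # same names for every file, header or not
  dat
}

# Demo on two small temporary files, one with a header line and one without
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("ID,x,y", "1,a,b", "2,c,d"), f1)
writeLines(c("3,e,f", "4,g,h"), f2)

b <- lapply(c(f1, f2), read_maybe_header, col_names = c("ID", "x", "y"))
combined <- do.call(rbind, b)
```

Because the header decision is made per file, no data row has to be sacrificed, and assigning the names explicitly keeps the files stackable.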

-pd
 

> On 13 Aug 2019, at 20:39 , Sarah Goslee <sarah.goslee using gmail.com> wrote:
> 
> Like Bert, I can't see an easy approach for datasets that have
> character rather than numeric data. But here's a simple approach for
> distinguishing files that have possible character headers but numeric
> data.
> 
> 
> 
> readheader <- function(filename) {
>
>   possibleheader <- read.table(filename, nrows=1, sep=",", header=FALSE)
>
>   if(all(is.numeric(possibleheader[,1]))) {
>     # no header
>     infile <- read.table(filename, sep=",", header=FALSE)
>   } else {
>     # has header
>     infile <- read.table(filename, sep=",", header=TRUE)
>   }
>
>   infile
> }
> 
> 
> 
> #### file noheader.csv ####
> 
> 1,1,1
> 2,2,2
> 3,3,3
> 
> 
> #### file hasheader.csv ####
> 
> a,b,c
> 1,1,1
> 2,2,2
> 3,3,3
> 
> ########################
> 
>> readheader("noheader.csv")
>  V1 V2 V3
> 1  1  1  1
> 2  2  2  2
> 3  3  3  3
>> readheader("hasheader.csv")
>  a b c
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
> 
> Sarah
> 
> On Tue, Aug 13, 2019 at 2:00 PM Christopher W Ryan <cryan using binghamton.edu> wrote:
>> 
>> Alas, we spend so much time and energy on data wrangling . . . .
>> 
>> I'm given a collection of csv files to work with---"found data". They arose
>> via saving Excel files to csv format. They all have the same column
>> structure, except that some were saved with column names and some were not.
>> 
>> I have a code snippet that I've used before to traverse a directory and
>> read into R all the csv files of a certain filename pattern within it, and
>> combine them all into a single dataframe:
>> 
>> library(dplyr)
>> ## specify the csv files that I will want to access
>> files.to.read <- list.files(path = "H:/EH", pattern =
>> "WICLeadLabOrdersDone.+", all.files = FALSE, full.names = TRUE, recursive =
>> FALSE, ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
>> 
>> ## function to read csv files back in
>> read.csv.files <- function(filename) {
>>    bb <- read.csv(filename, colClasses = "character", header = TRUE)
>>    bb
>> }
>> 
>> ## now read the csv files, as all character
>> b <- lapply(files.to.read, read.csv.files)
>> 
>> ddd <- bind_rows(b)
>> 
>> But this assumes that all files have column names in their first row. In
>> this case, some don't. Any advice how to handle it so that those with
>> column names and those without are read in and combined properly? The only
>> thing I've come up with so far is:
>> 
>> ## function to read csv files back in
>> ## Unfortunately, some of the csv files are saved with column headers, and
>> ## some are saved without them.
>> ## This presents a problem when defining the function to read them:
>> ## header = TRUE or header = FALSE?
>> ## The best solution I can think of as of 13 August 2019 is to use
>> ## header = FALSE and skip the first row of every file. This will
>> ## sacrifice one record from each csv of about 80 files
>> read.csv.files <- function(filename) {
>>    bb <- read.csv(filename, colClasses = "character", header = FALSE,
>>                   skip = 1)
>>    bb
>> }
>> 
>> This sacrifices about 80 out of about 1600 records. For my purposes in this
>> instance, this may be acceptable, but of course I'd rather not.
>> 
>> Thanks.
>> 
>> --Chris Ryan
>> 
>> 
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> 
> 
> -- 
> Sarah Goslee (she/her)
> http://www.numberwright.com
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com


