[R] suggestions regarding reading in a messy file

Juliet Hannah juliet.hannah at gmail.com
Tue Jul 12 22:37:04 CEST 2011


I have a file in Stata format, which I have read in, and I am trying
to create a text file from it. I have exported the data using various
delimiters, but I'm unable to read it back in. I originally read in
the file with:

library(foreign)
myData <- read.dta("mydata.dta")

I then exported it with write.table, trying comma, tab, and exclamation
mark in turn as the delimiter.

When I was unable to read it back in, I used readLines to check the
number of fields in each row. For example, when using a comma, I checked
the number of entries in each line using:

con <- file("myfile.txt", "r")
while (length(oneLine <- readLines(con, 1))) {
  lineLength <- length(strsplit(oneLine, ",")[[1]])
  cat(lineLength, "\n")
}
close(con)

This prints out 57 for each line.
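
A cross-check that may be worth running alongside the loop above is count.fields() from utils. As far as I understand it, count.fields() applies the same quote and comment-character rules that read.table uses by default, whereas strsplit() just splits on every comma, so the two can disagree on the same line. The sketch below uses a small made-up file (not my data; the column names and values are hypothetical) with a '#' inside a free-text field:

```r
# Hypothetical three-line file standing in for the real export
tmp <- tempfile()
writeLines(c("id,answer,score",
             "1,fine,3",
             "2,see item #3 above,4"), tmp)

# Naive count: split every line on commas, ignoring quotes and comments
naive_counts <- sapply(strsplit(readLines(tmp), ",", fixed = TRUE), length)

# Count with the default quote and comment.char settings
default_counts <- count.fields(tmp, sep = ",")

# Count with quote and comment handling switched off
plain_counts <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

print(naive_counts)
print(default_counts)
print(plain_counts)
```

If default_counts differs from naive_counts on some line, that line is a candidate for containing a quote or comment character. This is only one possible cause; embedded newlines or delimiters inside quoted fields could produce a similar mismatch.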

But then I try:

cc <- rep("character", 57)
myData <- read.table("myfile.txt", header = TRUE, sep = ",", colClasses = cc)

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 10 did not have 57 elements
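
One explanation I have not ruled out is that read.table's defaults (quote = "\"'" and comment.char = "#") are being triggered by apostrophes or '#' characters in the free-text responses. A minimal sketch of reading with both switched off, on a made-up file (hypothetical column names and values, and only 3 columns rather than 57), would be:

```r
# Hypothetical file with an apostrophe and a '#' inside a free-text field
tmp <- tempfile()
writeLines(c("id,answer,score",
             "1,fine,3",
             "2,don't know # see notes,4"), tmp)

# Read everything as character, treating ' " and # as ordinary characters
cc <- rep("character", 3)
demo <- read.table(tmp, header = TRUE, sep = ",", colClasses = cc,
                   quote = "", comment.char = "")

str(demo)
```

This only works if the export itself was written without quoting (for example write.table(..., quote = FALSE)); with quoted fields, quote = "" would leave the quote marks in the values.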

I'm unable to post a sample of the data, so I'm just looking for
suggestions. The data is messy, meaning that some of the fields contain
free-text comments as the survey response. Still, I was able to work
with it as long as I read it in from the Stata file.

I was trying to avoid using the 'fill' option because that has given
me problems before.

Thanks for your help.

Juliet

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] foreign_0.8-43


