[R] suggestions regarding reading in a messy file
Juliet Hannah
juliet.hannah at gmail.com
Tue Jul 12 22:37:04 CEST 2011
I have a file in Stata format, which I have read into R, and I am
trying to create a text file from it. I have exported the data using
various delimiters, but I am unable to read the result back in. I
originally read in the file with:
library(foreign)
myData <- read.dta("mydata.dta")
I then exported it with write.table, trying a comma, a tab, and an
exclamation mark in turn as the delimiter.
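A minimal sketch of such a round trip, on toy data rather than the survey file (the data frame `toy`, its values, and the temporary file are hypothetical). It assumes the free-text fields may contain apostrophes, `#` characters, and embedded double quotes, and it writes and reads with matching quoting conventions:

```r
# Toy data standing in for the survey file (hypothetical values).
toy <- data.frame(
  id      = c("1", "2", "3"),
  comment = c("fine, thanks", "don't know", "item #3 was \"unclear\""),
  stringsAsFactors = FALSE
)

tmp <- tempfile(fileext = ".txt")

# Quote every character field and double any embedded double quotes,
# instead of the default backslash escaping.
write.table(toy, tmp, sep = "\t", quote = TRUE, qmethod = "double",
            row.names = FALSE)

# Read back with the same conventions: only " is a quote character,
# and # is not treated as the start of a comment.
back <- read.table(tmp, header = TRUE, sep = "\t", quote = "\"",
                   comment.char = "", colClasses = "character")

# Check that the free-text column survived the round trip.
identical(toy$comment, back$comment)
```

The point of the sketch is only that the `quote`, `qmethod`, and `comment.char` settings on the two sides need to agree; whether that is what goes wrong with the real file is an open question.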
When I was unable to read it back in, I used readLines to check the
number of fields in each row. For example, for the comma-delimited
file, I counted the entries in each line using:
con <- file("myfile.txt", "r")
while ( length(oneLine <- readLines(con, 1)) ) {
  lineLength <- length(strsplit(oneLine, ",")[[1]])
  cat(lineLength, "\n")
}
close(con)
This prints out 57 for each line.
But then I try:
cc <- rep("character",57)
myData <- read.table("myfile.txt",header=TRUE,sep=",",colClasses=cc)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 10 did not have 57 elements
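One way to narrow this down, sketched on a hypothetical three-line file: strsplit counts raw delimiters, whereas count.fields parses a line the way read.table does, including its quote and comment-character handling. The file contents below are invented, and the `#` is only one possible cause of such a mismatch:

```r
# Hypothetical file: a header plus two rows, three fields per line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id,comment,score",
             "1,no comment,5",
             "2,see item #3,4"), tmp)

# Splitting on the delimiter ignores quoting and comment characters.
n_split <- sapply(strsplit(readLines(tmp), ","), length)

# count.fields() with its defaults: " and ' are quote characters and
# # starts a comment, as in read.table().
n_default <- count.fields(tmp, sep = ",")

# The same file with quoting and comment handling switched off.
n_plain <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

# Lines where the two counts disagree (or are NA) are the ones to inspect.
which(is.na(n_default) | n_default != n_plain)
```

If a real file shows the same pattern, passing the matching `quote` and `comment.char` values to read.table would be the next thing to try.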
I'm unable to post a sample of the data, so I'm just looking for
suggestions. The data are messy, meaning that some of the fields
contain free-text comments as the survey response. Still, I was able
to work with the data as long as I read them in from the Stata file.
I was trying to avoid using the 'fill' option because that has given
me problems before.
Thanks for your help.
Juliet
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
LC_CTYPE=English_United States.1252
LC_MONETARY=English_United States.1252
LC_NUMERIC=C
LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] foreign_0.8-43