[R] suggestions regarding reading in a messy file

David Winsemius dwinsemius at comcast.net
Tue Jul 12 22:48:43 CEST 2011


On Jul 12, 2011, at 4:37 PM, Juliet Hannah wrote:

> I have a file in stata format, which I have read in, and I am trying
> to create a text file. I have exported the data using various
> delimiters, but I'm unable to read it back in. I originally read in
> the file with:
>
> library(foreign)
> myData <- read.dta("mydata.dta")
>
> I then exported it with write.table using comma, tab, and exclamation
> marks as a delimiter.
>
> When I was unable to read it in, I used readLines to check the number
> of fields in each row. For example, when using a comma, I checked the
> number of entries in each line using:
>
> con <- file("
> while ( length(oneLine <- readLines(con, 1)) ) {
>   lineLength <- length(strsplit(oneLine,",")[[1]])
>  cat(lineLength,"\n")
>   }
> close(con)
>
> This prints out 57 for each line.

But that does not test for unmatched quotes, extraneous "#" characters,
and such, all of which read.table() acts on by default.

Try instead:

count.fields("myfile.txt", sep=",")
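
To locate the offending lines, one option is to compare the counts that
count.fields() gives with its default quote and comment handling against
the counts with both switched off. This is only a sketch: the demo file
and the names demo_file, n_default, n_raw and suspect are made up for
illustration, and it assumes both calls return one count per line (no
blank lines, no quoted strings spanning lines).

```r
# Hypothetical demo file standing in for the real export
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "id,comment,score",
  "1,fine,10",
  "2,see item #4,20",
  "3,ok,30"
), demo_file)

# Field counts with the default quote ("\"'") and comment.char ("#"),
# i.e. the same defaults read.table() uses
n_default <- count.fields(demo_file, sep = ",")

# Field counts with quote and comment handling switched off,
# comparable to splitting each line on the separator alone
n_raw <- count.fields(demo_file, sep = ",", quote = "", comment.char = "")

# Lines where the two disagree (or the default count is NA) are suspects
suspect <- which(is.na(n_default) | n_default != n_raw)
readLines(demo_file)[suspect]
```

Lines flagged this way are the ones where a quote or "#" character
changes how the line is parsed, which is what the strsplit() check
cannot see.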

>
> But then I try:
>
> cc <- rep("character",57)
> myData <- read.table("myfile.txt",header=TRUE,sep=",",colClasses=cc)
>
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,  
> na.strings,  :
>  line 10 did not have 57 elements
>
> I'm unable to post a sample of the data so I'm just looking for
> suggestions. The data is messy, meaning some of the fields have
> comments as the survey response. Still, I was able to work with it as
> long as I read it in from the stata  file.
>
> I was trying to avoid using the 'fill' option because that has given
> me problems before.
>
> Thanks for your help.
>
> Juliet
>
>> sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
> States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>                      LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] foreign_0.8-43
>
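
If the free-text comments do turn out to contain quotes or "#", a
round trip along the following lines may avoid the need for 'fill'.
It is a sketch under assumptions, not a tested recipe for the real
data: the data frame survey is a made-up stand-in for what read.dta()
returned, and out_file and back are illustrative names.

```r
# Hypothetical stand-in for the data frame returned by read.dta()
survey <- data.frame(
  id = c("1", "2", "3"),
  comment = c("don't know", "see item #4", "said \"maybe\", then left"),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".csv")

# Quote character fields and double any embedded double quotes
write.table(survey, out_file, sep = ",", row.names = FALSE,
            qmethod = "double")

# Read back treating only the double quote as a quote character
# and nothing as a comment character
back <- read.table(out_file, header = TRUE, sep = ",", quote = "\"",
                   comment.char = "", colClasses = "character")

# Check that the comment column survived the round trip unchanged
identical(back$comment, survey$comment)
```

The choices that matter here are qmethod = "double" on the way out and
quote = "\"" with comment.char = "" on the way in, so that apostrophes,
"#" and embedded commas inside a quoted field are read as plain text.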

David Winsemius, MD
West Hartford, CT


