[R] importing files, columns "invade next column"

Tiago R Magalhaes tiago17 at socrates.Berkeley.EDU
Thu Jan 20 02:12:31 CET 2005


Thanks again Marc for your help.

At this point I already have the whole file as a data.frame in R (via 
an S-Plus dump and then R's source()), so this specific problem is 
solved for me.

I had changed my file in Excel and thought everything was fine, but 
apparently it wasn't. What program can display a tab-separated file 
in columns without corrupting the data?

I tried again from the initial file, and a very simple
x <- read.table('file.txt', header=T, sep='\t')
works fine. The sep='\t' is very important: otherwise the columns are 
imported in the wrong places when there are empty cells next to them.
I would again suggest advising people to use sep='\t' for tab-delimited 
files in the help page for read.table.
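As a concrete illustration of why sep='\t' matters (a self-contained sketch on a throwaway temporary file, not the original data): with the default whitespace separator a run of tabs around an empty cell collapses into one separator, so the row comes up short, while sep='\t' keeps the empty cell and reads it as NA.

```r
# Sketch: a tiny tab-delimited file whose data row has an empty middle cell
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t\t3"), tmp)

x1 <- read.table(tmp, header = TRUE, sep = "\t")         # empty cell -> NA
x2 <- try(read.table(tmp, header = TRUE), silent = TRUE)
# default sep = "" treats the two adjacent tabs as one separator, so the
# data row has only 2 fields against 3 header names -> an error
```

With sep = "\t" the data.frame keeps all three columns and the missing value; with the default separator the read fails outright.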

##

If anyone is interested in a detailed history of the problem:

I had generated my initial file by exporting from S-Plus 6.1 on 
Windows 2000 as a tab-delimited file.

I tried to open the file in R and it didn't work, so I opened the file 
in Excel and substituted NA for the empty cells, then saved it as a 
tab-delimited txt file. From that file I could read only 9543 lines 
instead of the 15797 the file actually has. The file was probably 
corrupted by the round trip through Excel, so I guess the lesson is: 
don't do this in Excel.
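One common cause of this "fewer rows than expected" symptom, offered only as a guess rather than a diagnosis of this particular file: an unmatched quote character (e.g. an apostrophe in a text field) makes read.table treat everything up to the next quote as one field, swallowing line breaks along the way. Disabling quote processing with quote = "" avoids it. A sketch on made-up data:

```r
# Sketch: apostrophes in a character column trip the default quote handling
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb", "1\tit's", "2\tfine", "3\tok's"), tmp)

xd <- try(read.table(tmp, header = TRUE, sep = "\t"), silent = TRUE)
# default quote = "\"'": the apostrophes open/close quotes, so rows get
# merged (or the read fails) and fewer than 3 rows come back
xq <- read.table(tmp, header = TRUE, sep = "\t", quote = "")
# quote = "" turns quoting off entirely: all 3 rows survive intact
```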

I went back to Splus, exported a new tab delimited file and tried again:

x <- read.table('file.txt', header=T, sep='\t') #works fine

x <- read.table('file.txt', header=T) #gives an error
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
	line 1 did not have 194 elements

x <- read.table('file.txt', header=T, fill=T) #wrong: values shift into 
the wrong columns, padded with NA

x <- read.table('file.txt', header=T, fill=T, sep='\t') #works fine
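Before reaching for fill=T, it can help to locate the ragged lines directly. A sketch (on made-up data, not the original file) using count.fields(), which reports the number of fields on each line so that lines disagreeing with the header stand out:

```r
# Sketch: find lines whose field count differs from the header's
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5", "6\t7\t8"), tmp)  # line 3 is short

n   <- count.fields(tmp, sep = "\t")  # fields per line: 3 3 2 3
bad <- which(n != n[1])               # lines that don't match the header

# fill = TRUE then pads the short lines with NA instead of erroring
x <- read.table(tmp, header = TRUE, sep = "\t", fill = TRUE)
```

Here `bad` flags line 3, and after fill = TRUE the short row's missing field comes back as NA.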


>On Wed, 2005-01-19 at 19:28 +0000, Tiago R Magalhaes wrote:
>>  Thanks very much Marc and Prof Ripley
>>
>>  a) using sep='\t' when using read.table() helps somewhat
>>
>>  there is still a problem: I cannot get all the lines:
>>  df <- read.table('file.txt', fill=T, header=T, sep='\t')
>>  dim(df)
>>    9543  195
>>
>>  while with the shorter file (11 cols) I get all the rows
>>  dim(df)
>>    15797    11
>>
>>  I have looked at row 9544 where the file seems to stop reading, but I
>>  cannot see in any of the cols an obvious reason for this to happen.
>>  Any ideas why? Maybe there is one col that is stopping the reading
>>  process and that column is not one of the 11 that are present in the
>>  smaller file.
>>
>>  b) fill=T is necessary
>>  without fill=T, I get an error:
>>  "line 1892 did not have 195 elements"
>
>Tiago,
>
>How was this data file generated? Is it a raw file created by some other
>application or was it an ASCII export, perhaps from a spreadsheet or
>database program?
>
>It seems that there is something inconsistent in the large data file,
>which is either by design or perhaps the result of being corrupted by a
>poor export.
>
>It may be helpful to know how the file was generated in the effort to
>assist you.
>
>>  c) help page for read.table
>>  I reread the help file for read.table and I would suggest to change
>>  it. From what I think I am reading, the '\t' would not be needed to
>>  work in my file, but it actually is: from the help page:
>>
>>    If 'sep = ""' (the default for 'read.table') the separator is "white
>>  space", that is one or more spaces, tabs or newlines.
>
>Under normal circumstances, this should not be a problem, but given the
>unknowns about your file, it leaves an open question as to the etiology
>of the incorrect import.
>
>Marc



