[R] Read data with different column lengths
Hofert Marius
m_hofert at web.de
Wed Mar 7 22:06:03 CET 2007
Dear r-help users,
I have the following simple problem: Reading data from a file. The
file is a .txt file exported ("save as...") from Excel (see below for
an example). The Excel file consists of two header rows (first row
consists of ticker symbols of stocks, the second row consists of
column explanations ("Date","Px Last"), followed by several rows of
data. Now forget about the first two rows, I can deal with that (read
separately, then extract the actual ticker symbols "ADS",
"ALV", ...). For reading the rest, I tried several things, for example:
data=read.table(infile,quote="",fill=T,dec=",",skip=2,colClasses=rep(c
("character","numeric"),ntickers))
or
data=matrix(scan(file=infile,what=rep(c
("character","numeric"),ntickers),dec=",",skip=2),ncol=2*ntickers,byrow=
T)
where "infile" specifies the path to the input file and "ntickers" is
the number of ticker-columns in the data set, so in the example
below, ntickers=2.
Both ways of reading the data work perfectly fine if all columns have
the same length (i.e. the same number of filled rows), so if the data
is given in a (filled) "rectangular" form. Now, as you can imagine,
there are days when one stock is traded but not the other... so,
there might be columns that do not have the same number of filled
rows (see below, for the stock with ticker symbol "ADS", only 3
trading days are shown, so this column is shorter than the data
column for the stock "ALV"). Now, if I export such a structure to
a .txt file, then all (by default) blank fields will be replaced by
"\t", i.e. tabs. Both reading procedures as give above have problems
as they either display that the number of rows/columns do not fit
together or as they read the table, but some cells are shifted to the
left (for the example below, the entry "07/02/05 134,7" appears in
the empty field of the stock "ADS" which is of course not what we want).
So the simple question is: How do I read such a structure?
Can there be a simple solution? The problem is simply that empty
cells are replace by "\t" which are then ignored for reading. So how
do we distinguish between the empty cells that are given between the
columns and the empty cells that actually "fill" a column to have the
same length as other columns. Of course I could manually put in a
certain character (e.g. a "*") to fill in the gaps, but the data set
is simply too large. If it helps, these blank fields only appear in
the end of each column, not in the middle.
As I work on a Mac (OS X 10.4), it was not possible (at least to me)
to read the data directly from the Excel file vial the library RODBC
or read.xls.
Note, that the same problem arises, when I export the Excel file as
a .csv, then all blank fields are separated by ";" instead of "\t"
and the reading procedure can also not decide if the field
corresponds to an empty separating column or actually to a column
with given entries, but which is simply not as long as another column
in the file.
Hope, you can help. I would really appreciate it.
Best regards.
Marius
Excel example (I hope it's displayed correctly, the entry in the last
row should be aligned with the last column):
ADS GY Equity ALV GY Equity
Date Px Last Date Px Last
07/02/04 41,395 07/01/31 130,234
07/02/05 42,134 07/02/01 133,353
07/02/06 41,875 07/02/04 133,824
07/02/05 134,734
More information about the R-help
mailing list