[R] reading partial data set

bcdc bia.cdc at gmail.com
Wed Dec 7 15:52:09 CET 2011


Hi all,

I'm trying to read a data set into R, but the file is messy, so I have to do
it in pieces. The whole data set is in a .txt file, and the values are
separated by spaces. So far so good. The problem is that not all lines in
this file have the same number of elements, so the reading stops and I lose
what was read from the previous lines.

ex. of data set:
11   12   13
21   22   23
31   32   33   34
41   42   43   44
51   52   53
61   62   63   64
71   72   73   74   75
81   82
(...)

If I use the following:

> aux <- read.table(file="data.txt", sep=" ", header=F,
> colClasses="numeric")

it stops the reading with the error message:

> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   line 3 did not have 3 elements
> Calls: read.table -> scan

and I lose everything that was read from the previous lines. And since I'm
running this on a cluster (it's actually a big data set), the error halts my
execution.
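A minimal sketch of how the error could at least be kept from halting a
batch job, assuming it is acceptable to get NULL back when the read fails.
The temporary file below is a hypothetical stand-in for data.txt, and the
sketch does not recover the lines read before the failure.

```r
# Hypothetical stand-in for data.txt: the third line is longer than the others.
path <- tempfile(fileext = ".txt")
writeLines(c("11 12 13", "21 22 23", "31 32 33 34"), path)

# tryCatch() turns the read.table error into a message and a NULL result,
# so the rest of the script keeps running.
aux <- tryCatch(
  read.table(file = path, sep = " ", header = FALSE, colClasses = "numeric"),
  error = function(e) {
    message("read.table failed: ", conditionMessage(e))
    NULL
  }
)

if (is.null(aux)) message("No data read; continuing with the rest of the job.")
```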

What I tried at first was to do

> aux1 <- read.table(file="data.txt", sep=" ", header=F,
> colClasses="numeric", nrow=2)
> aux2 <- read.table(file="data.txt", sep=" ", header=F,
> colClasses="numeric", skip=2, nrow=2)
> aux3 <- read.table(file="data.txt", sep=" ", header=F,
> colClasses="numeric", skip=4, nrow=1)
> (...)

This procedure works. However, I have about 5000 lines to read, and I don't
know precisely which ones are messy. So to keep using the above procedure,
what I have to do is:

1. try to read data set
2. check error message to find out which line has different size
3. read data set for the block of same sized lines (aux1)
4. read data set skipping the lines read in aux1;
   check error message to find out which line has different size
5. read data set for second block of same sized lines (aux2)
6. read data set skipping the lines read in aux1 and aux2;
   check error message to find out which line has different size
(and so on)
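The "check error message" steps above could perhaps be replaced by counting
the fields on every line up front. A rough sketch, assuming the values are
separated by whitespace; the temporary file is a hypothetical stand-in for
data.txt built from the example rows, and the name `blocks` is only for
illustration.

```r
# Hypothetical stand-in for data.txt, using the example rows from this post.
path <- tempfile(fileext = ".txt")
writeLines(c("11 12 13", "21 22 23", "31 32 33 34", "41 42 43 44",
             "51 52 53", "61 62 63 64", "71 72 73 74 75", "81 82"), path)

# count.fields() returns the number of fields on each line.
# Note: blank lines are skipped by default, which would shift line numbers.
n_fields <- count.fields(path, sep = "")

# Run-length encode the counts to get consecutive blocks of same-sized lines.
runs <- rle(n_fields)
blocks <- data.frame(
  first_line = cumsum(runs$lengths) - runs$lengths + 1,  # where each block starts
  n_lines    = runs$lengths,                             # value for nrows=
  n_fields   = runs$values                               # fields per line
)
print(blocks)
```

Each row of `blocks` would then give the skip (first_line - 1) and nrows
values for one read.table call.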

So, if I had only a hundred lines, this would be OK, but I have a few
thousand, and it'll take me forever to finish reading if I need to read
block by block and check manually where the problem is.

My question is: is there any way I can read my data with some "if"s or
"while"s to control read.table? What I'd like to do is something like:

1. read data set while all lines have the same length
2. if a line has a different length from the previous ones, store what was
read in a variable and abort reading
3. start reading data set from the line where it stopped, and read it while
all lines have the same length
4. if a line has a different length from the previous ones, store what was
read in a variable and abort reading
5. start reading data set from the line where it stopped, and read it while
all lines have the same length
6. if a line has a different length from the previous ones, store what was
read in a variable and abort reading
(and so on until the whole data set has finally been read)
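The steps above could be sketched roughly as follows. This is only one
possible way to do it, assuming whitespace-separated values and a file small
enough to hold in memory with readLines; it groups the lines first instead
of aborting and restarting read.table. The temporary file is a hypothetical
stand-in for data.txt, and `block_id` and `blocks` are illustrative names.

```r
# Hypothetical stand-in for data.txt, using the example rows from this post.
path <- tempfile(fileext = ".txt")
writeLines(c("11 12 13", "21 22 23", "31 32 33 34", "41 42 43 44",
             "51 52 53", "61 62 63 64", "71 72 73 74 75", "81 82"), path)

# Read the raw lines and drop blank ones, so line and field counts stay aligned.
lines <- readLines(path)
lines <- lines[nzchar(trimws(lines))]

# Number of whitespace-separated fields on each line.
n_fields <- lengths(strsplit(trimws(lines), "[[:space:]]+"))

# Start a new block whenever the field count differs from the previous line.
block_id <- cumsum(c(TRUE, diff(n_fields) != 0))

# Parse each block of same-sized lines separately; the result is a list with
# one data frame per block, in file order.
blocks <- lapply(split(lines, block_id), function(chunk) {
  read.table(text = chunk, header = FALSE, colClasses = "numeric")
})

str(blocks)
```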

This would make the program run by itself, and solve my problem. It's OK if
it returns several variables; I can just bind them and assemble my data
set as I need, since I know what it should look like in the end.
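As an illustration of the binding step, here is a small sketch that pads
blocks of different widths with NA on the right and stacks them. Padding on
the right is an assumption; the real layout may need the columns placed
differently. The two data frames and the helper `pad` are hypothetical.

```r
# Two hypothetical blocks of different widths, as read.table would name them.
blocks <- list(
  data.frame(V1 = c(11, 21), V2 = c(12, 22), V3 = c(13, 23)),
  data.frame(V1 = c(31, 41), V2 = c(32, 42), V3 = c(33, 43), V4 = c(34, 44))
)

# Widest block decides the number of columns in the assembled data set.
max_cols <- max(vapply(blocks, ncol, integer(1)))

# Add NA columns on the right until a block has max_cols columns.
pad <- function(df) {
  for (j in seq_len(max_cols)[-seq_len(ncol(df))]) {
    df[[paste0("V", j)]] <- NA_real_
  }
  df
}

# All padded blocks share the same column names, so rbind can stack them.
full <- do.call(rbind, lapply(blocks, pad))
print(full)
```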

Thanks in advance for suggestions!
Beatriz
