[R] reading partial data set

Peter Langfelder peter.langfelder at gmail.com
Wed Dec 7 20:26:25 CET 2011


On Wed, Dec 7, 2011 at 6:52 AM, bcdc <bia.cdc at gmail.com> wrote:
> Hi all,
>
> I'm trying to read a data set into R, but the file is messy, so I have to do
> it partially. The whole data is in a .txt file, and the values are separated
> by a space. So far OK. The problem is that in this file, not all the lines
> have the same number of elements, so the reading stops and I lose
> what was read from the previous lines.
>
> ex. of data set:
> 11   12   13
> 21   22   23
> 31   32   33   34
> 41   42   43   44
> 51   52   53
> 61   62   63   64
> 71 72 73 74 75
> 81   82
> (...)
>
> If I use the following:
>
>> aux <- read.table(file="data.txt", sep=" ", header=FALSE,
>> colClasses="numeric")
>
> it stops the reading with the error message:
>
>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>>   line 3 did not have 3 elements
>> Calls: read.table -> scan
>
> and I lose what was read from the previous lines. And since I'm running my
> data on a cluster (it's actually a big data set), the error halts my
> execution.
>
> What I tried at first was to do
>
>> aux1 <- read.table(file="data.txt", sep=" ", header=FALSE,
>> colClasses="numeric", nrows=2)
>> aux2 <- read.table(file="data.txt", sep=" ", header=FALSE,
>> colClasses="numeric", skip=2, nrows=2)
>> aux3 <- read.table(file="data.txt", sep=" ", header=FALSE,
>> colClasses="numeric", skip=4, nrows=1)
>> (...)
>
> This procedure works. However, I have about 5000 lines to read, and I don't
> know precisely which ones are messy. So what I have to do, to keep the above
> procedure is:
>
> 1. try to read data set
> 2. check error message to find out which line has different size
> 3. read data set for the block of same sized lines (aux1)
> 4. read data set skipping the lines read in aux1;
>   check error message to find out which line has different size
> 5. read data set for second block of same sized lines (aux2)
> 6. read data set skipping the lines read in aux1 and aux2;
>   check error message to find out which line has different size
> (and so on)
>
> So, if I had only a hundred lines, this would be OK, but I have a few
> thousand, and it'll take me forever to finish reading if I need to read
> block by block and check manually where the problem is.
>
> My question is: is there any way I can read my data with some "if's" or
> "while's" to control the read.table? What I'd like to do is something like:
>
> 1. read data set while all lines have the same length
> 2. if a line has a different length from the previous ones, store what was
> read in a variable and abort reading
> 3. start reading data set from the line where it stopped, and read it while
> all lines have the same length
> 4. if a line has a different length from the previous ones, store what was
> read in a variable and abort reading
> 5. start reading data set from the line where it stopped, and read it while
> all lines have the same length
> 6. if a line has a different length from the previous ones, store what was
> read in a variable and abort reading
> (and so on until the whole data set has finally been read)
>
> This would make the program run by itself, and solve my problem. It's OK if
> it returns a couple of variables, I can just bind them and assemble my data
> set as I need, since I know what it should look like in the end.
>
> Thanks in advance for suggestions!
> Beatriz
>

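I think you are making this too complicated. Look at the help file for
read.table, particularly the argument fill (whose default AFAICS is
FALSE). If you set it to TRUE, the function will automatically fill
missing entries with NAs. Note that read.table works out the number of
columns from the first five lines only, so if a later line is longer
than those, also pass col.names with enough names for the longest line
(count.fields can tell you how many fields each line has). After
reading the file you can decide what to do with lines that have some
missing values.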
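
A minimal sketch of the fill approach, assuming the file is separated by
runs of whitespace (so the default sep = "" is used rather than sep = " ").
The temporary file and the names n_fields, max_fields and short_rows are
only illustrative; the sample lines mirror the example in the question.

```r
# Write a small sample file that mirrors the example in the question.
path <- tempfile(fileext = ".txt")
writeLines(c("11 12 13",
             "21 22 23",
             "31 32 33 34",
             "41 42 43 44",
             "51 52 53",
             "61 62 63 64",
             "71 72 73 74 75",
             "81 82"), path)

# count.fields() reports the number of fields found on each line.
n_fields <- count.fields(path, sep = "")
max_fields <- max(n_fields)

# read.table() guesses the column count from the first five lines only,
# so explicit col.names are supplied to cover the widest line.
aux <- read.table(path, header = FALSE, sep = "", fill = TRUE,
                  col.names = paste0("V", seq_len(max_fields)),
                  colClasses = "numeric")

# Rows that came from lines shorter than the widest one (padded with NA).
short_rows <- which(n_fields < max_fields)
```

With this, aux has one row per input line, and short_rows gives the
indices of the lines that were padded, which is a starting point for
deciding whether to drop, inspect, or keep them.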

HTH,

Peter
