[R] Way to handle variable length and numbers of columns using read.table(...)

Jason Rupert jasonkrupert at yahoo.com
Tue May 5 05:29:02 CEST 2009


Jim, 

You guessed it.  There are other "problems" with the data.  Here is a closer representation of the data:
Total time and location 
are listed below.

Time Loc1 Loc2
---------------
1 22.33 44.55
2 66.77 88.99
3 222.33344.55
4 66.77 88.99

Avg. Loc1 = 77.88
Avg. Loc2 = 55.66
Final Time = 4

Right now I am using "nrows" in order to only read Time 1-4 & "skip" to skip over the unusable header info, e.g.

read.table('clipboard', header=FALSE, fill=TRUE, skip=5, nrows=4)

Unfortunately, sometimes the number of "Time" rows varies, so I need to also account for that.  
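
One possible way to cope with a varying number of "Time" rows is to read every line as text, keep only the lines that look like data, and hand just those to read.table. The sketch below is a minimal illustration, not a tested solution for the real files: the `raw` vector is a hypothetical copy of the sample above (in practice it would come from readLines() on the file or clipboard), and it assumes that only data rows start with an integer followed by whitespace. textConnection() is used so the sketch does not rely on the newer text= argument of read.table.

```r
# Hypothetical sample mirroring the layout shown above; in practice these
# lines would come from readLines() on the real file or the clipboard.
raw <- c(
  "Total time and location",
  "are listed below.",
  "",
  "Time Loc1 Loc2",
  "---------------",
  "1 22.33 44.55",
  "2 66.77 88.99",
  "3 222.33344.55",
  "4 66.77 88.99",
  "",
  "Avg. Loc1 = 77.88",
  "Avg. Loc2 = 55.66",
  "Final Time = 4"
)

# Assumption: every data row starts with an integer time stamp followed by
# whitespace, and no header or summary line does.
is_data <- grepl("^[0-9]+[[:space:]]", raw)
data_lines <- raw[is_data]

# Parse only the selected lines; fill = TRUE pads the short row with NA,
# so neither skip nor nrows has to be known in advance.
con <- textConnection(data_lines)
x <- read.table(con, header = FALSE, fill = TRUE,
                col.names = c("Time", "Loc1", "Loc2"),
                stringsAsFactors = FALSE)
close(con)
```

With this approach the fused row still arrives with Loc1 as a character string and Loc2 as NA, so it has to be repaired separately.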

Maybe I need to look into what Gabor suggested as well, i.e. library(gsubfn)

Thanks again for any feedback and advice on this one. The data I receive is out of my control, but I am also working to get them to fix it.



--- On Mon, 5/4/09, jim holtman <jholtman at gmail.com> wrote:

> From: jim holtman <jholtman at gmail.com>
> Subject: Re: [R] Way to handle variable length and numbers of columns using  read.table(...)
> To: jasonkrupert at yahoo.com
> Cc: R-help at r-project.org
> Date: Monday, May 4, 2009, 9:47 PM
> Well if you read in your data, you get:
> 
> > x <- read.table('clipboard', header=TRUE, fill=TRUE)
> Warning message:
> In read.table("clipboard", header = TRUE, fill = TRUE) :
>   incomplete final line found by readTableHeader on 'clipboard'
> > x
>   Time         Loc1  Loc2
> 1    1        22.33 44.55
> 2    2        66.77 88.99
> 3    3 222.33344.55    NA
> 4    4        66.77 88.99
> > str(x)
> 'data.frame':   4 obs. of  3 variables:
>  $ Time: int  1 2 3 4
>  $ Loc1: Factor w/ 3 levels "22.33","222.33344.55",..: 1 3 2 3
>  $ Loc2: num  44.5 89 NA 89
> >

As you can see, the value that has two decimal points is read in as a character string, which causes the whole column to be converted to a factor.  It appears that you have some fixed-length fields that are overflowing.  You could read in the data and parse it with regular expressions; you just have to match on the first part having two decimal places and then extract the rest.  The question is, is this the only "problem" you have in the data?  If so, parsing it is not hard.
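
A rough sketch of that regular-expression idea in base R follows. It rests on an assumption the thread does not confirm: that every Loc value is printed with exactly two decimal places, so "222.33344.55" is really 222.33 followed by 344.55. The `rows` vector is a hypothetical stand-in for the data lines already read in as text.

```r
# Hypothetical data rows as character strings, one of them with fused fields.
rows <- c("1 22.33 44.55", "2 66.77 88.99", "3 222.33344.55", "4 66.77 88.99")

# Assumption: a number ends right after the second digit following its
# decimal point. Insert a space wherever such a number runs straight into
# another digit, which separates the overflowing fields again.
fixed <- gsub("([0-9]+\\.[0-9]{2})([0-9])", "\\1 \\2", rows)

# The repaired lines now parse as three numeric columns.
con <- textConnection(fixed)
x <- read.table(con, header = FALSE, col.names = c("Time", "Loc1", "Loc2"))
close(con)
```

If the real data can contain values with a different number of decimal places, the pattern would split them in the wrong place, so it is worth checking a few repaired rows by eye.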
 
> On Mon, May 4, 2009 at 10:20 PM, Jason Rupert <jasonkrupert at yahoo.com> wrote:
> 
> >
> > I've got read.table to successfully read in my table of three columns.
> > Most of the time I will have a set number of rows, but sometimes that will
> > be variable, and sometimes there will be only two variables in one row,
> > e.g.
> >
> > Time Loc1 Loc2
> > 1 22.33 44.55
> > 2 66.77 88.99
> > 3 222.33344.55
> > 4 66.77 88.99
> >
> > Is there any way to have read.table handle (1) a variable number of rows,
> > and (2) rows that sometimes have only two variables, as shown in Time = 3 above?
> >
> > Just curious about how to handle this, and if read.table is the right way
> > to go about it or if I should read in all the data and then try to parse it out
> > as best I can.
> >
> > Thanks again.
> >
> > > R.version
> >               _
> > platform       i386-apple-darwin8.11.1
> > arch           i386
> > os             darwin8.11.1
> > system         i386, darwin8.11.1
> > status
> > major          2
> > minor          8.0
> > year           2008
> > month          10
> > day            20
> > svn rev        46754
> > language       R
> > version.string R version 2.8.0 (2008-10-20)
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 
> 
> 
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
> 
> What is the problem that you are trying to solve?



