[R] Function read.table(…) reads in only 40% of my table's lines
David Winsemius
dwinsemius at comcast.net
Mon Aug 5 18:20:16 CEST 2013
On Aug 5, 2013, at 4:11 AM, Asis Hallab wrote:
> Dear R experts,
>
> I have a large table saved in a file called "plant_genome.gff". The
> file has 481848 lines in nine columns, which are TAB delimited, and is
> 53 MegaBytes large.
> For anyone who might know the GFF3 format: The table holds a plant
> genome's annotation.
>
> If I read in the table with
> read.table( "plant_genome.gff" )
> I get the following error
> "line 2 did not have 12 elements".
>
> If I read in the table with
> read.table( "plant_genome.gff", sep="\t" )
> no error or warning is given, but my resulting table has only 193547
> instead of the expected 481848 rows! 60% of the lines are omitted.
>
> Also passing in the arguments
> as.is = TRUE
> or setting the columns' classes with
> colClasses = c( "character", …, "integer", "integer", "numeric",
> "character", … )
> # columns 4, and 5 are integers, column 6 is numeric, all others
> are characters
> does not resolve the problem.
>
> If I read in the file with readLines and then manually split them using
> strsplit(…)
That doesn't unambiguously define the process.
> and combine them into a data.frame with
> as.data.frame( do.call( "rbind", splitted.lines ), colClasses=…)
> I get the expected and correct data.frame, representing my GFF3 data.
>
> My questions are:
> 1) Am I using read.table wrong, or did I miss something in the documentation?
> 2) Or is this a known problem with large TAB delimited tables, whose
> columns contain white-spaces and are not surrounded by quotes?
I would think this is not "a known problem" but rather "entirely expected and documented behavior". The read.table function uses white-space as its default separation rule. By default it also treats both ' and " as quote characters and # as a comment character, which matters for free-text columns that are not surrounded by quotes. The size of the file has nothing to do with it. You would get the same problem with a very small example. If you want tab-separation then use read.delim, which has sep="\t" as its default.
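
A minimal sketch of a more forgiving call follows. It uses a hypothetical three-line sample, since the actual file is not shown in this thread, and the object names (sample_lines, gff_path, gff_cols, gff) are made up for illustration. It assumes the missing rows come from apostrophes in the last column being read as opening quotes, which nothing in the thread confirms. The column names are taken from the GFF3 specification, not from the poster's file.

```r
# Hypothetical tab-delimited sample standing in for the poster's file.
# Column 9 holds free text containing apostrophes.
sample_lines <- c(
  "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1;Note=5' UTR",
  "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Note=plain text",
  "chr1\tsrc\texon\t1\t50\t.\t+\t.\tID=e1;Note=3' end"
)
gff_path <- tempfile(fileext = ".gff")
writeLines(sample_lines, gff_path)

# Column names as given in the GFF3 specification (an assumption here).
gff_cols <- c("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")

gff <- read.delim(
  gff_path,
  header = FALSE,           # read.delim defaults to header = TRUE
  quote = "",               # treat ' and " as ordinary characters
  comment.char = "",        # read.delim default: # is not a comment marker
  col.names = gff_cols,
  stringsAsFactors = FALSE  # keep text columns as character
)

# Sanity check: one data-frame row per physical line in the file.
stopifnot(nrow(gff) == length(readLines(gff_path)))
```

The same quote and comment.char arguments can be passed to read.table( ..., sep="\t" ). Comparing nrow() with length(readLines()) on the real file is a quick way to see whether any lines are still being merged or dropped.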
--
David Winsemius
Alameda, CA, USA