[R] Function read.table(…) reads in only 40% of my table's lines

Asis Hallab asis.hallab at gmail.com
Mon Aug 5 13:11:36 CEST 2013


Dear R experts,

I have a large table saved in a file called "plant_genome.gff". The
file has 481848 lines in nine TAB-delimited columns and is
53 megabytes in size.
For anyone who might know the GFF3 format: The table holds a plant
genome's annotation.

If I read in the table with
read.table( "plant_genome.gff" )
I get the following error
"line 2 did not have 12 elements".

If I read in the table with
read.table( "plant_genome.gff", sep="\t" )
no error or warning is given, but my resulting table has only 193547
instead of the expected 481848 rows! 60% of the lines are omitted.
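
One way to test whether quote or comment characters are involved is
sketched below. It relies on the documented defaults of read.table,
quote = "\"'" and comment.char = "#", and on the assumption (not
verified against the real file) that some attribute fields contain an
apostrophe or a '#'. The miniature table and its contents are made up
for illustration.

```r
# Made-up miniature stand-in for the real GFF3 file (hypothetical data).
gff_lines <- c(
  "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1;Note=5' UTR variant",
  "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1",
  "chr1\tsrc\texon\t1\t50\t0.9\t+\t.\tID=e1;Note=3' end",
  "chr1\tsrc\texon\t60\t100\t.\t+\t.\tID=e2;Note=isoform #2"
)
tmp <- tempfile(fileext = ".gff")
writeLines(gff_lines, tmp)

n_lines <- length(readLines(tmp))

# Defaults: quote = "\"'" and comment.char = "#" are active.
# Wrapped in tryCatch because the outcome on such input is not asserted here.
n_default <- tryCatch(
  nrow(read.table(tmp, sep = "\t")),
  error = function(e) NA_integer_
)

# Quote and comment handling switched off: every line becomes one row.
fixed <- read.table(tmp, sep = "\t", quote = "", comment.char = "",
                    stringsAsFactors = FALSE)

cat("lines in file:            ", n_lines, "\n")
cat("rows with default settings:", n_default, "\n")
cat("rows with quote = \"\":     ", nrow(fixed), "\n")
```

If the row count with quote = "" and comment.char = "" matches the line
count of the file, unbalanced quote characters are a likely reason for
the missing rows.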

Passing in the argument
as.is = TRUE
or setting the columns' classes with
colClasses = c( "character", …, "integer", "integer", "numeric",
"character", … )
   # columns 4 and 5 are integers, column 6 is numeric, all others
   # are characters
does not resolve the problem either.

If I read in the file with readLines and then manually split them using
strsplit(…)
and combine them into a data.frame with
as.data.frame( do.call( "rbind", splitted.lines ), colClasses=…)
I get the expected and correct data.frame, representing my GFF3 data.
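
The manual approach described above can be sketched roughly as follows.
This is a minimal illustration on made-up data, not the exact code used.
The column names follow the GFF3 column order and are chosen here for
readability. As far as I can tell, as.data.frame has no colClasses
argument, so the sketch converts the column types explicitly; treating
"." in the score column as missing is likewise an assumption.

```r
# Made-up miniature stand-in for the real GFF3 file (hypothetical data).
gff_lines <- c(
  "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1;Note=5' UTR variant",
  "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1",
  "chr1\tsrc\texon\t1\t50\t0.9\t+\t.\tID=e1;Note=3' end",
  "chr1\tsrc\texon\t60\t100\t.\t+\t.\tID=e2"
)
tmp <- tempfile(fileext = ".gff")
writeLines(gff_lines, tmp)

# Read raw lines; no quote or comment interpretation happens here.
raw_lines <- readLines(tmp)

# Split each line on literal TAB characters.
splitted.lines <- strsplit(raw_lines, "\t", fixed = TRUE)

# Bind the pieces into a character matrix, then into a data.frame.
gff <- as.data.frame(do.call("rbind", splitted.lines),
                     stringsAsFactors = FALSE)
names(gff) <- c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

# Explicit type conversion: columns 4 and 5 integer, column 6 numeric.
gff$start <- as.integer(gff$start)
gff$end   <- as.integer(gff$end)
score <- gff$score
score[score == "."] <- NA   # assumption: "." means "no score"
gff$score <- as.numeric(score)

str(gff)
```

Note that strsplit drops a trailing empty field, so lines ending in a
TAB would need extra care before rbind.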

My questions are:
1) Am I using read.table wrong, or did I miss something in the documentation?
2) Or is this a known problem with large TAB delimited tables whose
columns contain white-spaces and are not surrounded by quotes?

Unfortunately due to the unpublished nature of the plant genome I am
not allowed to give access to the GFF table that causes this problem.

Any ideas, hints, help - or comments on my stupidity having missed
something important - will be much appreciated!

Cheers!


