[R] Why do I have a column called row.names?

Marc Schwartz marc_schwartz at me.com
Mon Jun 4 21:12:39 CEST 2012


To jump into the fray, he really needs to read the Details section of ?read.table and, arguably, the source code for read.table().

It is not that the resultant data frame has row names, but that an additional first *column name* called 'row.names' is created, which does not exist in the source data.

The Details section has:

If row.names is not specified and the header line has one less entry than the number of columns, the first column is taken to be the row names. This allows data frames to be read in from the format in which they are printed. If row.names is specified and does not refer to the first column, that column is discarded from such files.

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).


In the source code for read.table(), which is called by read.delim() with differing defaults, there is:

  rlabp <- (cols - col1) == 1L

and a few lines further down:

  if (rlabp) 
    col.names <- c("row.names", col.names)

So the last code snippet is where a new first column name called 'row.names' is prepended to the column names found from reading the header row. 'cols' (the number of data columns detected) and 'col1' (the number of entries in the header line) are defined in prior code based upon various conditions.
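
As a minimal sketch of that condition, using made-up data rather than Ed's file: the header below has three names while each data line has four fields, so (cols - col1) == 1L holds. The object names txt, d1 and d2 are only for illustration.

```r
# Hypothetical input: 3 header entries, 4 fields on every data line
txt <- "a\tb\tc\n1\t2\t3\t4\n5\t6\t7\t8\n"

# row.names not specified: the first column is taken to be the row names
d1 <- read.delim(text = txt)
names(d1)     # "a" "b" "c"
rownames(d1)  # "1" "5"

# row.names = NULL: row numbering is forced, so the first column stays in
# the data frame under the prepended name 'row.names'
d2 <- read.delim(text = txt, row.names = NULL)
names(d2)     # "row.names" "a" "b" "c"
```

Whether Ed's file actually meets this condition (for example through an extra TAB at the end of each data line) is an assumption that cannot be checked from the pasted text alone.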

Not having the full data set and possibly having line wrap and TAB problems with the text that Ed pasted into his original post, I cannot properly replicate the conditions that cause the above code to be triggered. 

If Ed can put the entire file someplace and provide a URL for download, perhaps we can better trace the source of the problem, or Ed might use ?debug to follow the code execution in read.table() and see where the relevant flags get triggered. The latter option would help Ed learn how to use the debugging tools that R provides to dig more deeply into such issues.
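
One possible way to start, sketched here on a made-up temporary file since the real one is not available, is to compare the number of fields on the header line with the number on the data lines using count.fields(), and then step through read.table() interactively. The file contents and the names tmp and n are assumptions for illustration only.

```r
# Write a small hypothetical file: 3 header entries, 4 fields per data line
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3\t4", "5\t6\t7\t8"), tmp)

# Number of TAB-separated fields on each line; a header count that is one
# less than the data count is what makes read.table() set rlabp to TRUE
n <- count.fields(tmp, sep = "\t")
n  # 3 4 4

# To follow the code execution interactively, flag the function for a single
# debugging run and then call it (left commented out here because it opens
# the browser):
# debugonce(read.table)
# BACS <- read.delim(tmp, row.names = NULL, fill = TRUE)
```

Running the same count.fields() call on testdata.txt would show directly whether the header and data lines disagree.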

Regards,

Marc Schwartz


On Jun 4, 2012, at 1:30 PM, Bert Gunter wrote:

> Actually, I think it's ?data.frame that he should read.
> 
> The salient points are that:
> 1. All data frames must have unique row names. If not provided, they
> are produced. Row numbers **are** row names.
> 
> 2. The return values of the read methods are data frames.
> 
> -- Bert
> 
> On Mon, Jun 4, 2012 at 11:05 AM, David L Carlson <dcarlson at tamu.edu> wrote:
>> Try help("read.delim") - always a good strategy before using a function for
>> the first time:
>> 
>> In it, you will find: "Using row.names = NULL forces row numbering. Missing
>> or NULL row.names generate row names that are considered to be 'automatic'
>> (and not preserved by as.matrix)."
>> 
>> ----------------------------------------------
>> David L Carlson
>> Associate Professor of Anthropology
>> Texas A&M University
>> College Station, TX 77843-4352
>> 
>> 
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>>> project.org] On Behalf Of Ed Siefker
>>> Sent: Monday, June 04, 2012 12:47 PM
>>> To: r-help at r-project.org
>>> Subject: [R] Why do I have a column called row.names?
>>> 
>>> I'm trying to read in a tab separated table with read.delim().
>>> I don't particularly care what the row names are.
>>> My data file looks like this:
>>> 
>>> 
>>> start   stop    Symbol  Insert sequence Clone End Pair  FISH
>>> 203048  67173930        ABC8-43024000D23                TI:993812543
>>>  TI:993834585
>>> 255176  87869359        ABC8-43034700N15                TI:995224581
>>>  TI:995237913
>>> 1022033 1060472 ABC27-1253C21           TI:2094436044   TI:2094696079
>>> 1022033 1061172 ABC23-1388A1            TI:2120730727   TI:2121592459
>>> 
>>> 
>>> 
>>> I have to do something with row.names because my first column has
>>> duplicate entries.  So I read in the file like this:
>>> 
>>>> BACS<-read.delim("testdata.txt", row.names=NULL, fill=TRUE)
>>>> head(BACS)
>>>   row.names    start             stop Symbol Insert.sequence
>>> Clone.End.Pair
>>> 1    203048 67173930 ABC8-43024000D23     NA    TI:993812543
>>> TI:993834585
>>> 2    255176 87869359 ABC8-43034700N15     NA    TI:995224581
>>> TI:995237913
>>> 3   1022033  1060472    ABC27-1253C21     NA   TI:2094436044
>>> TI:2094696079
>>> 4   1022033  1061172     ABC23-1388A1     NA   TI:2120730727
>>> TI:2121592459
>>>   FISH
>>> 1   NA
>>> 2   NA
>>> 3   NA
>>> 4   NA
>>> 
>>> 
>>> Why is there a column named "row.names"?  I've tried a few different
>>> ways of invoking this, but I always get the first column named
>>> row.names,
>>> and the rest of the columns shifted by one.
>>> 
>>> Obviously I could fix this by using row.names<-, but I'd like to
>>> understand
>>> why this happens.  Any insight?


