[R] Why do I have a column called row.names?

Mon Jun 4 22:12:54 CEST 2012

On 6/4/2012 12:12 PM, Marc Schwartz wrote:
> To jump into the fray, he really needs to read the Details section of
> ?read.table and arguably, the source code for read.table().
>
> It is not that the resultant data frame has row names, but that an
> additional first *column name* called 'row.names' is created, which
> does not exist in the source data.
>
> The Details section has:
>
> If row.names is not specified and the header line has one less entry
> than the number of columns, the first column is taken to be the row
> names. This allows data frames to be read in from the format in which
> they are printed. If row.names is specified and does not refer to the
> first column, that column is discarded from such files.
>
> The number of data columns is determined by looking at the first five
> lines of input (or the whole file if it has less than five lines), or
> from the length of col.names if it is specified and is longer. This
> could conceivably be wrong if fill or blank.lines.skip are true, so
> specify col.names if necessary (as in the ‘Examples’).
>
>
> In the source code for read.table(), which is called by read.delim()
> with differing defaults, there is:
>
> rlabp <- (cols - col1) == 1L
>
> and a few lines further down:
>
> if (rlabp) col.names <- c("row.names", col.names)
>
> So the last code snippet is where a new first column name called
> 'row.names' is pre-pended to the column names found from reading the
> header row. 'cols' and 'col1' are defined in prior code based upon
> various conditions.
>
> Not having the full data set and possibly having line wrap and TAB
> problems with the text that Ed pasted into his original post, I
> cannot properly replicate the conditions that cause the above code to
> be triggered.
>
> If Ed can put the entire file someplace and provide a URL for
> download, perhaps we can better trace the source of the problem, or
> Ed might use ?debug to follow the code execution in read.table() and
> see where the relevant flags get triggered. The latter option would
> help Ed learn how to use the debugging tools that R provides to dig
> more deeply into such issues.

I agree that the actual file would be helpful. But I can reproduce the
problem when each data line ends with an extra delimiter, which is easy
to miss when the separator is a tab, because a trailing tab is not
visible. I can get it with:

BACS<-read.delim(textConnection(
"start\tstop\tSymbol\tInsert sequence\tClone End Pair\tFISH
203048\t67173930\t\tABC8-43024000D23\tTI:993812543\tTI:993834585\t
255176\t87869359\t\tABC8-43034700N15\tTI:995224581\tTI:995237913\t
1022033\t1060472\t\tABC27-1253C21\tTI:2094436044\tTI:2094696079\t
1022033\t1061172\t\tABC23-1388A1\tTI:2120730727\tTI:2121592459\t"),
                  row.names=NULL, fill=TRUE)

which gives

 > BACS
   row.names    start stop           Symbol Insert.sequence
1    203048 67173930   NA ABC8-43024000D23    TI:993812543
2    255176 87869359   NA ABC8-43034700N15    TI:995224581
3   1022033  1060472   NA    ABC27-1253C21   TI:2094436044
4   1022033  1061172   NA     ABC23-1388A1   TI:2120730727
   Clone.End.Pair FISH
1   TI:993834585   NA
2   TI:995237913   NA
3  TI:2094696079   NA
4  TI:2121592459   NA

or

 > str(BACS)
'data.frame':	4 obs. of  7 variables:
  $ row.names      : chr  "203048" "255176" "1022033" "1022033"
  $ start          : int  67173930 87869359 1060472 1061172
  $ stop           : logi  NA NA NA NA
  $ Symbol         : Factor w/ 4 levels "ABC23-1388A1",..: 3 4 2 1
  $ Insert.sequence: Factor w/ 4 levels "TI:2094436044",..: 3 4 1 2
  $ Clone.End.Pair : Factor w/ 4 levels "TI:2094696079",..: 3 4 1 2
  $ FISH           : logi  NA NA NA NA

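The extra delimiter at the end of each data line makes the data lines
one field longer than the header line. That is the condition described
in the Details section quoted above, so the first column is treated as
row names. Because row.names=NULL was given, that column is not used as
row names; it is kept as an ordinary column named row.names, and the
remaining column names shift one place to the right.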
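
If the real file does have trailing tabs, one way to confirm it and work
around it is sketched below. This is only a sketch under assumptions: the
sample text is my reconstruction from above (the trailing "\t" on each
data line is a guess about Ed's file), and 'tmp' and 'BACS2' are names
made up for the illustration. count.fields() reports the number of fields
on each line, and sub() drops one trailing tab per line before the text
is handed to read.delim().

```r
# Reconstructed sample; the trailing "\t" on the data lines is an
# assumption about what the real file contains.
txt <- paste(
  "start\tstop\tSymbol\tInsert sequence\tClone End Pair\tFISH",
  "203048\t67173930\t\tABC8-43024000D23\tTI:993812543\tTI:993834585\t",
  "255176\t87869359\t\tABC8-43034700N15\tTI:995224581\tTI:995237913\t",
  "1022033\t1060472\t\tABC27-1253C21\tTI:2094436044\tTI:2094696079\t",
  "1022033\t1061172\t\tABC23-1388A1\tTI:2120730727\tTI:2121592459\t",
  sep = "\n")

# Hypothetical stand-in for testdata.txt
tmp <- tempfile(fileext = ".txt")
writeLines(txt, tmp)

# Number of fields on each line. Data lines with one more field than
# the header line are what lead to the row.names column.
nf <- count.fields(tmp, sep = "\t", quote = "\"")
print(nf)

# Remove a single trailing tab from every line, then read the cleaned
# lines. Header and data now have the same number of fields.
cleaned <- sub("\t$", "", readLines(tmp))
BACS2 <- read.delim(text = cleaned)
print(names(BACS2))

unlink(tmp)
```

With header and data lines the same length, neither row.names=NULL nor
fill=TRUE is needed, and duplicate values in the first column are not a
problem because that column is no longer taken as row names.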

> Regards,
>
> Marc Schwartz
>
>
> On Jun 4, 2012, at 1:30 PM, Bert Gunter wrote:
>
>> Actually, I think it's ?data.frame that he should read.
>>
>> The salient points are that:
>> 1. All data frames must have unique row names. If not provided, they
>> are produced. Row numbers **are** row names.
>>
>> 2. The return values of the read methods are data frames.
>>
>> -- Bert
>>
>> On Mon, Jun 4, 2012 at 11:05 AM, David L Carlson<dcarlson at tamu.edu>  wrote:
>>> Try help("read.delim") - always a good strategy before using a function for
>>> the first time:
>>>
>>> In it, you will find: "Using row.names = NULL forces row numbering. Missing
>>> or NULL row.names generate row names that are considered to be 'automatic'
>>> (and not preserved by as.matrix)."
>>>
>>> ----------------------------------------------
>>> David L Carlson
>>> Associate Professor of Anthropology
>>> Texas A&M University
>>> College Station, TX 77843-4352
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>>>> project.org] On Behalf Of Ed Siefker
>>>> Sent: Monday, June 04, 2012 12:47 PM
>>>> To: r-help at r-project.org
>>>> Subject: [R] Why do I have a column called row.names?
>>>>
>>>> I'm trying to read in a tab separated table with read.delim().
>>>> I don't particularly care what the row names are.
>>>> My data file looks like this:
>>>>
>>>>
>>>> start   stop    Symbol  Insert sequence Clone End Pair  FISH
>>>> 203048  67173930        ABC8-43024000D23                TI:993812543
>>>>   TI:993834585
>>>> 255176  87869359        ABC8-43034700N15                TI:995224581
>>>>   TI:995237913
>>>> 1022033 1060472 ABC27-1253C21           TI:2094436044   TI:2094696079
>>>> 1022033 1061172 ABC23-1388A1            TI:2120730727   TI:2121592459
>>>>
>>>>
>>>>
>>>> I have to do something with row.names because my first column has
>>>> duplicate entries.  So I read in the file like this:
>>>>
>>>>> BACS<-read.delim("testdata.txt", row.names=NULL, fill=TRUE)
>>>>> head(BACS)
>>>>    row.names    start             stop Symbol Insert.sequence
>>>> Clone.End.Pair
>>>> 1    203048 67173930 ABC8-43024000D23     NA    TI:993812543
>>>> TI:993834585
>>>> 2    255176 87869359 ABC8-43034700N15     NA    TI:995224581
>>>> TI:995237913
>>>> 3   1022033  1060472    ABC27-1253C21     NA   TI:2094436044
>>>> TI:2094696079
>>>> 4   1022033  1061172     ABC23-1388A1     NA   TI:2120730727
>>>> TI:2121592459
>>>>    FISH
>>>> 1   NA
>>>> 2   NA
>>>> 3   NA
>>>> 4   NA
>>>>
>>>>
>>>> Why is there a column named "row.names"?  I've tried a few different
>>>> ways of invoking this, but I always get the first column named
>>>> row.names,
>>>> and the rest of the columns shifted by one.
>>>>
>>>> Obviously I could fix this by using row.names<-, but I'd like to
>>>> understand
>>>> why this happens.  Any insight?
>

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University