[R] Why do I have a column called row.names?
Brian Diggs
diggsb at ohsu.edu
Mon Jun 4 22:12:54 CEST 2012
On 6/4/2012 12:12 PM, Marc Schwartz wrote:
> To jump into the fray, he really needs to read the Details section of
> ?read.table and arguably, the source code for read.table().
>
> It is not that the resultant data frame has row names, but that an
> additional first *column name* called 'row.names' is created, which
> does not exist in the source data.
>
> The Details section has:
>
> If row.names is not specified and the header line has one less entry
> than the number of columns, the first column is taken to be the row
> names. This allows data frames to be read in from the format in which
> they are printed. If row.names is specified and does not refer to the
> first column, that column is discarded from such files.
>
> The number of data columns is determined by looking at the first five
> lines of input (or the whole file if it has less than five lines), or
> from the length of col.names if it is specified and is longer. This
> could conceivably be wrong if fill or blank.lines.skip are true, so
> specify col.names if necessary (as in the ‘Examples’).
>
>
> In the source code for read.table(), which is called by read.delim()
> with differing defaults, there is:
>
> rlabp <- (cols - col1) == 1L
>
> and a few lines further down:
>
> if (rlabp) col.names <- c("row.names", col.names)
>
> So the last code snippet is where a new first column name called
> 'row.names' is pre-pended to the column names found from reading the
> header row. 'cols' and 'col1' are defined in prior code based upon
> various conditions.
>
> Not having the full data set and possibly having line wrap and TAB
> problems with the text that Ed pasted into his original post, I
> cannot properly replicate the conditions that cause the above code to
> be triggered.
>
> If Ed can put the entire file someplace and provide a URL for
> download, perhaps we can better trace the source of the problem, or
> Ed might use ?debug to follow the code execution in read.table() and
> see where the relevant flags get triggered. The latter option would
> help Ed learn how to use the debugging tools that R provides to dig
> more deeply into such issues.
I agree that the actual file would be helpful. But I can reproduce the
problem if there are extra delimiters at the end of the data lines
(which is easy to miss when the separator is a tab, since trailing tabs
are not visible). I can get it with:
# The header line has 6 fields; each data line ends in an extra tab,
# so it has 7 fields.
BACS <- read.delim(textConnection(
"start\tstop\tSymbol\tInsert sequence\tClone End Pair\tFISH
203048\t67173930\t\tABC8-43024000D23\tTI:993812543\tTI:993834585\t
255176\t87869359\t\tABC8-43034700N15\tTI:995224581\tTI:995237913\t
1022033\t1060472\t\tABC27-1253C21\tTI:2094436044\tTI:2094696079\t
1022033\t1061172\t\tABC23-1388A1\tTI:2120730727\tTI:2121592459\t"),
row.names = NULL, fill = TRUE)
which gives
> BACS
  row.names    start stop           Symbol Insert.sequence
1    203048 67173930   NA ABC8-43024000D23    TI:993812543
2    255176 87869359   NA ABC8-43034700N15    TI:995224581
3   1022033  1060472   NA    ABC27-1253C21   TI:2094436044
4   1022033  1061172   NA     ABC23-1388A1   TI:2120730727
  Clone.End.Pair FISH
1   TI:993834585   NA
2   TI:995237913   NA
3  TI:2094696079   NA
4  TI:2121592459   NA
or
> str(BACS)
'data.frame': 4 obs. of 7 variables:
 $ row.names      : chr "203048" "255176" "1022033" "1022033"
 $ start          : int 67173930 87869359 1060472 1061172
 $ stop           : logi NA NA NA NA
 $ Symbol         : Factor w/ 4 levels "ABC23-1388A1",..: 3 4 2 1
 $ Insert.sequence: Factor w/ 4 levels "TI:2094436044",..: 3 4 1 2
 $ Clone.End.Pair : Factor w/ 4 levels "TI:2094696079",..: 3 4 1 2
 $ FISH           : logi NA NA NA NA
The extra delimiter at the end of each data line means the data rows
have one more field than the header line has column names, and that is
the condition which produces the row.names column.
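One way to check for this, and to work around it, might look like the
sketch below. It assumes the real file has exactly one spurious trailing
tab on every data line, which I cannot verify without the file itself.
The names txt, n_fields, cleaned and fixed are made up for illustration;
count.fields(), sub(), textConnection() and read.delim() are the usual
base R functions.

```r
# Same text as the example above: a 6-field header, and data lines
# that each end in one extra tab.
txt <- paste(
  "start\tstop\tSymbol\tInsert sequence\tClone End Pair\tFISH",
  "203048\t67173930\t\tABC8-43024000D23\tTI:993812543\tTI:993834585\t",
  "255176\t87869359\t\tABC8-43034700N15\tTI:995224581\tTI:995237913\t",
  "1022033\t1060472\t\tABC27-1253C21\tTI:2094436044\tTI:2094696079\t",
  "1022033\t1061172\t\tABC23-1388A1\tTI:2120730727\tTI:2121592459\t",
  sep = "\n")

# Count the tab-separated fields on each line. If the data lines report
# one more field than the header line, read.table() will treat the first
# column as row names.
n_fields <- count.fields(textConnection(txt), sep = "\t")
print(n_fields)

# Work-around (assumption: the trailing tab is spurious, not a real
# empty last column): strip one trailing tab from each line, then read
# the cleaned lines.
cleaned <- sub("\t$", "", strsplit(txt, "\n", fixed = TRUE)[[1]])
fixed <- read.delim(textConnection(cleaned))
print(names(fixed))
```

For a file on disk, the same check would be something like
count.fields("testdata.txt", sep = "\t"), and readLines() could supply
the lines to clean. If the last column can legitimately be empty,
stripping trailing tabs is not safe, and supplying col.names with one
extra dummy name may be the better route.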
> Regards,
>
> Marc Schwartz
>
>
> On Jun 4, 2012, at 1:30 PM, Bert Gunter wrote:
>
>> Actually, I think it's ?data.frame that he should read.
>>
>> The salient points are that:
>> 1. All data frames must have unique row names. If not provided, they
>> are produced. Row numbers **are** row names.
>>
>> 2. The return value of read methods are data frames.
>>
>> -- Bert
>>
>> On Mon, Jun 4, 2012 at 11:05 AM, David L Carlson <dcarlson at tamu.edu> wrote:
>>> Try help("read.delim") - always a good strategy before using a function for
>>> the first time:
>>>
>>> In it, you will find: "Using row.names = NULL forces row numbering. Missing
>>> or NULL row.names generate row names that are considered to be 'automatic'
>>> (and not preserved by as.matrix)."
>>>
>>> ----------------------------------------------
>>> David L Carlson
>>> Associate Professor of Anthropology
>>> Texas A&M University
>>> College Station, TX 77843-4352
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>>>> project.org] On Behalf Of Ed Siefker
>>>> Sent: Monday, June 04, 2012 12:47 PM
>>>> To: r-help at r-project.org
>>>> Subject: [R] Why do I have a column called row.names?
>>>>
>>>> I'm trying to read in a tab separated table with read.delim().
>>>> I don't particularly care what the row names are.
>>>> My data file looks like this:
>>>>
>>>>
>>>> start stop Symbol Insert sequence Clone End Pair FISH
>>>> 203048 67173930 ABC8-43024000D23 TI:993812543
>>>> TI:993834585
>>>> 255176 87869359 ABC8-43034700N15 TI:995224581
>>>> TI:995237913
>>>> 1022033 1060472 ABC27-1253C21 TI:2094436044 TI:2094696079
>>>> 1022033 1061172 ABC23-1388A1 TI:2120730727 TI:2121592459
>>>>
>>>>
>>>>
>>>> I have to do something with row.names because my first column has
>>>> duplicate entries. So I read in the file like this:
>>>>
>>>>> BACS<-read.delim("testdata.txt", row.names=NULL, fill=TRUE)
>>>>> head(BACS)
>>>> row.names start stop Symbol Insert.sequence
>>>> Clone.End.Pair
>>>> 1 203048 67173930 ABC8-43024000D23 NA TI:993812543
>>>> TI:993834585
>>>> 2 255176 87869359 ABC8-43034700N15 NA TI:995224581
>>>> TI:995237913
>>>> 3 1022033 1060472 ABC27-1253C21 NA TI:2094436044
>>>> TI:2094696079
>>>> 4 1022033 1061172 ABC23-1388A1 NA TI:2120730727
>>>> TI:2121592459
>>>> FISH
>>>> 1 NA
>>>> 2 NA
>>>> 3 NA
>>>> 4 NA
>>>>
>>>>
>>>> Why is there a column named "row.names"? I've tried a few different
>>>> ways of invoking this, but I always get the first column named
>>>> row.names,
>>>> and the rest of the columns shifted by one.
>>>>
>>>> Obviously I could fix this by using row.names<-, but I'd like to
>>>> understand
>>>> why this happens. Any insight?
>
--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University