[BioC] Format problems

Thu Aug 14 13:52:38 MEST 2003

Right. We've been looking into it. This is a problem that has been seen before:
https://stat.ethz.ch/pipermail/bioconductor/2003-June/001681.html

What we think has been happening is this:
Some of our expression levels (from MAS5.0) are zero, which causes trouble when we take log2. After some processing, we were ending up with some NaNs in our table. This means (as Vincent Carey pointed out) that when we pull out the column containing the NaNs on its own, R invokes 'make.names' on the row names of the data.frame in order to generate the names for the column. The result is that '1007_s_at' becomes 'X1007.s.at'. If the column doesn't contain NaNs, R doesn't invoke make.names, so we keep the correct name. Note that this doesn't appear to happen with matrices.
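
On the first point, here is a sketch of how zeros can turn into NaNs. The exact processing isn't described here, so the log-ratio step and the values below are assumptions for illustration only: log2(0) is -Inf, and subtracting -Inf from -Inf gives NaN.

```r
# Hypothetical signals for two probesets in two samples (not real MAS5.0 output).
x <- c(0, 8)
y <- c(0, 2)

log2(x)                    # first element is -Inf, because log2(0) is -Inf
lr <- log2(x) - log2(y)    # log ratio; -Inf - (-Inf) is NaN
is.nan(lr)                 # flags the probeset where both signals were zero
```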

Here is a minimal example showing what is going on:

>  t1 <- c(1, 2, 3, 4, NaN)
>  t2 <- c(1, 2, 3, 4, 5)
>  names(t1) <- c("111_a", "222_a", "333_a", "a_4", "a_5")
>  names(t2) <- c("111_a", "222_a", "333_a", "a_4", "a_5")

>  t3 <- cbind(t1, t2)        # matrix; rownames come from the vector names
>  t4 <- as.data.frame(t3)    # the same data as a data.frame

>  t3[t3[,1]>0,]
      t1 t2
111_a  1  1
222_a  2  2
333_a  3  3
a_4    4  4
<NA>  NA NA
>  t4[t4[,1]>0,]
       t1 t2
X111.a  1  1
X222.a  2  2
X333.a  3  3
a.4     4  4
NA     NA NA
>  t3[t3[,2]>0,]
       t1 t2
111_a   1  1
222_a   2  2
333_a   3  3
a_4     4  4
a_5   NaN  5
>  t4[t4[,2]>0,]
       t1 t2
111_a   1  1
222_a   2  2
333_a   3  3
a_4     4  4
a_5   NaN  5
>

So, if we select rows of a data.frame using a column that contains NaNs, the returned row names are changed ('111_a' becomes 'X111.a', 'a_4' becomes 'a.4'), and the name of the row that held the NaN is replaced by 'NA'. Matrices keep their row names, but note that the rowname of the row that held the NaN becomes <NA>.

For both datatypes, the NaN shows up as NA in the returned columns. This is because NaN > 0 evaluates to NA, and indexing with a logical NA returns a row filled with NA rather than the original row.

I guess the rowname problem is a conflict that arises particularly for Affy users in Bioconductor, because IDs with '_'s in them are changed. It's certainly an issue people should be aware of, since it makes the names of probes in data.frames change according to the kind of numeric data they contain (and also the actual values, e.g. NaN becomes NA).
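
One possible workaround, as a minimal sketch rather than a tested fix: wrap the condition in which(), so that the NA produced by comparing NaN with 0 never reaches the subscript. The data are the toy values from the example above; the names ids, keep and sel are illustrative only, and we have not checked how this behaves across R versions.

```r
# Rebuild the toy data.frame from the example (same values and row names).
t1 <- c(1, 2, 3, 4, NaN)
t2 <- c(1, 2, 3, 4, 5)
ids <- c("111_a", "222_a", "333_a", "a_4", "a_5")
t4 <- data.frame(t1 = t1, t2 = t2, row.names = ids)

# t4[, 1] > 0 is NA for the NaN row. which() returns only the positions
# that are TRUE, so the NA is dropped before the data.frame is indexed.
keep <- which(t4[, 1] > 0)
sel <- t4[keep, ]

rownames(sel)   # the four rows that pass the cutoff, with their original IDs
```

Because the subscript is now a plain vector of row positions, the row holding the NaN is simply left out instead of coming back as a row of NAs.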

Cheers,
Claire and Crispin

> -----Original Message-----
> From: Gordon Smyth [mailto:smyth at wehi.edu.au]
> Sent: 14 August 2003 11:19
> To: Claire Wilson
> Cc: BioC mailing list
> Subject: Re: [BioC] Format problems
> 
> 
> Claire,
> 
> The rules for conversion of column names for data.frames are explained
> under ?make.names. When you use read.table, the column headings are
> passed through make.names to ensure that they are syntactically valid
> variable names. If you don't want read.table to do this, then use the
> argument check.names=FALSE.
> 
> Gordon
> 
> At 08:03 PM 14/08/2003, Claire Wilson wrote:
> >Dear all,
> >This is possibly more of an R question, but because it involves dealing
> >properly with Affy probeset identifiers I'm asking here...
> >
> >Can anyone explain the rules R uses to replace '_' characters with '.'s?
> >I am finding that columns in data frames are sometimes having their
> >rownames changed from '1007_s_at' to 'X1007.s.at' (for example, lines
> >3-6 in the excerpt below). I am also seeing rownames that are being
> >repeated (the last 2 rownames printed out in lines 3-6 in the excerpt
> >below), even though they should be unique. This seems to happen in data
> >frames, but not matrices. I think that it's probably an internal
> >representation I should never get to see, but I'm not sure.
> >For example:
> >
> >I have 2 data.frames that contain fold changes and p-scores for a number
> >of different experiments.  Each data frame has 6 columns: fold change 1,
> >p-score 1, fold change 2, p-score 2, fold change 3, p-score 3, and the
> >rownames are probeset identifiers.  I now have a function that takes a
> >pair of columns from each table, looks at what probesets pass a certain
> >p-score and fold change cutoff and which of these probesets are shared by
> >the 2 tables.  My problem is this: for the 1st 2 pairs of columns (fold
> >change 1, p-score 1, fold change 2, p-score 2) everything works fine, but
> >when I try and compare columns 5 and 6 from each table, the rownames for
> >certain probesets are changed from the standard format into one where they
> >are prefixed by an X and the '_' replaced by a dot.  Putting in print
> >statements shows this:
> >[1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at" - rownames[1:6] table 1
> >[2] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at" - rownames[1:6] table 2
> >[3] "1007_s_at 1053_at X1007.s.at X1053.at X1053.at" - rownames[1:6] table 1 that pass a p-score cutoff
> >[4] "1053_at 121_at X1053.at X117.at X1255.g.at" - rownames[1:6] table 2 that pass a p-score cutoff
> >[5] "121_at 1320_at X121.at X1294.at X1294.at" - rownames[1:6] table 1 that pass a fold change cutoff
> >[6] "1729_at 1729_at X1294.at X1316.at X1316.at" - rownames[1:6] table 2 that pass a fold change cutoff
> >
> >...can anyone help me out with where I am going wrong or has anyone come
> >across similar issues (I am running the latest version of R and the
> >Bioconductor packages).  The data frames passed to the function are made
> >by using cbind to join together different columns from different data
> >frames.
> >
> >Many thanks
> >
> >Claire
> >--
> >Claire Wilson, PhD
> >Bioinformatics group
> >Paterson Institute for Cancer Research
> >Christies Hospital NHS Trust
> >Wilmslow Road,
> >Withington
> >Manchester
> >M20 4BX
> >tel: +44 (0)161 446 8218
> >url: http://bioinf.picr.man.ac.uk/
> >
> >--------------------------------------------------------
> >
> >
> >This email is confidential and intended solely for the use 
> o...{{dropped}}
