[BioC] ReportingTools error

Sun Apr 14 10:33:57 CEST 2013

Hi Jim and Jason,

As Jim says, I didn't intend topTable() and topTableF() to be different in 
terms of row.names.  The topTableF() function was written later, and I was 
more careful to carry through the official row names of the data object 
than when I wrote topTable().  I haven't fixed this discordance because it 
isn't clear to me which way is actually better.

If lmFit() worked only on ExpressionSet objects, then it would be 
straightforward.  All ExpressionSet objects have unique row names, which 
could be percolated through to the fitted MArrayLM object and to the 
topTable output.

However lmFit() is designed to work on plain matrices and other types of 
objects.  A matrix might not have row names set and, if it has, the row 
names do not have to be unique.  Hence it is impossible to require in 
general that the MArrayLM and the topTable objects have the same row 
names.  If the matrix has duplicate row names, which is perfectly allowed, 
then the MarrayLM will do so also, but topTable cannot have duplicate row 
names because it is a data.frame.

When, lmFit() gets a matrix without rownames, it leaves the row names 
NULL.  When the fit object gets to topTable(), topTable() adds rownames 
1:N (where N is the number of rows in the fit object).

When lmFit() gets a matrix with rownames, then it creates a data.frame 
'genes' with column ID.  This then becomes a column in the topTable() 
output.  The rownames are set to 1:N as well.  Note that ID can have 
duplicate values.

If lmFit() gets an ExpressionSet object with featureData, it copies all 
the featureData into the 'genes' data.frame, which becomes part of the 
topTable output.

If lmFit() gets an ExpressionSet object without featureData, then it 
creates a data.frame with a column ID containing the row.names, just as it 
does for matrices.  This time ID cannot have duplicate values.

Note that lmFit() creates a column ID in the 'genes' data.frame only when 
row.names are provided but no other valid probe annotation is available. 
If a data.frame of annotation is provided, then lmFit() simply uses the 
columns that are provided.  This is why a column ID is sometimes but not 
always available.

Here is my tentative solution:

1. lmFit() will start enforcing unique row names in MArrayLM objects.  If 
no row names are provided, then row names will be set to 1:N.  If the row 
names of the expression matrix are not unique, then they will be made 
unique by makeUnique().

2. lmFit() will no longer create an annotation column called 'ID'. 
Instead the relevant information will be held in the row.names.

2. topTable() will preserve the row.names found in the MArrayLM fit 
object, same as topTableF does currently.

How does that sound?

Possible cons:  when rownames are non-unique, the makeUnique() function 
will create rownames previously unknown to the user.  Also I wonder how 
many people with miss the numerical row numbers currently output by 
topTable?

Regards
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
http://www.statsci.org/smyth

On Fri, 12 Apr 2013, Jason Hackney wrote:

> Hi Jim,
>
> My concern about relying on the ID field is that it isn't always there. For
> instance, when I add featureData to an eSet, I almost always specify that
> the ID is a ProbeID for a microarray or a GeneID if I'm using some other
> identifier as my featureNames. When lmFit is called, the genes data.frame
> now doesn't have an ID column.
>
> What I might do is try to detect the ID column in either case, and use it
> if it's present.
>
> I expect that when/if topTableF and topTable are concordant in their
> row.names I'll know about because one of my unit tests will fail because
> they are expected to be discordant.
>
> Cheers,
>
> Jason
>
> On Fri, Apr 12, 2013 at 6:33 AM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
>> Hi Jason,
>>
>> I see the same thing - I had an email exchange with Gordon back in
>> February and he agreed that the row.names of the output from topTable and
>> topTableF should be the same thing, and it looked like he was leaning
>> towards using the row numbers. Given the speed with which he updates things
>> in limma, I assumed this happened approximately 13 nanoseconds later, but
>> evidently it either fell through the cracks or he had a change of mind
>> (Gordon is cc'ed).
>>
>> But I wonder if the ID column is a better way to go anyway.
>>
>> Gordon - what is the safest way to use data from either topTable or
>> topTableF to extract the corresponding raw data from the input object? Is
>> the ID column guaranteed to always correspond to the row.names or
>> featureNames of the data passed into lmFit?
>>
>> Best,
>>
>> Jim
>>
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>>  [1] ReportingTools_2.1.2      knitr_1.2
>>  [3] lattice_0.20-15           affycoretools_1.32.0
>>  [5] KEGG.db_2.9.0             GO.db_2.9.0
>>  [7] AnnotationDbi_1.22.1      affy_1.38.0
>>  [9] pd.ragene.1.0.st.v1_3.8.0 RSQLite_0.11.2
>> [11] DBI_0.2-5                 limma_3.16.1
>> [13] oligo_1.24.0              Biobase_2.20.0
>> [15] oligoClasses_1.22.0       BiocGenerics_0.6.0
>>
>> [snip]
>>
>>
>>
>>
>> On 4/11/2013 9:24 PM, Jason Hackney wrote:
>>
>>> Hi Jim,
>>>
>>> Could you send me your sessionInfo? I'm having trouble replicating this
>>> bug. I'm still getting probe names for topTableF and row numbers for
>>> topTable, as of limma_3.16.1. I'll pop in a bug fix to the ReportingTools
>>> trunk tomorrow, once I get the limma version sorted.
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> On Thu, Apr 11, 2013 at 11:05 AM, James W. MacDonald <jmacdon at uw.edu<mailto:
>>> jmacdon at uw.edu>> wrote:
>>>
>>>     Hi,
>>>
>>>     I am getting an error when trying to create HTML pages with
>>>     ReportingTools, using an MArrayLM object as input. The error I get is
>>>
>>>     Error in expression.dat[probe, ] : subscript out of bounds
>>>
>>>     which appears to come from .make.gene.plots(), specifically here:
>>>
>>>      for (probe in rownames(df)) {
>>>             if ("Symbol" %in% colnames(df)) {
>>>                 ylab <- paste(df[probe, "Symbol"], ylab.type)
>>>             }
>>>             else {
>>>                 ylab <- paste(probe, ylab.type)
>>>             }
>>>             bigplot <- stripplot(expression.dat[**probe, ] ~ factor,
>>>
>>>     The problem being that the rownames for a topTable object will be
>>>     the row numbers of the MArrayLM object from whence the data came
>>>     (this was recently harmonized by Gordon Smyth, so the row.names
>>>     will always be the row number, regardless of using topTable() or
>>>     topTableF()).
>>>
>>>     In other words, it appears that probe is assumed to be the row
>>>     name, when in fact it will be the row number. So something like
>>>
>>>     for(probe in as.numeric(rownames(df))){
>>>
>>>     should do the trick.
>>>
>>>     Best,
>>>
>>>     Jim
>>>
>>>
>>>
>>>     --     James W. MacDonald, M.S.
>>>     Biostatistician
>>>     University of Washington
>>>     Environmental and Occupational Health Sciences
>>>     4225 Roosevelt Way NE, # 100
>>>     Seattle WA 98105-6099
>>>
>>>
>>>
>>>
>>> --
>>> Jason A. Hackney, Ph.D.
>>> Bioinformatics and Computational Biology
>>> Genentech
>>> hackney.jason at gene.com <mailto:hackney.jason at gene.com**>
>>> 650-467-5084

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}