[BioC] ReportingTools error
Gordon K Smyth
smyth at wehi.EDU.AU
Sun Apr 14 10:33:57 CEST 2013
Hi Jim and Jason,
As Jim says, I didn't intend topTable() and topTableF() to be different in
terms of row.names. The topTableF() function was written later, and I was
more careful to carry through the official row names of the data object
than when I wrote topTable(). I haven't fixed this discordance because it
isn't clear to me which way is actually better.
If lmFit() worked only on ExpressionSet objects, then it would be
straightforward. All ExpressionSet objects have unique row names, which
could be percolated through to the fitted MArrayLM object and to the
topTable output.
However lmFit() is designed to work on plain matrices and other types of
objects. A matrix might not have row names set and, if it has, the row
names do not have to be unique. Hence it is impossible to require in
general that the MArrayLM and the topTable objects have the same row
names. If the matrix has duplicate row names, which is perfectly allowed,
then the MarrayLM will do so also, but topTable cannot have duplicate row
names because it is a data.frame.
When, lmFit() gets a matrix without rownames, it leaves the row names
NULL. When the fit object gets to topTable(), topTable() adds rownames
1:N (where N is the number of rows in the fit object).
When lmFit() gets a matrix with rownames, then it creates a data.frame
'genes' with column ID. This then becomes a column in the topTable()
output. The rownames are set to 1:N as well. Note that ID can have
duplicate values.
If lmFit() gets an ExpressionSet object with featureData, it copies all
the featureData into the 'genes' data.frame, which becomes part of the
topTable output.
If lmFit() gets an ExpressionSet object without featureData, then it
creates a data.frame with a column ID containing the row.names, just as it
does for matrices. This time ID cannot have duplicate values.
Note that lmFit() creates a column ID in the 'genes' data.frame only when
row.names are provided but no other valid probe annotation is available.
If a data.frame of annotation is provided, then lmFit() simply uses the
columns that are provided. This is why a column ID is sometimes but not
always available.
Here is my tentative solution:
1. lmFit() will start enforcing unique row names in MArrayLM objects. If
no row names are provided, then row names will be set to 1:N. If the row
names of the expression matrix are not unique, then they will be made
unique by makeUnique().
2. lmFit() will no longer create an annotation column called 'ID'.
Instead the relevant information will be held in the row.names.
2. topTable() will preserve the row.names found in the MArrayLM fit
object, same as topTableF does currently.
How does that sound?
Possible cons: when rownames are non-unique, the makeUnique() function
will create rownames previously unknown to the user. Also I wonder how
many people with miss the numerical row numbers currently output by
topTable?
Regards
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
http://www.statsci.org/smyth
On Fri, 12 Apr 2013, Jason Hackney wrote:
> Hi Jim,
>
> My concern about relying on the ID field is that it isn't always there. For
> instance, when I add featureData to an eSet, I almost always specify that
> the ID is a ProbeID for a microarray or a GeneID if I'm using some other
> identifier as my featureNames. When lmFit is called, the genes data.frame
> now doesn't have an ID column.
>
> What I might do is try to detect the ID column in either case, and use it
> if it's present.
>
> I expect that when/if topTableF and topTable are concordant in their
> row.names I'll know about because one of my unit tests will fail because
> they are expected to be discordant.
>
> Cheers,
>
> Jason
>
> On Fri, Apr 12, 2013 at 6:33 AM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
>> Hi Jason,
>>
>> I see the same thing - I had an email exchange with Gordon back in
>> February and he agreed that the row.names of the output from topTable and
>> topTableF should be the same thing, and it looked like he was leaning
>> towards using the row numbers. Given the speed with which he updates things
>> in limma, I assumed this happened approximately 13 nanoseconds later, but
>> evidently it either fell through the cracks or he had a change of mind
>> (Gordon is cc'ed).
>>
>> But I wonder if the ID column is a better way to go anyway.
>>
>> Gordon - what is the safest way to use data from either topTable or
>> topTableF to extract the corresponding raw data from the input object? Is
>> the ID column guaranteed to always correspond to the row.names or
>> featureNames of the data passed into lmFit?
>>
>> Best,
>>
>> Jim
>>
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] ReportingTools_2.1.2 knitr_1.2
>> [3] lattice_0.20-15 affycoretools_1.32.0
>> [5] KEGG.db_2.9.0 GO.db_2.9.0
>> [7] AnnotationDbi_1.22.1 affy_1.38.0
>> [9] pd.ragene.1.0.st.v1_3.8.0 RSQLite_0.11.2
>> [11] DBI_0.2-5 limma_3.16.1
>> [13] oligo_1.24.0 Biobase_2.20.0
>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0
>>
>> [snip]
>>
>>
>>
>>
>> On 4/11/2013 9:24 PM, Jason Hackney wrote:
>>
>>> Hi Jim,
>>>
>>> Could you send me your sessionInfo? I'm having trouble replicating this
>>> bug. I'm still getting probe names for topTableF and row numbers for
>>> topTable, as of limma_3.16.1. I'll pop in a bug fix to the ReportingTools
>>> trunk tomorrow, once I get the limma version sorted.
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> On Thu, Apr 11, 2013 at 11:05 AM, James W. MacDonald <jmacdon at uw.edu<mailto:
>>> jmacdon at uw.edu>> wrote:
>>>
>>> Hi,
>>>
>>> I am getting an error when trying to create HTML pages with
>>> ReportingTools, using an MArrayLM object as input. The error I get is
>>>
>>> Error in expression.dat[probe, ] : subscript out of bounds
>>>
>>> which appears to come from .make.gene.plots(), specifically here:
>>>
>>> for (probe in rownames(df)) {
>>> if ("Symbol" %in% colnames(df)) {
>>> ylab <- paste(df[probe, "Symbol"], ylab.type)
>>> }
>>> else {
>>> ylab <- paste(probe, ylab.type)
>>> }
>>> bigplot <- stripplot(expression.dat[**probe, ] ~ factor,
>>>
>>> The problem being that the rownames for a topTable object will be
>>> the row numbers of the MArrayLM object from whence the data came
>>> (this was recently harmonized by Gordon Smyth, so the row.names
>>> will always be the row number, regardless of using topTable() or
>>> topTableF()).
>>>
>>> In other words, it appears that probe is assumed to be the row
>>> name, when in fact it will be the row number. So something like
>>>
>>> for(probe in as.numeric(rownames(df))){
>>>
>>> should do the trick.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> -- James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>>
>>>
>>>
>>>
>>> --
>>> Jason A. Hackney, Ph.D.
>>> Bioinformatics and Computational Biology
>>> Genentech
>>> hackney.jason at gene.com <mailto:hackney.jason at gene.com**>
>>> 650-467-5084
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list