[BioC] stat/math question on Category vignette
James W. MacDonald
jmacdon at med.umich.edu
Thu Aug 23 16:12:49 CEST 2007
Hi Mark,
Mark W Kimpel wrote:
> I am working my way through the Category vignette and have a question as
> to how the t statistics for categories are computed from the incidence
> matrix and individual probeset t-statistics. The code that does this can
> be found on the bottom of page 3 (development version vignette) and is
> as follows:
>
> There are 135 pathways (categories)...
> A = AmER2 %*% tobs$statistic
> A = tA/sqrt(rs2)
> names(tA) = row.names(AmER2)
Actually you have a typo here. It should read
tA = AmER2 %*% tobs$statistic
tA = tA/sqrt(rs2)
As for the computation being done here, it is actually very simple.
AmER2 is a matrix of dimension [npathways x nprobesets], where npathways
is the number of pathways you are interrogating, and nprobesets is the
number of probesets that remain after you do all the filtering steps
that preceded this part.
Each row of AmER2 consists of zeros and ones; a zero if the
corresponding probeset doesn't map to that particular pathway, and a one
if it does. By computing AmER2 %*% tobs$statistic, we are (in one shot)
doing the same as
apply(AmER2, 1, function(x) sum(tobs$statistic[as.logical(x)]))
In other words, for each row we are just summing the t-statistics of the
probesets that are in that pathway. Since the number of statistics being
summed differs from pathway to pathway, we then divide by sqrt(rs2),
which is just the square root of the number of t-statistics summed. We
do this to normalize the sums, so that pathways of different sizes are
on a comparable scale.
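A small sketch of the equivalence described above, on made-up numbers. The names A, tstat and n are hypothetical stand-ins for AmER2, tobs$statistic and rs2, and the sketch assumes rs2 holds the row sums of the incidence matrix (the number of probesets per pathway), as described above.

```r
# Toy incidence matrix: 3 pathways (rows) x 5 probesets (columns).
# A 1 means the probeset maps to the pathway, a 0 means it does not.
A <- matrix(c(1, 1, 0, 0, 0,
              0, 1, 1, 1, 0,
              1, 0, 0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("pwA", "pwB", "pwC"), paste0("ps", 1:5)))

# Made-up per-probeset t-statistics (stand-in for tobs$statistic).
tstat <- c(2.0, -1.0, 0.5, 3.0, -0.5)

# Per-pathway sums in one shot via matrix multiplication.
# drop() turns the 3 x 1 result into a named vector.
tA_mat <- drop(A %*% tstat)

# The same sums, computed one row at a time.
tA_apply <- apply(A, 1, function(x) sum(tstat[as.logical(x)]))

# Number of probesets in each pathway (stand-in for rs2).
n <- rowSums(A)

# Normalized sums: divide each sum by the square root of its pathway size.
tA <- tA_mat / sqrt(n)
```

Here all.equal(tA_mat, tA_apply) should be TRUE, which is the point of the comparison: the matrix product and the row-wise apply() give the same per-pathway sums.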
>
> I know this is matrix multiplication, but don't know the mathematical or
> statistical basis for the computation. I am interested in turning the t
> statistic values in tA into p values, so I need to know the df. for each
> resultant t. Is that the rs2?
So to answer this question, the values in tA aren't t-statistics. They
are sums of t-statistics. If you look at the top of the page you are
quoting, you can see that if we make some assumptions, these values are
approximately multivariate normal, so you don't need to know the df.
If you don't want to assume multivariate normal, you can permute to get
the p-value as is done on page 6.
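A rough sketch of both routes on simulated data, not the vignette's own code. Everything here is hypothetical: the data, the incidence matrix A, and the helpers row_t and pathway_stat. It assumes that each normalized sum is approximately standard normal under the null, and that the permutation shuffles sample labels and recomputes the per-probeset t-statistics; check the vignette for what it actually does.

```r
set.seed(1)

# Simulated expression data: 20 probesets x 10 samples, two groups of 5.
nps  <- 20
expr <- matrix(rnorm(nps * 10), nrow = nps)
grp  <- factor(rep(c("a", "b"), each = 5))

# Toy incidence matrix: 4 pathways x 20 probesets, filled in by hand
# so that no pathway is empty.
A <- matrix(0, nrow = 4, ncol = nps)
A[1, 1:5]             <- 1
A[2, 4:10]            <- 1
A[3, 11:15]           <- 1
A[4, c(2, 12, 16:20)] <- 1

# Per-probeset two-sample t-statistics (base R, one t.test per row).
row_t <- function(x, g) {
  vapply(seq_len(nrow(x)), function(i) {
    unname(t.test(x[i, g == levels(g)[1]],
                  x[i, g == levels(g)[2]])$statistic)
  }, numeric(1))
}

# Normalized per-pathway sums of t-statistics.
pathway_stat <- function(x, g) {
  drop(A %*% row_t(x, g)) / sqrt(rowSums(A))
}

obs <- pathway_stat(expr, grp)

# Route 1: two-sided p-values from the standard normal approximation.
p_norm <- 2 * pnorm(-abs(obs))

# Route 2: permutation p-values. Shuffle the sample labels B times and
# count how often the permuted statistic is at least as extreme.
B    <- 200
perm <- replicate(B, pathway_stat(expr, sample(grp)))  # 4 x B matrix
p_perm <- (rowSums(abs(perm) >= abs(obs)) + 1) / (B + 1)
```

The +1 in the numerator and denominator of p_perm is a common convention that keeps a permutation p-value from being exactly zero; B = 200 is kept small only so the sketch runs quickly.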
Best,
Jim
>
> This is no doubt a simple question for the statisticians in the group,
> but not for me! :) Thanks for your help,
>
> Mark
>
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623