Hi Simon, Hi Naomi,
thanks for your feedback.

The filtering removes a really nice chunk of my features:
> use <- (rs > quantile(rs, 0.4))
> quantile(rs,0.4)
40%
23
> table(use)
use
FALSE  TRUE
28383 42190
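
A minimal sketch of this kind of filter, using a made-up count matrix. The
object `counts`, its dimensions and its values are assumptions for
illustration only, not the data discussed here; `rs` is taken to be the row
sum over all samples, computed without looking at the condition labels.

```r
# Toy count matrix standing in for the real count table (hypothetical values).
set.seed(1)
counts <- matrix(rnbinom(1000 * 4, mu = 50, size = 0.5), ncol = 4)

# Row sums across all samples; the condition labels are not used here.
rs <- rowSums(counts)

# Keep only features whose row sum lies above the 40% quantile.
cutoff <- quantile(rs, 0.4)
use <- rs > cutoff

# Filtered table that would be passed on to the downstream test.
countsFiltered <- counts[use, , drop = FALSE]
table(use)
```

Because the comparison is a strict `>`, features whose row sum equals the
cutoff are dropped along with those below it.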

So after filtering, the histogram of my raw p-values has two peaks: one at
p=0 and one at p=1. The peak at 0 is about half the height of the peak at 1,
and the bins in between rise from roughly half the height of the "0" peak to
roughly half the height of the "1" peak. So this tells me I should stay with
Benjamini-Hochberg, I guess. I would be really interested in your
dissertation, Naomi, as I still have a lot to learn here.
As I understand it, my filtering cutoff (the 0.4 quantile of the row sums)
is 23, so features with a row sum of 23 or less are dropped. I guess that's
fine.
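
A rough sketch of the histogram check and of a BH adjustment restricted to
the features that passed the filter. The vectors `pval` and `use` below are
simulated stand-ins (assumptions for illustration), so the shape of this
histogram says nothing about the real data.

```r
set.seed(42)
n <- 5000

# Simulated raw p-values (hypothetical): mostly uniform, some near 0, a few NA.
pval <- c(runif(4000), rbeta(900, 0.5, 10), rep(NA, 100))

# Stand-in for the filter; in practice this would come from the row sums.
use <- rep(c(TRUE, FALSE), length.out = n)

# Histogram of the raw p-values of the filtered features (NAs are ignored).
h <- hist(pval[use], breaks = seq(0, 1, by = 0.05),
          main = "Raw p-values after filtering", xlab = "p")

# BH adjustment on the filtered features only; everything else stays NA.
padj <- rep(NA_real_, n)
padj[use] <- p.adjust(pval[use], method = "BH")
```

Comparing the first and last entries of `h$counts` gives a quick numeric view
of whether there is a second peak near p=1.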

I agree with Simon that my experiment is flawed, and it makes sense not to
try to optimize the statistics on flawed data. So asterisk-hunting is not
really advisable. But as we were limited in sample AND sequencing
capability, that's what we did. Even if it is not perfect and is more of an
exploratory study, the results are quite interesting and a good starting
point for further (and better) experiments.
I was partly interested in the inner workings of the statistical analysis of
such data, and in whether I could improve my analysis a bit. But it seems I
have to improve the experiment first. :-)

Thanks for your time.

Cheers,

Markus


On 26 November 2011 at 20:45, Naomi Altman <naomi@stat.psu.edu> wrote:

> I am doing research on the use of FDR methods with count data.
>
> Filtering definitely helps.  You want to remove features which have so few
> counts that you cannot achieve statistical significance even if all the
> reads come from 1 condition.  This is a bit complicated to determine using
> DESeq due to the dispersion shrinkage, but 10 to 20 are probably good
> cut-offs.
>
> Storey's method works well with count data if the estimate of pi_0 is OK.
> To determine this, draw a histogram of the raw p-values (from the filtered
> data).  There should be a single peak near p=0.  If there is another peak
> near p=1, then Storey's method does not work so well.  The Benjamini and
> Hochberg method is more conservative, but it at least works.
>
> The dissertation on which my comments are based should be available by the
> end of January.  I will post a link as soon as I am able.
>
> Naomi
>
>
> At 02:05 PM 11/25/2011, Simon Anders wrote:
>
>> Dear Markus,
>>
>> there are several questions in your mail; I will try to answer them separately.
>>
>> 1. Storey's qvalues: While, technically, the applicability of Storey's
>> method might be a bit narrower than that of Benjamini and Hochberg's, within
>> transcriptomics both are usually equally applicable, and Storey's does
>> give more results.
>>
>> Internally, DESeq calculates the adjusted p values with something like
>>
>>  res$padj <- p.adjust( res$pval, method="BH" )
>>
>> You can also convert the raw p values (res$pval) yourself with Storey's
>> package if you have it installed. Beware that it does not handle NAs well;
>> you may need to take out the NA p values and put them back in afterwards.
>>
>> 2. Independent filtering: In the newest version of the DESeq vignette,
>> we have added a section on independent filtering. Removing, e.g., all genes
>> with, say, an average count below 10 does give you some extra hits.
>>
>> 3. The real reason that you have so few hits is your lack of replicates.
>> In this situation, DESeq reports by design only those hits that are
>> strikingly obvious, and doing otherwise with a sound analysis method is
>> impossible. You cannot expect to get useful results with a flawed
>> experimental design -- and while the two points above might give you a few
>> extra hits, you are unlikely to get usable results without fixing your
>> experiment.
>>
>> 4. Sequencing depth: Remember that it is the total number of counts per
>> gene and _condition_ (not: sample) that gives you power for weakly
>> expressed genes, and the number of replicates that gives you power for the
>> strongly expressed genes. Hence, whenever practically feasible, it is
>> always better to sequence many biological replicate samples to moderate
>> depth than to sequence a few samples very deeply. (Of course, even if
>> replicates are difficult to obtain, two replicates is the minimum. Doing an
>> experiment without that is pointless.)
>>
>>  Simon
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
