[BioC] edgeR for data combined from different studies and/or platforms

Tue Mar 19 11:13:45 CET 2013

Thank you for your thoughts.

I want to compare the two cancers to study cancer-specific
genes/pathways. Both are cancers of the same organ. There is no sample
that is common to both data-sets. The sets do include cancer-adjacent
normal tissue samples, and I can examine the combined data to see whether
the normal samples of one set resemble those of the other, though
one can question the underlying assumption that adjacent 'normal'
tissue of one cancer is like that of another.
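
A minimal sketch of that check on the normal samples, assuming edgeR's
plotMDS() is used for the comparison. The counts are simulated and the
object names (normals, study) are hypothetical stand-ins for the real
TCGA matrices:

```r
library(edgeR)

set.seed(3)
n_genes <- 400
# Hypothetical data: 6 adjacent-normal samples from each of the two studies.
study <- factor(rep(c("study1", "study2"), each = 6))
normals <- matrix(rnbinom(n_genes * 12, mu = 100, size = 10), nrow = n_genes)
colnames(normals) <- paste0(study, "_", rep(1:6, times = 2))

y <- DGEList(counts = normals, group = study)
y <- calcNormFactors(y)

# If the normal samples separate by study on the leading dimension,
# that points to a study (batch) effect rather than biology.
plotMDS(y, labels = as.character(study), col = as.integer(study))
```

If the two sets of normals interleave on the plot, that is weak evidence
that inter-study effects are small; a clean split by study suggests the
opposite.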

The Cancer Genome Atlas (TCGA) work for the two cancers is quite
elaborate and vast. There are hundreds of samples for each cancer, and
TCGA seems to have very standardized protocols for handling,
processing and assaying the cancer specimens obtained from many
institutions. The mRNA expression (sequencing) data for both cancers
have been (and are being) obtained using the same platform.
Presumably, the sequencing data collection has been going on for
multiple years and has involved many different people. Knowing
this, one would think that persistent systematic variability between the
data from the two cancers will be minimal (especially, perhaps, for
sequencing data) and that one could reasonably combine data for the two
cancers for inter-cancer comparisons.

Santos

On Mon, Mar 18, 2013 at 1:07 PM, Ryan C. Thompson <rct at thompsonclan.org> wrote:
> It sounds like the two cancer projects are essentially independent. What do
> you hope to gain by combining them? As far as I can tell, the main advantage
> would be estimating dispersions from a larger pool of samples. Obviously,
> this only makes sense if the dispersions are similar in the two centers'
> datasets. You could try estimating dispersions separately for both centers
> and then plotting them against each other for each gene. If you get a
> reasonable clustering around the identity line, then you could probably
> justify combining the datasets for better dispersion estimation.
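
The dispersion comparison suggested above could be sketched roughly as
follows. This is only an illustration: the counts are simulated, and
counts_a, counts_b and tagwise_disp() are hypothetical names, not part
of edgeR.

```r
library(edgeR)

set.seed(1)
n_genes <- 500
# Simulated counts standing in for the two centers' real count matrices
# (hypothetical: 3 normal + 3 tumor samples per center).
group <- factor(rep(c("normal", "tumor"), each = 3))
counts_a <- matrix(rnbinom(n_genes * 6, mu = 100, size = 10), nrow = n_genes)
counts_b <- matrix(rnbinom(n_genes * 6, mu = 100, size = 10), nrow = n_genes)

# Estimate per-gene (tagwise) dispersions within a single center.
tagwise_disp <- function(counts, group) {
  y <- DGEList(counts = counts, group = group)
  y <- calcNormFactors(y)
  y <- estimateCommonDisp(y)
  y <- estimateTagwiseDisp(y)
  y$tagwise.dispersion
}

disp_a <- tagwise_disp(counts_a, group)
disp_b <- tagwise_disp(counts_b, group)

# Plot one center against the other on a log scale; points scattered
# around the identity line suggest comparable dispersions.
plot(log2(disp_a), log2(disp_b),
     xlab = "log2 dispersion, center A", ylab = "log2 dispersion, center B")
abline(0, 1, col = "red")
```

With real data the two matrices would need the same genes in the same
row order before plotting.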
>
> In any case, if you decide to combine the datasets from multiple centers,
> you would probably want to use edgeR's GLM methods, not exactTest, since you
> would want to use a design matrix that incorporates effects for differences
> between centers (and, if more than one center works on the same cancer,
> center-cancer interaction effects).
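
A rough sketch of such a GLM analysis, assuming two centers that each
contribute normal and tumor samples. The data are simulated and the
factor names (center, group) are hypothetical. estimateDisp() is from
more recent edgeR releases; in older versions the estimateGLMCommonDisp(),
estimateGLMTrendedDisp() and estimateGLMTagwiseDisp() sequence plays the
same role.

```r
library(edgeR)

set.seed(2)
n_genes <- 500
center <- factor(rep(c("A", "B"), each = 6))
group  <- factor(rep(rep(c("normal", "tumor"), each = 3), times = 2))
counts <- matrix(rnbinom(n_genes * 12, mu = 50, size = 10), nrow = n_genes)

# Additive model: an intercept, a center effect, and a tumor-vs-normal effect.
design <- model.matrix(~ center + group)

y <- DGEList(counts = counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)

fit <- glmFit(y, design)
# Coefficient 3 is the tumor-vs-normal effect, adjusted for center.
lrt <- glmLRT(fit, coef = 3)
topTags(lrt, n = 5)
```

Replacing the formula with ~ center * group would add the interaction
term mentioned above.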
>
> So, assuming that you believe that combining the datasets will improve your
> dispersion estimation, my answers to your questions would be:
>
> (1) If the two studies have no groups in common, then comparisons between a
> group from one study and a group from the other study will probably not be
> meaningful, since the effects would be confounded with the inter-study
> effects. However, comparing two groups from the same study is perfectly
> valid.
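
The confounding in (1) can be seen directly in the design matrix. The
following is a small base-R illustration with made-up sample labels; the
second design assumes that adjacent-normal tissue is comparable across
the two studies, which is exactly the assumption questioned at the top
of this message.

```r
# Case 1: each cancer is seen in only one study, no shared group.
study  <- factor(rep(c("study1", "study2"), each = 4))
cancer <- factor(rep(c("cancerX", "cancerY"), each = 4))
design_confounded <- model.matrix(~ study + cancer)

# The study and cancer columns are identical, so the matrix is rank
# deficient and the two effects cannot be separated.
rank_confounded <- qr(design_confounded)$rank
ncol_confounded <- ncol(design_confounded)

# Case 2: both studies also contribute adjacent-normal samples.
study2 <- factor(rep(c("study1", "study2"), each = 4))
tissue <- factor(c("normal", "normal", "cancerX", "cancerX",
                   "normal", "normal", "cancerY", "cancerY"),
                 levels = c("normal", "cancerX", "cancerY"))
design_normals <- model.matrix(~ study2 + tissue)

# With normals as a shared reference the design is full rank, so a
# cancerX-vs-cancerY contrast becomes estimable.
rank_normals <- qr(design_normals)$rank
ncol_normals <- ncol(design_normals)
```

Full rank only means the contrast can be computed; it is only as
trustworthy as the assumption that the normals differ between studies
by nothing more than the study effect.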
>
> (2) If different sequencing platforms were used, there are a few problems
> that could arise. First, they might produce very different library sizes, and
> while in theory edgeR should deal with this, in practice larger differences
> in library size will probably cause more problems, because there is more to
> correct for. Second, inter-platform differences will be confounded with
> inter-center differences. But that's OK, since you don't necessarily need to
> know either of those directly. I'm sure there are other issues that I
> haven't thought of, so anyone else please feel free to chime in.
>
> (3) I see no reason that the above should not apply to microRNA libraries.
> The total library size shouldn't matter so much as the per-gene counts. If
> each miRNA has a maximum of 10 reads in every sample, then you're working
> from very little data and you should not expect very good results.
>
>
> On Sun 17 Mar 2013 07:48:50 PM PDT, Santos [guest] wrote:
>>
>>
>>
>> How suitable is edgeR for analyzing RNA sequencing data obtained from
>> multiple studies, possibly using multiple platforms?
>>
>> I am trying to compare mRNA sequencing data obtained for two different
>> cancers by the Cancer Genome Atlas (TCGA) project. Different research teams
>> are handling the work for the two different cancers, and TCGA regularly
>> releases updated 'level 3' (within-cancer) RSEM-processed data for
>> cancer-specific sub-projects (each with 200+ samples).
>>
>> I am trying to use edgeR for differential expression analyses with the exact
>> test, using 'raw count' values in the two cancer data-sets as the input for
>> edgeR. I plan to use edgeR with its default settings, except for prior.df in
>> estimateTagwiseDisp() (I intend to use 0.5 instead of 20) and
>> rowsum.filter in estimateCommonDisp() (I intend to use perhaps 500 instead
>> of 5).
>>
>> (1) Is it OK to use edgeR for such cross-study comparison when the two
>> groups I want to compare have been exclusively examined by just one of the
>> two studies?
>>
>> (2) In my case, the sequencing platform is the same for the two studies.
>> Had it been different, could I still use edgeR?
>>
>> (3) Do answers to the above two questions also apply for microRNA
>> sequencing studies (where library [total count] sizes are typically 10-20x
>> smaller)?
>>
>> Thank you.
>>
>> Santos
>>
>>
>> -- output of sessionInfo():
>>
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] grid stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] edgeR_3.0.8 limma_3.14.4 EBSeq_1.1.6
>> [4] gplots_2.11.0 MASS_7.3-23 KernSmooth_2.23-9
>> [7] caTools_1.14 gdata_2.12.0 gtools_2.7.0
>> [10] blockmodeling_0.1.8 reshape2_1.2.2 plyr_1.8
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-4.2 stringr_0.6.2 tools_2.15.1
>>