[BioC] edgeR for data combined from different studies and/or platforms

Mon Mar 18 18:07:29 CET 2013

It sounds like the two cancer projects are essentially independent. What 
do you hope to gain by combining them? As far as I can tell, the main 
advantage would be estimating dispersions from a larger pool of samples. 
Obviously, this only makes sense if the dispersions are similar in the 
two centers' datasets. You could try estimating per-gene dispersions 
separately for each center and then plotting one center's estimates 
against the other's. If the points cluster reasonably well around the 
identity line, then you could probably justify combining the datasets 
for better dispersion estimation.
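
To make the comparison concrete, here is a rough sketch in plain 
Python. It is not edgeR's estimator: it uses a simple method-of-moments 
estimate based on the negative binomial relation var = mu + phi * mu^2, 
and it ignores library-size normalization. The function names and the 
toy counts are made up for illustration. A log2 ratio near zero 
corresponds to a point near the identity line in the plot described 
above.

```python
import math
from statistics import mean, median, variance


def mom_dispersion(counts):
    """Moment estimate of NB dispersion: (var - mean) / mean^2, floored at 0."""
    m = mean(counts)
    if m == 0:
        return 0.0
    return max((variance(counts) - m) / (m * m), 0.0)


def log2_dispersion_ratios(center_a, center_b):
    """Per-gene log2(disp_A / disp_B) for genes present in both centers.

    Genes whose estimate is zero in either center are skipped, because
    the log ratio is undefined there.
    """
    ratios = {}
    for gene in sorted(set(center_a) & set(center_b)):
        da = mom_dispersion(center_a[gene])
        db = mom_dispersion(center_b[gene])
        if da > 0 and db > 0:
            ratios[gene] = math.log2(da / db)
    return ratios


# Made-up toy counts (gene -> counts across samples), NOT real TCGA data.
center_a = {"g1": [10, 20, 30, 40], "g2": [100, 150, 50, 100], "g3": [5, 5, 5, 5]}
center_b = {"g1": [12, 18, 33, 37], "g2": [90, 160, 60, 90], "g3": [4, 6, 5, 5]}

ratios = log2_dispersion_ratios(center_a, center_b)
# Values near 0 correspond to points near the identity line.
print(ratios)
print("median |log2 ratio|:", median(abs(r) for r in ratios.values()))
```

With real data you would use edgeR's own dispersion estimates for each 
center instead of this moment estimate, and look at the scatter plot 
rather than a single summary number.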

In any case, if you decide to combine the datasets from multiple 
centers, you would probably want to use edgeR's GLM methods, not 
exactTest, since you would want to use a design matrix that incorporates 
effects for differences between centers (and, if more than one center 
works on the same cancer, center-cancer interaction effects).

So, assuming that you believe that combining the datasets will improve 
your dispersion estimation, my answers to your questions would be:

(1) If the two studies have no groups in common, then comparisons 
between a group from one study and a group from the other study will 
probably not be meaningful, since the effects would be confounded with 
the inter-study effects. However, comparing two groups from the same 
study is perfectly valid.
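
The confounding can be seen directly in the design matrix. The sketch 
below is plain Python, not edgeR, and the study and group labels are 
hypothetical. It builds an additive "study + group" design and computes 
its rank. When no group is shared between studies, the study column is 
an exact sum of group columns, so the design is rank deficient and a 
study effect cannot be separated from the groups. Once one group 
appears in both studies, the design becomes full rank.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via exact Gaussian elimination on fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_row(study, group):
    """Columns: intercept, study2, groupB, groupC, groupD (A is baseline)."""
    return [1, int(study == "study2"),
            int(group == "B"), int(group == "C"), int(group == "D")]


# Hypothetical layout 1: no group is shared between the two studies.
nested = [("study1", "A"), ("study1", "B"), ("study2", "C"), ("study2", "D")] * 2
# Hypothetical layout 2: group A was also sequenced by study 2.
shared = nested + [("study2", "A")] * 2

rank_nested = matrix_rank([design_row(s, g) for s, g in nested])
rank_shared = matrix_rank([design_row(s, g) for s, g in shared])
print("no shared group:", rank_nested, "of 5 columns")
print("group A shared: ", rank_shared, "of 5 columns")
```

In the first layout you can still fit a one-way design with one 
coefficient per group, but then any contrast between a study 1 group 
and a study 2 group absorbs the study effect. Within-study contrasts 
are unaffected.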

(2) If different sequencing platforms were used, there are a few 
problems that could arise. First, they might produce very different 
library sizes, and while in theory edgeR should deal with this, in 
practice larger differences in library size will probably cause more 
problems, because there is more to correct for. Second, inter-platform 
differences will be confounded with inter-center differences. That is 
OK, though, since you don't need to estimate either of those effects 
separately; you only need to adjust for their combined effect. I'm sure 
there are other issues that I haven't thought of, so anyone else please 
feel free to chime in.

(3) I see no reason that the above should not apply to microRNA 
libraries. The total library size shouldn't matter so much as the 
per-gene counts. If each miRNA has a maximum of 10 reads in every 
sample, then you're working from very little data and you should not 
expect very good results.
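
As a rough illustration of why low counts hurt, consider pure Poisson 
counting noise, where the relative standard deviation of a count with 
mean mu is 1/sqrt(mu). This is only a lower bound, since biological 
variation adds to it. The function name is made up for this sketch.

```python
import math


def poisson_cv(mean_count):
    """Relative standard deviation (sd / mean) of a Poisson count."""
    return 1.0 / math.sqrt(mean_count)


for c in (10, 100, 1000):
    print(c, round(poisson_cv(c), 3))
# prints:
# 10 0.316
# 100 0.1
# 1000 0.032
```

So at 10 reads, counting noise alone is about 32% of the mean, before 
any biological variability is taken into account.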

On Sun 17 Mar 2013 07:48:50 PM PDT, Santos [guest] wrote:
>
>
> How suitable is edgeR for analyzing RNA sequencing data obtained from 
> multiple studies, possibly using multiple platforms?
>
> I am trying to compare mRNA sequencing data obtained for two different 
> cancers by the Cancer Genome Atlas (TCGA) project. Different research 
> teams are handling the work for the two different cancers, and TCGA 
> regularly releases updated, 'level 3,' (within-cancer) RSEM-processed 
> data for cancer-specific sub-projects (each with 200+ samples).
>
> I am trying to use edgeR for differential expression analyses with 
> Exact test, using 'raw count' values in the two cancer data-sets as 
> the input for edgeR. I plan to use edgeR with its default settings, 
> except for prior.df in estimateTagwiseDisp() -- intend to use 0.5 
> instead of 20 -- and, rowsum.filter in estimateCommonDisp() -- intend 
> to use perhaps 500 instead of 5.
>
> (1) Is it OK to use edgeR for such cross-study comparison when the two 
> groups I want to compare have been exclusively examined by just one of 
> the two studies?
>
> (2) In my case, the sequencing platform is the same for the two 
> studies. Had it been different, could I still use edgeR?
>
> (3) Do answers to the above two questions also apply for microRNA 
> sequencing studies (where library [total count] sizes are typically 
> 10-20x smaller)?
>
> Thank you.
>
> Santos
>
>
> -- output of sessionInfo():
>
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] grid stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] edgeR_3.0.8 limma_3.14.4 EBSeq_1.1.6
> [4] gplots_2.11.0 MASS_7.3-23 KernSmooth_2.23-9
> [7] caTools_1.14 gdata_2.12.0 gtools_2.7.0
> [10] blockmodeling_0.1.8 reshape2_1.2.2 plyr_1.8
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-4.2 stringr_0.6.2 tools_2.15.1