[BioC] edgeR for data combined from different studies and/or platforms
Ryan C. Thompson
rct at thompsonclan.org
Mon Mar 18 18:07:29 CET 2013
It sounds like the two cancer projects are essentially independent. What
do you hope to gain by combining them? As far as I can tell, the main
advantage would be estimating dispersions from a larger pool of samples.
Obviously, this only makes sense if the dispersions are similar in the
two centers' datasets. You could try estimating dispersions separately
for each center and then plotting the two sets of per-gene estimates
against each other. If you get a reasonable clustering around the
identity line, then
you could probably justify combining the datasets for better dispersion
estimation.
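
As a rough illustration of that check, here is a minimal sketch in plain
Python. It is not edgeR's estimator: it uses a simple method-of-moments
negative binomial dispersion, (variance - mean) / mean^2, as a stand-in,
and it assumes the counts have already been made comparable in
sequencing depth. The function names and the toy counts are made up for
illustration. A per-gene log2 ratio near zero corresponds to a point
near the identity line in the plot described above.

```python
import math
import statistics


def mom_dispersion(counts):
    """Method-of-moments NB dispersion: (var - mean) / mean^2, floored at 0."""
    m = statistics.mean(counts)
    if m == 0:
        return 0.0
    v = statistics.variance(counts)  # sample variance, needs >= 2 samples
    return max((v - m) / (m * m), 0.0)


def log2_dispersion_ratios(center_a, center_b, floor=1e-4):
    """Per-gene log2(dispersion in A / dispersion in B) for shared genes.

    The floor keeps the ratio finite when an estimate is zero.
    """
    ratios = {}
    for gene in sorted(center_a.keys() & center_b.keys()):
        da = max(mom_dispersion(center_a[gene]), floor)
        db = max(mom_dispersion(center_b[gene]), floor)
        ratios[gene] = math.log2(da / db)
    return ratios


# Made-up toy counts (gene -> counts across samples), one table per center.
center_a = {"gene1": [12, 30, 7, 45], "gene2": [100, 140, 90, 210],
            "gene3": [3, 9, 1, 14]}
center_b = {"gene1": [15, 28, 5, 50], "gene2": [120, 95, 180, 230],
            "gene3": [2, 11, 4, 16]}

ratios = log2_dispersion_ratios(center_a, center_b)
summary = statistics.median([abs(r) for r in ratios.values()])
print("median |log2 ratio|:", summary)
```

With real data you would use the dispersions edgeR itself estimates for
each center and look at the whole scatter, not just one summary number.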
In any case, if you decide to combine the datasets from multiple
centers, you would probably want to use edgeR's GLM methods, not
exactTest, since you would want to use a design matrix that incorporates
effects for differences between centers (and, if more than one center
works on the same cancer, center-cancer interaction effects).
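
To make "a design matrix that incorporates effects for differences
between centers" concrete, here is a small sketch in plain Python of an
additive cancer + center design using treatment (dummy) coding, where
the first level of each factor is the reference. The helper name, the
sample labels and the level names are hypothetical, and interaction
terms are left out to keep it short. In a real analysis you would build
the matrix in R and pass it to edgeR's GLM functions.

```python
def design_matrix(samples, factors):
    """Build an intercept + dummy-coded design matrix.

    samples: list of dicts mapping factor name -> level.
    factors: factor names to include, in column order.
    Returns (column_names, rows); the first sorted level of each
    factor is the reference and gets no column.
    """
    columns = ["(Intercept)"]
    encoders = []
    for factor in factors:
        levels = sorted({s[factor] for s in samples})
        for level in levels[1:]:
            columns.append(f"{factor}{level}")
            encoders.append((factor, level))
    rows = [
        [1] + [1 if s[f] == lv else 0 for f, lv in encoders]
        for s in samples
    ]
    return columns, rows


# Hypothetical layout: cancer B was sequenced at both centers.
samples = [
    {"cancer": "A", "center": "1"},
    {"cancer": "B", "center": "1"},
    {"cancer": "B", "center": "2"},
]
columns, rows = design_matrix(samples, ["cancer", "center"])
print(columns)
for row in rows:
    print(row)
```

The center column lets the model attribute a shift shared by all genes'
samples from one center to that center instead of to the cancer type.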
So, assuming that you believe that combining the datasets will improve
your dispersion estimation, my answers to your questions would be:
(1) If the two studies have no groups in common, then comparisons
between a group from one study and a group from the other study will
probably not be meaningful, since the effects would be confounded with
the inter-study effects. However, comparing two groups from the same
study is perfectly valid.
(2) If different sequencing platforms were used, there are a few
problems that could arise. First, they might produce very different
library sizes, and while in theory edgeR should deal with this, in
practice larger differences in library size will probably cause more
problems, because there is more to correct for. Second, inter-platform
differences will be confounded with inter-center differences. But that's
OK, since you don't need to estimate either of those separately; the
center term in the design matrix absorbs both.
I'm sure there are other issues that I haven't thought of, so anyone
else please feel free to chime in.
(3) I see no reason that the above should not apply to microRNA
libraries. The total library size matters less than the per-gene
counts. If no miRNA has more than 10 reads in any sample, then you're
working from very little data and you should not expect very good
results.
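
The confounding in point (1) can be seen directly in the design matrix.
The sketch below, in plain Python with made-up group and center labels,
computes the rank of a design with one column per group plus a center
column. When every group belongs to exactly one center, the center
column is the sum of that center's group columns, so the matrix is rank
deficient and a between-study group difference cannot be separated from
the study effect. If one group is sequenced at both centers, the design
becomes full rank. The helper names are hypothetical.

```python
from fractions import Fraction

GROUPS = ["A", "B", "C", "D"]


def design_rows(samples):
    """One indicator column per group, plus an indicator for center2.

    samples: list of (group, center) pairs.
    """
    return [
        [1 if g == grp else 0 for grp in GROUPS]
        + [1 if c == "center2" else 0]
        for g, c in samples
    ]


def matrix_rank(rows):
    """Rank by exact Gauss-Jordan elimination over fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    ncols = len(m[0]) if m else 0
    rank = 0
    for col in range(ncols):
        pivot = next(
            (r for r in range(rank, len(m)) if m[r][col] != 0), None
        )
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# No groups in common: A, B only at center1; C, D only at center2.
nested = [("A", "center1"), ("B", "center1"),
          ("C", "center2"), ("D", "center2")]
# Same layout, but group B was also sequenced at center2.
shared = nested + [("B", "center2")]

nested_rank = matrix_rank(design_rows(nested))
shared_rank = matrix_rank(design_rows(shared))
print("columns:", len(GROUPS) + 1)
print("rank with no shared group:", nested_rank)
print("rank with group B shared:", shared_rank)
```

Within-study contrasts (A vs B, or C vs D) do not involve the center
column at all, which is why they remain valid in the nested layout.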
On Sun 17 Mar 2013 07:48:50 PM PDT, Santos [guest] wrote:
>
>
> How suitable is edgeR for analyzing RNA sequencing data obtained from
> multiple studies, possibly using multiple platforms?
>
> I am trying to compare mRNA sequencing data obtained for two different
> cancers by the Cancer Genome Atlas (TCGA) project. Different research
> teams are handling the work for the two different cancers, and TCGA
> regularly releases updated, 'level 3,' (within-cancer) RSEM-processed
> data for cancer-specific sub-projects (each with 200+ samples).
>
> I am trying to use edgeR for differential expression analyses with
> Exact test, using 'raw count' values in the two cancer data-sets as
> the input for edgeR. I plan to use edgeR with its default settings,
> except for prior.df in estimateTagwiseDisp() -- intend to use 0.5
> instead of 20 -- and, rowsum.filter in estimateCommonDisp() -- intend
> to use perhaps 500 instead of 5.
>
> (1) Is it OK to use edgeR for such cross-study comparison when the two
> groups I want to compare have been exclusively examined by just one of
> the two studies?
>
> (2) In my case, the sequencing platform is the same for the two
> studies. Had it been different, could I still use edgeR?
>
> (3) Do answers to the above two questions also apply for microRNA
> sequencing studies (where library [total count] sizes are typically
> 10-20x smaller)?
>
> Thank you.
>
> Santos
>
>
> -- output of sessionInfo():
>
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] grid stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] edgeR_3.0.8 limma_3.14.4 EBSeq_1.1.6
> [4] gplots_2.11.0 MASS_7.3-23 KernSmooth_2.23-9
> [7] caTools_1.14 gdata_2.12.0 gtools_2.7.0
> [10] blockmodeling_0.1.8 reshape2_1.2.2 plyr_1.8
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-4.2 stringr_0.6.2 tools_2.15.1
>
>