[BioC] GAGE and PATHVIEW packages

Mon Oct 7 18:37:17 CEST 2013

Hi Christian,
mol.sum is written to combine or select multiple entries/probes of the same gene/molecule into one value. It is meant to work on differential expression data, i.e. fold changes or t-statistics, rather than the original expression data, because it selects probes based on their variances.
For your original expression data, you may follow an approach similar to mol.sum. I would recommend, along the lines of "max.abs", using the probe set with the max variance as the representative of the gene.
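A minimal base-R sketch of that max-variance selection, for illustration only. The matrix expr, the probe names and the probe-to-gene vector gene are made up here, and this is not the code of mol.sum or of the gage vignette.

```r
# Hypothetical expression matrix: 5 probes (rows) x 4 samples (columns).
expr <- rbind(
  p1 = c(5.0, 5.1, 5.0, 5.2),
  p2 = c(4.0, 6.0, 3.5, 6.5),
  p3 = c(7.0, 7.1, 6.9, 7.0),
  p4 = c(2.0, 2.5, 2.1, 2.4),
  p5 = c(3.0, 5.0, 3.2, 5.1)
)
colnames(expr) <- paste0("S", 1:4)

# Hypothetical probe-to-gene annotation, named by probe ID.
gene <- c(p1 = "geneA", p2 = "geneA", p3 = "geneA",
          p4 = "geneB", p5 = "geneB")

# Variance of each probe across samples.
probe.var <- apply(expr, 1, var)

# For each gene, keep the ID of the probe with the largest variance.
keep <- vapply(split(names(probe.var), gene[names(probe.var)]),
               function(p) p[which.max(probe.var[p])],
               character(1))

# One row per gene, labelled by gene ID instead of probe ID.
expr.gene <- expr[keep, , drop = FALSE]
rownames(expr.gene) <- names(keep)
```

The resulting expr.gene has one row per gene, so each gene contributes a single variable to a downstream gene set analysis.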
In the gage package, we have a vignette named “Gene set and data preparation” that addresses your issue in detail in the section “Probe set ID conversion”. The vignette is available at: http://bioconductor.org/packages/2.13/bioc/vignettes/gage/inst/doc/dataPrep.pdf.
Weijun

--------------------------------------------
On Mon, 10/7/13, Christian De Santis <christian.desantis at stir.ac.uk> wrote:

 Subject: RE: GAGE and PATHVIEW packages

 Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
 Date: Monday, October 7, 2013, 7:49 AM

 Hi Weijun,

 Thanks for your prompt reply. It was very helpful to clarify
 my doubts, although it generated one more. 

 "mol.sum" is an excellent function, thanks for pointing
 it out. The default sum.method for this function is "sum". I
 am not sure what "sum" computes exactly (and being a
 novice I have difficulty reading the code directly),
 but I assume that it returns the sum of the
 intensities associated with replicate IDs. The reason I
 am asking is that I am using arrays with an unbalanced
 number of replicate probes (i.e. 3 for gene A, 6 for gene
 B, etc.). I have the feeling that the "sum" option would, in
 my case, put a greater weight on those pathways whose core
 genes have more probes on the array (i.e. gene B). I tried two
 different methods to test my hypothesis, and by using "sum"
 I indeed got one of our target pathways called significant
 in the top 3, while it does not show up when using "mean", for
 example (most other pathways are consistent). I would
 appreciate it if you could help me clarify this doubt and make
 a decision. Am I correct, based on the design of my arrays,
 to avoid choosing the method "sum"?

 This should solve most of my doubts about your packages for
 now. Thanks again very much for your help. 

 Best regards,
 Christian

 -----Original Message-----

 Sent: 07 October 2013 01:11
 To: Christian De Santis
 Cc: bioconductor at r-project.org
 Subject: Re: GAGE and PATHVIEW packages

 Hi Christian,
 Please see my point-by-point answers below.
 HTH,
 Weijun

 --------------------------------------------
 On Fri, 10/4/13, Christian De Santis <christian.desantis at stir.ac.uk>
 wrote:

  Subject: GAGE and PATHVIEW packages

 "bioconductor at r-project.org"
 <bioconductor at r-project.org>
  Date: Friday, October 4, 2013, 11:27 AM

  Dear Luo and list,

 > I am successfully using GAGE and pathview for my
 analyses and I like the package a lot. So, thanks for
 developing it. I have some points on which I would
 appreciate some help and/or clarification.

 Thanks for the comments.

 > AVERAGE VALUE - The first time I ran the analysis with
 GAGE, I used the same setup parameters as the
 example you prepared in the manual. I have 8
 replicates per treatment and I initially used unique
 column names for each sample (i.e. “DIET02_1”,
 “DIET02_2”, “DIET02_3”, etc.) as per your example with HN
 and DCIS. However, I have discovered (following an
 accidental mistake) that if, instead of having unique names,
 samples are named after the treatment they belong to (i.e.
 “DIET02” for all 8 replicates), the subsequent
 gage analysis generates one single value for that
 treatment. By comparing the p-values of both the above
 cases I have found that they are identical. Am I
 correct to assume that in the latter case every value
 assigned to the treatment is an average of the
 replicates?

 It is the average, i.e. the p-value is the geometric mean,
 while the statistic is the mean of the columns with the same
 name. The averaging mechanism is there to accommodate special
 needs or mistakes, but it is not recommended to use the same
 name for replicate samples.
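 To make the two kinds of average concrete, here is a small base-R sketch. The numbers are invented and the code only illustrates the arithmetic described above; it is not taken from the gage source.

```r
# Hypothetical per-sample results for one gene set: three replicate
# columns that all carry the same name "DIET02".
stat <- c(DIET02 = 2.1, DIET02 = 1.8, DIET02 = 2.4)
pval <- c(DIET02 = 0.01, DIET02 = 0.04, DIET02 = 0.02)

# Statistic: arithmetic mean across the same-named columns.
stat.avg <- mean(stat)

# P-value: geometric mean, computed on the log scale.
pval.avg <- exp(mean(log(pval)))
```

 The geometric mean keeps the combined p-value on the same order of magnitude as the individual ones, whereas an arithmetic mean would be dominated by the largest p-value.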

 > DUPLICATE PROBES – My array has got several
 duplicate or triplicate probes which are correctly
 annotated with the same KO number. How are these
 probes handled by the gage analysis? For example, if I
 have three probes for my gene X which are annotated
 with the same KO number, are these going to be counted
 3 times into the “set size”? Or are the values for
 that KO number going to be merged into one?

 Duplicate probes will be counted multiple times, which is
 not good, because gene set analysis like GAGE really assumes
 one independent variable per gene. You may summarize over
 duplicate probes before feeding the data into GAGE. You can check
 ?mol.sum in the pathview package for that.
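 As a rough illustration of why the choice of summary matters when genes have unequal numbers of probes, the base-R sketch below collapses made-up fold changes per gene in three ways. The vectors fc and gene and the helper max.abs are hypothetical; they mimic the "sum", "mean" and "max.abs" options named in this thread and are not the code of mol.sum.

```r
# Hypothetical log2 fold changes for 9 probes: 3 for geneA, 6 for geneB.
fc   <- c(1.0, 1.2, 0.8, 0.5, 0.4, 0.6, 0.5, 0.5, 0.5)
gene <- rep(c("geneA", "geneB"), times = c(3, 6))

# Helper: value with the largest absolute magnitude, sign preserved.
max.abs <- function(x) x[which.max(abs(x))]

# Collapse probes to one value per gene with each strategy.
by.sum  <- tapply(fc, gene, sum)      # grows with the number of probes
by.mean <- tapply(fc, gene, mean)     # independent of the probe count
by.max  <- tapply(fc, gene, max.abs)  # picks a single representative probe
```

 In this toy data the sum for geneB equals the sum for geneA even though each geneB probe changes about half as much, which is the weighting effect Christian suspected. The mean and max.abs summaries do not depend on how many probes a gene has.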

 > “COMPARE” argument of the “gage”
 function – My experiment consists of 5 treatments (x
 8 replicates). None of the treatments is a
 proper “control”. Is it correct if I use
 “1ongrp” as the argument, choosing one of the treatments
 as a ref? I have also tried the “as.group”
 option, but when I look at the results I do not get a
 comparison of the chosen reference with the remaining
 groups, but instead one single value named “exp1”.
 I have also tried “paired”,
 which gives completely different results.

 If you set ref or samp to anything other than NULL, GAGE assumes
 it is a two-state comparison. The compare argument may take one
 of the values 1ongrp, paired, unpaired or as.group, based on your
 needs. They are all for two-state comparisons, and differ in how
 the comparison is done, depending on whether your samples are
 paired or not, etc. If you want to do a multiple-state
 comparison/test, you should do it before GAGE on
 each gene, then feed the single-column results into gage
 with “ref = NULL, samp = NULL”. If you want to do a
 two-state comparison, you should specify a control state:
 either all 4 groups other than your group of interest, or
 the median of all groups for each gene.
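 One way to read "the median of all groups for each gene" is sketched below in base R. The data are simulated, the object names (grp, expr, grp.mean, ctrl, rel) are made up, and taking the median over per-treatment means is an assumption of this sketch rather than something the reply specifies.

```r
set.seed(1)

# Hypothetical data: 4 genes x 15 samples, 5 treatments x 3 replicates.
grp  <- rep(paste0("DIET0", 1:5), each = 3)
expr <- matrix(rnorm(4 * 15, mean = 8), nrow = 4,
               dimnames = list(paste0("gene", 1:4), make.unique(grp)))

# Per-gene mean within each treatment (genes x treatments).
grp.mean <- sapply(split(seq_along(grp), grp),
                   function(j) rowMeans(expr[, j, drop = FALSE]))

# Pseudo-control: per-gene median of the five treatment means.
ctrl <- apply(grp.mean, 1, median)

# Difference of each treatment from the pseudo-control, gene by gene.
rel <- grp.mean - ctrl
```

 How ctrl is then handed to gage is left open here: it could serve as a reference column, or the pre-computed differences in rel could be passed one column at a time with ref = NULL and samp = NULL, as described above for single-column results.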

 > HEATMAP OUTPUT of “esset.grp” function
 – Is there any quick way to generate an output
 heatmap (as for sigGeneSet) removing the redundant
 pathways identified with the function “esset.grp”? At
 the moment I am doing this manually and plotting the
 results with heatmap.2 from gplots. Is this the only way?

 You can do this quickly using esset.grp + sigGeneSet,
 assuming you follow the examples until you get
 gse16873.kegg.esg.up and gse16873.kegg.esg.dn:
 # pool the essential (non-redundant) sets from both directions
 ess.sets=c(gse16873.kegg.esg.up$essentialSets, gse16873.kegg.esg.dn$essentialSets)
 # keep only those sets in each table of the gage result
 gse16873.kegg.p.ess=lapply(gse16873.kegg.p, function(x) x[ess.sets,])
 # select the significant sets and draw the heatmap
 gse16873.kegg.sig.ess=sigGeneSet(gse16873.kegg.p.ess, outname="gse16873.kegg.ess")

  Any help on the above would be greatly appreciated.

  Regards.
  Christian De Santis

  The University
  of Stirling has been ranked in the top 12 of UK
 universities  for graduate employment*.
  94% of
  our 2012 graduates were in work and/or further study
 within  six months of graduation.
  *The
  Telegraph
  The University of
  Stirling is a charity registered in Scotland, number
 SC  011159.
