[BioC] Duplicate gene names after summarization with RMA (hugene.1.0.st.v1)

Mon Apr 7 16:43:36 CEST 2014

Hi Ed,

On 4/6/2014 4:24 PM, Ed O'Donnell [guest] wrote:
> Hello,
>
> I am new to analyzing array files. I am attempting to generate a CSV file that contains a gene symbol and RMA-processed expression data for a set of arrays for input into an online pathway ID tool (TNBCtype, http://cbc.mc.vanderbilt.edu/tnbc/).
>
> My problem/question (not sure if it is either, or I don't understand the process correctly):

It's the latter. You are not summarizing at the gene level, but at the 
transcript level (hence the hugene10sttranscriptcluster.db, not 
hugene10sttgene.db). In other words, there may be multiple probesets on 
the array that are intended to measure different transcript variants for 
the same gene. As an example, for ESR1 there are apparently two 
probesets that interrogate two different transcripts (per the ENSEMBL 
transcript IDs):

8122840 ENST00000338799
8122843 ENST00000440973

And if you go to the respective websites for these two ENSEMBL IDs, you 
can see that these are very different transcripts.

As far as I can tell, the vast majority of people take transcript-level 
data and flatten it to the gene level (as you are doing), and then look 
for differences in quantity without regard to the form of the 
transcript, in which case you will have 'duplicate' genes.

If it is important to have only one entry per gene, then you can use 
the findLargest function in genefilter, which picks the probeset with 
the largest value of a chosen statistic for each gene, or you could use 
the MBNI re-mapped CDFs based on Entrez Gene, which map the probesets to 
the gene level. But note that you will need to use the affy package for 
analysis with the MBNI CDF packages.
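
To make the one-probeset-per-gene idea concrete, here is a minimal base R
sketch. It is not the findLargest implementation, just a stand-in for the
same idea, and it assumes that mean expression across arrays is an
acceptable statistic for choosing among duplicates. The two ESR1 IDs are
the ones listed above; the third ID, the symbol GENE_X, and all
expression values are made up for illustration.

```r
# Toy expression matrix: rows are transcript clusters, columns are arrays.
# Values are invented; "7900001" / "GENE_X" are hypothetical placeholders.
expr <- matrix(c(7.1, 7.3,
                 5.0, 5.2,
                 9.4, 9.6),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("8122840", "8122843", "7900001"),
                               c("array1", "array2")))
symbol <- c("ESR1", "ESR1", "GENE_X")

# Statistic used to choose among duplicates (an assumption of this sketch):
# the mean expression of each row across all arrays.
stat <- rowMeans(expr)

# For each symbol, keep the row ID with the largest statistic.
keep <- vapply(split(names(stat), symbol),
               function(ids) ids[which.max(stat[ids])],
               character(1))

# One row per gene symbol, labelled by symbol instead of probeset ID.
collapsed <- expr[keep, , drop = FALSE]
rownames(collapsed) <- names(keep)
```

With these toy values, ESR1 is represented by 8122840, since its mean
(7.2) is larger than that of 8122843 (5.1). With real data you would take
the symbols from your annotation step and drop rows with a missing symbol
first.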

Best,

Jim

>
> when I am exporting the CSV file, there are duplicate entries for some gene names (e.g. ESR1). I am under the impression that RMA and the process I am using (target = 'core') summarizes at the gene level, so I am not sure why I am getting duplicate entries for certain (not all) genes after writing the expression file. I have gone through this process with some mouse array data (mouse gene 10 st arrays) and have not run into this problem of duplicate gene names.
>
> Any insights on what I might be doing incorrectly, or in understanding the output I should expect, would be greatly appreciated.
>
> Is averaging the values of these instances of duplicate gene names a valid thing to do?
>
> Thank you!
>
> -Ed O'Donnell
> postdoctoral scholar
> Oregon State University
>
> My commands (Analysis.R), run as source("Analysis.R"):
> ---------------------
>
> #install packages for analysis of the human array (hugene.1.0.st.v1)
>
> source("http://bioconductor.org/biocLite.R")
> biocLite("hugene10sttranscriptcluster.db")
> biocLite("oligo")
> biocLite("annotate")
>
> #load required packages
>
> library(oligo)
> library(hugene10sttranscriptcluster.db)
> library(annotate)
>
> #set wd to myworkingdirectory
>
> setwd("myworkingdirectory")
>
> #read in the raw data from the CEL files in the working directory
>
> rawData <- read.celfiles(list.celfiles())
>
> #rma normalization
>
> rmaCore <- rma(rawData, target = 'core')
>
> #annotation
>
> ID <- featureNames(rmaCore)
> Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
> Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))
>
> #make a temporary data frame with all the identifiers...
>
> tmpframe <- data.frame(ID = ID, Symbol = Symbol, Name = Name, stringsAsFactors = FALSE)
> tmpframe[tmpframe=="NA"] <- NA
>
> #assign data frame to rma-results
>
> fData(rmaCore) <- tmpframe
>
> #expression table with gene name and annotation info, processed with sed after export to get the quotations in the right spot and remove NA lines
>
> write.table(cbind(pData(featureData(rmaCore))[,"Symbol"],exprs(rmaCore)),file="better_annotation.csv", quote = FALSE, sep = ",")
>
> ----------
>
>
>
>   -- output of sessionInfo():
>
> R version 3.0.3 (2014-03-06)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] pd.hugene.1.0.st.v1_3.8.0            gplots_2.12.1
>   [3] annotate_1.40.1                      hugene10sttranscriptcluster.db_8.0.1
>   [5] org.Hs.eg.db_2.10.1                  RSQLite_0.11.4
>   [7] DBI_0.2-7                            AnnotationDbi_1.24.0
>   [9] limma_3.18.13                        oligo_1.26.6
> [11] Biostrings_2.30.1                    XVector_0.2.0
> [13] IRanges_1.20.7                       Biobase_2.22.0
> [15] oligoClasses_1.24.0                  BiocGenerics_0.8.0
> [17] BiocInstaller_1.12.0
>
> loaded via a namespace (and not attached):
>   [1] affxparser_1.34.2     affyio_1.30.0         bit_1.1-11
>   [4] bitops_1.0-6          caTools_1.16          codetools_0.2-8
>   [7] ff_2.2-12             foreach_1.4.1         gdata_2.13.2
> [10] GenomicRanges_1.14.4  gtools_3.3.1          iterators_1.0.6
> [13] KernSmooth_2.23-12    preprocessCore_1.24.0 splines_3.0.3
> [16] stats4_3.0.3          tcltk_3.0.3           tools_3.0.3
> [19] XML_3.95-0.2          xtable_1.7-3          zlibbioc_1.8.0
>
> --
> Sent via the guest posting facility at bioconductor.org.

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099