[BioC] Reading Affy CEL files

Fri May 31 19:05:24 CEST 2013

Hi Ranjani,

On 5/31/2013 12:53 PM, Ranjani R [guest] wrote:
> I am a newbie to Affy. Thanks for your help.
>
> I am processing CEL files through R (Affy package) and am having some basic issues that I am not finding satisfactory answers to (have googled).
> The chip used is hugene11stv1. I also am using the hugene11stprobeset.db to try to do probeset -> Symbol translation.
> Essentially, I want to create a file with gene expression data, with  genes * samples as my final matrix.
>
> Code:
> setwd(wDir);
> Data<- ReadAffy();
> eset<- rma(Data);
> write.exprs(eset,file="geneExpData.txt", sep="\t", quote = F);
>
> When I analyze the file written, I see that the number of columns is as I expect(number samples) but there are 33,297 genes.
> Please help me understand a few fundamental aspects here:
>
> 1. I tried translating these Affy IDs to gene symbols to see if that would make my analysis easier.
>      Here are some things I tried
>
>      Try 1:
>      symbols<- getSYMBOL(as.character(expr.matrix[,1]), "hugene11stprobeset"); ->  Not quite working. Only ~175 of the probeset IDs are getting translated.

There are two problems here. First, the affy package isn't designed for 
this array, and in fact won't let you proceed if you upgrade to the new 
version of Bioconductor. You should really be using either oligo or xps 
(both BioC packages) for the analysis of this array.

Second, the affy package can only summarize these arrays at the
transcript level, but you are annotating with a package that assumes
you have summarized at the probeset level (where each probeset
interrogates only a smaller portion of the transcript, often just a
single exon). Your row IDs are therefore transcript-level IDs, and most
of them have no entry in the probeset-level package, which would
explain the small number of matches and the many NAs. To annotate
transcript-level data, you need the hugene11sttranscriptcluster.db
package.
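
As a rough sketch (not tested on your data), the oligo route could look
something like the following. It assumes that oligo, the
pd.hugene.1.1.st.v1 platform design package, hugene11sttranscriptcluster.db,
and a reasonably recent AnnotationDbi (one that provides mapIds) are
installed. The helper names summarize_transcripts, lookup_symbols,
label_features, and write_annotated are made up for illustration and
are not part of any package.

```r
# Sketch only. Assumed Bioconductor packages: oligo, pd.hugene.1.1.st.v1,
# hugene11sttranscriptcluster.db, AnnotationDbi, Biobase.

# Read every CEL file in a directory and run RMA at the
# transcript-cluster ("core") level.
summarize_transcripts <- function(cel_dir) {
  cel_files <- list.files(cel_dir, pattern = "\\.cel$",
                          ignore.case = TRUE, full.names = TRUE)
  raw <- oligo::read.celfiles(cel_files)
  oligo::rma(raw, target = "core")
}

# Map transcript cluster IDs to gene symbols; IDs without a symbol give NA.
lookup_symbols <- function(ids) {
  db <- hugene11sttranscriptcluster.db::hugene11sttranscriptcluster.db
  unname(AnnotationDbi::mapIds(db, keys = ids, column = "SYMBOL",
                               keytype = "PROBEID", multiVals = "first"))
}

# Use the symbol where one exists, otherwise fall back to the original ID.
label_features <- function(ids, symbols) {
  ifelse(is.na(symbols), ids, symbols)
}

# Write a tab-delimited table: ID, symbol (or ID), then one column per sample.
write_annotated <- function(cel_dir, out_file) {
  eset <- summarize_transcripts(cel_dir)
  ids <- Biobase::featureNames(eset)
  out <- data.frame(ID = ids,
                    Symbol = label_features(ids, lookup_symbols(ids)),
                    Biobase::exprs(eset),
                    check.names = FALSE)
  write.table(out, file = out_file, sep = "\t",
              quote = FALSE, row.names = FALSE)
}
```

You would then call something like write_annotated(wDir,
"geneExpData.txt"). Keeping the ID column next to the symbol is
deliberate: many transcript clusters (controls, unannotated
transcripts) have no symbol, and several clusters can map to the same
symbol, so the symbol alone does not uniquely identify a row.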

Best,

Jim

>      Try 2:
>      symbs<- mget(featureNames(eset), hugene11stprobesetSYMBOL, ifnotfound =NA);
>      symbs<- unlist(symbs)
>      mat<- eset; # make a copy
>      featureNames(mat)<- ifelse(!is.na(symbs), symbs, featureNames(mat))
>
>      Many NAs.
>
> Can you please help me understand what is happening here.
>
>
>   -- output of sessionInfo():
>
> R version 2.15.3 (2013-03-01)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] hugene11stv1cdf_2.3.0 affy_1.36.1           Biobase_2.18.0
> [4] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.26.0         BiocInstaller_1.8.3   preprocessCore_1.20.0
> [4] tools_2.15.3          zlibbioc_1.4.0
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099