[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Thu Mar 7 17:03:38 CET 2013

Wow. This is really an education on the vast unwashed underbelly of 
BioC, no?

There is a file called MoGene-2_0-st.mps that came in the zip file you 
downloaded. Add

mps <- list.files(baseDir, pattern = "mps$", full.names = TRUE)

and then

coreMps = mps

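when you create your AffyGenePDInfoPkgSeed. The MPS (meta-probeset) file is 
used to distinguish between the probeset and transcript probe mappings. 
Your log shows "Core MPS....: coreMps", i.e. the slot was still at its 
placeholder value, so the builder tried to open a file literally named 
'coreMps'.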
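
Put together, the seed could be built roughly as sketched below. Treat 
this as an untested outline rather than a recipe: find_one() and 
make_mogene_seed() are hypothetical helper names, the anchored file 
patterns are my own choice, and the transFile line is an assumption 
based on your log printing the placeholder "TheTranscriptFile" for the 
transcript file.

```r
# Hypothetical helper: return the single file in `dir` matching `pattern`,
# and fail early instead of handing a bad path to the package builder.
find_one <- function(dir, pattern) {
  hits <- list.files(dir, pattern = pattern, full.names = TRUE)
  if (length(hits) != 1L) {
    stop("expected exactly one file matching '", pattern, "' in ", dir,
         ", found ", length(hits))
  }
  hits
}

# Hypothetical wrapper around the seed construction discussed in this thread.
# Requires library(pdInfoBuilder) to be attached before it is called.
make_mogene_seed <- function(baseDir, author, email) {
  new("AffyGenePDInfoPkgSeed",
      pgfFile   = find_one(baseDir, "\\.pgf$"),
      clfFile   = find_one(baseDir, "\\.clf$"),
      probeFile = find_one(baseDir, "\\.probeset\\.csv$"),
      # Assumption: the transcript CSV from the same download is needed too.
      transFile = find_one(baseDir, "\\.transcript\\.csv$"),
      coreMps   = find_one(baseDir, "\\.mps$"),
      author = author, email = email,
      biocViews = "AnnotationData",
      organism = "Mouse", species = "Mus musculus")
}

# Usage (needs pdInfoBuilder and the unzipped Affymetrix library files):
# library(pdInfoBuilder)
# seed <- make_mogene_seed(baseDir, "Your Name", "you@example.org")
# makePdInfoPackage(seed, destDir = ".")
```

The patterns are anchored with "$" and an escaped dot so that a stray 
file with "mps" or "pgf" somewhere in its name is not picked up by 
accident, and the helper stops before the build starts if a file is 
missing.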

Best,

Jim

On 3/7/2013 10:36 AM, Naxerova, Kamila wrote:
> Thanks Jim. Of course the annotation package does not contain probe-to-probeset information. What was I thinking?!
>
> What I had not realized was that I needed to build the pd.mogene.2.0.st package myself first, because it also does not exist on Bioconductor. So I just downloaded all the required files from Affy, but again I am stuck with an error message I don't understand... what is the coreMPS file that gives me the error?
>
>> library(pdInfoBuilder)
>> baseDir<- "/Users/naxerova/Documents/xxx"
>> (pgf<- list.files(baseDir, pattern = ".pgf",
> + full.names = TRUE))
> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st.pgf"
>> (clf<- list.files(baseDir, pattern = ".clf",
> + full.names = TRUE))
> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st.clf"
>> (prob<- list.files(baseDir, pattern = ".probeset.csv",
> + full.names = TRUE))
> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st-v1.na33.mm10.probeset.csv"
>> seed<- new("AffyGenePDInfoPkgSeed",
> + pgfFile = pgf, clfFile = clf,
> + probeFile = prob, author = "Kamila Naxerova",
> + email = "naxerova at fas.harvard.edu",
> + biocViews = "AnnotationData",
> + organism = "Mouse", species = "Mus Musculus")
>> makePdInfoPackage(seed, destDir = ".")
> ===============================================================================================================================================
> Building annotation package for Affymetrix Gene ST Array
> PGF.........: MoGene-2_0-st.pgf
> CLF.........: MoGene-2_0-st.clf
> Probeset....: MoGene-2_0-st-v1.na33.mm10.probeset.csv
> Transcript..: TheTranscriptFile
> Core MPS....: coreMps
> ===============================================================================================================================================
> Parsing file: MoGene-2_0-st.pgf... OK
> Parsing file: MoGene-2_0-st.clf... OK
> Creating initial table for probes... OK
> Creating dictionaries... OK
> Parsing file: MoGene-2_0-st-v1.na33.mm10.probeset.csv... OK
> Parsing file: coreMps... Error in file(file, "rt") : cannot open the connection
> In addition: Warning message:
> In file(file, "rt") : cannot open file 'coreMps': No such file or directory
>
> On Mar 7, 2013, at 10:06 AM, "James W. MacDonald"<jmacdon at uw.edu>  wrote:
>
>> Hi Kamila,
>>
>> On 3/7/2013 9:54 AM, Naxerova, Kamila wrote:
>>> Dear all,
>>>
>>> I am afraid I have to ask for help with the Mouse Gene 2.0 ST annotation package one more time. It looked like I had created it successfully, but when I try to use it to read in CEL files with the oligo package, I get a cryptic error message. Any suggestions would be much appreciated!
>> You don't use the annotation package at this step. There are two
>> packages that are used for the analysis of this chip type. The first is
>> the pd.mogene.2.0.st.v1 package, which is used by oligo to map probes to
>> probesets when doing the normalization/summarization step. This package
>> will be automagically installed if you don't have it, so there is
>> nothing to be done at the first step but
>>
>> abatch<- read.celfiles(list.celfiles())
>> eset<- rma(abatch)
>>
>> This will give you the summarized and normalized data at the transcript
>> level. You then will normally fit some model(s) using the modeling
>> package of your choice, and then might want to output a set of
>> significant genes, at which time you will use the
>> mogene20sttranscriptcluster.db package to map probeset IDs to gene
>> information.
>>
>> Best,
>>
>> Jim
>>
>>
>>>> abatch<- read.celfiles(list.celfiles(),pkgname="mogene20sttranscriptcluster.db")
>>> Platform design info loaded.
>>> Reading in : xxx.CEL
>>> Reading in : xxx.CEL
>>> Reading in : xxx.CEL
>>> [... more cel files listed]
>>>
>>> Error in function (classes, fdef, mtable)  :
>>>    unable to find an inherited method for function ‘kind’ for signature ‘"ChipDb"’
>>>
>>> Thanks
>>> Kamila
>>>
>>> On Mar 6, 2013, at 6:16 PM, "Naxerova, Kamila"<naxerova at fas.harvard.edu>   wrote:
>>>
>>>> Dear Christian and Jim,
>>>>
>>>> many thanks to both of you for your explanations.
>>>>
>>>> Your hard work paid off, and I have finally understood everything and managed to build my annotation package!!!! I wrote a little script similar to what Jim was suggesting, namely picking the first RefSeq-like thing I came across. Jim called it "naive" -- but I think there is no downside to this approach, right? I have looked at various examples in the Affy file for a long time, and simply picking the first RefSeq ID seems to be kosher.
>>>>
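>>>> data <- read.csv("MoGene-transcript-noheader.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
>>>> # Keep the ID column (1) and the column holding the "///"-separated assignments (9)
>>>> sdata <- data[, c(1, 9)]
>>>>
>>>> # Return the first RefSeq-like accession in an assignment string, or NA if there is none
>>>> returnRef <- function(x) {
>>>> 	parts <- strsplit(x, split = "///", fixed = TRUE)[[1]]
>>>> 	refst <- parts[grep("RefSeq", parts)[1]]
>>>> 	# The accession is the first "//"-separated field; strip the padding spaces
>>>> 	refid <- gsub(" ", "", strsplit(refst, split = "//", fixed = TRUE)[[1]][1])
>>>> 	return(refid)
>>>> }
>>>>
>>>> sdata$refseqids <- sapply(sdata[, 2], returnRef)
>>>> fdata <- sdata[, -2]
>>>> write.table(fdata, "AnnotBuild.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)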
>>>>
>>>> library(AnnotationForge)
>>>> library(mouse.db0)
>>>> library(org.Mm.eg.db)
>>>> makeDBPackage("MOUSECHIP_DB",
>>>> affy = FALSE,
>>>> prefix="mogene20sttranscriptcluster",
>>>> fileName="AnnotBuild.txt",
>>>> outputDir = ".",
>>>> version="2.11.1",
>>>> baseMapType="refseq",
>>>> manufacturer = "Affymetrix",
>>>> chipName = "Mouse Gene 2.0 ST Array",
>>>> manufacturerUrl = "http://www.affymetrix.com",
>>>> author = "Kamila Naxerova",
>>>> maintainer = "Kamila Naxerova<naxerova at fas.harvard.edu>")
>>>>
>>>>> install.packages("mogene20sttranscriptcluster.db",repos=NULL, type="source")
>>>> * installing *source* package ‘mogene20sttranscriptcluster.db’ ...
>>>> ** R
>>>> ** inst
>>>> ** preparing package for lazy loading
>>>> ** help
>>>> *** installing help indices
>>>> ** building package indices
>>>> ** testing if installed package can be loaded
>>>> *** arch - i386
>>>> *** arch - x86_64
>>>>
>>>> * DONE (mogene20sttranscriptcluster.db)
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> -- 
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
