[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Fri Mar 8 04:04:23 CET 2013

Hi Kamila et. al.,

FYI: I'll ensure that the pd packages for the 2.0 versions of the chip
are available for the next BioC release.

benilton

2013/3/7 Naxerova, Kamila <naxerova at fas.harvard.edu>:
> Haha, I don't mind rubbing against that belly. It's like playing a game... tyring to figure out how to get to the 40th level to face the "final enemy".
>
> Your strategy worked, thank you! I am including all code up to RMA normalization (yay! I am there!) below, perhaps it will save somebody a few hours of work.
>
> library(pdInfoBuilder)
>
> baseDir <- "/Users/naxerova/Documents/xxx"
> (pgf <- list.files(baseDir, pattern = ".pgf",
> full.names = TRUE))
> (clf <- list.files(baseDir, pattern = ".clf",
> full.names = TRUE))
> (prob <- list.files(baseDir, pattern = ".probeset.csv",
> full.names = TRUE))
> mps <- list.files(baseDir, pattern = "mps$", full.names = TRUE)
> trans <- list.files(baseDir, pattern="transcript",full.names=TRUE)
>
> seed <- new("AffyGenePDInfoPkgSeed",
> pgfFile = pgf, clfFile = clf, coreMps=mps, transFile=trans,
> probeFile = prob, author = "Kamila Naxerova",
> email = "naxerova at fas.harvard.edu",
> biocViews = "AnnotationData",
> organism = "Mouse", species = "Mus Musculus")
> makePdInfoPackage(seed, destDir = ".")
>
> ## This is what the beginning of your output should look like
> Building annotation package for Affymetrix Gene ST Array
> PGF.........: MoGene-2_0-st.pgf
> CLF.........: MoGene-2_0-st.clf
> Probeset....: MoGene-2_0-st-v1.na33.mm10.probeset.csv
> Transcript..: MoGene-2_0-st-v1.na33.mm10.transcript.csv
> Core MPS....: MoGene-2_0-st.mps
>
> install.packages("/Users/naxerova/pd.mogene.2.0.st/", repos=NULL, type="source")
>
>> abatch <- read.celfiles(list.celfiles())
> Loading required package: pd.mogene.2.0.st
> Platform design info loaded.
> Reading in : xxx.CEL
> [etc.]
>> eset <- rma(abatch)
> Background correcting
> Normalizing
> Calculating Expression
>
> On Mar 7, 2013, at 11:03 AM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
>> Wow. This is really an education on the vast unwashed underbelly of
>> BioC, no?
>>
>> There is a file called MoGene-2_0-st.mps that came in the zip file you
>> downloaded. Add
>>
>> mps <- list.files(baseDir, pattern = "mps$", full.names = TRUE)
>>
>> and then
>>
>> coreMps = mps
>>
>> when you create your AffyGenePDInfoPkgSeed. This file is used to
>> distinguish between the probeset and transcript probe mappings.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> On 3/7/2013 10:36 AM, Naxerova, Kamila wrote:
>>> Thanks Jim. Of course the annotation package does not contain probe -->  probe set information. What was I thinking?!??
>>>
>>> What I had not realized was that I needed to build the pd.mogene.2.0.st package myself first, because it also does not exist on Bioconductor. So I just downloaded all the required files from Affy, but again I am stuck with an error message I don't understand... what is the coreMPS file that gives me the error?
>>>
>>>> library(pdInfoBuilder)
>>>> baseDir<- "/Users/naxerova/Documents/xxx"
>>>> (pgf<- list.files(baseDir, pattern = ".pgf",
>>> + full.names = TRUE))
>>> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st.pgf"
>>>> (clf<- list.files(baseDir, pattern = ".clf",
>>> + full.names = TRUE))
>>> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st.clf"
>>>> (prob<- list.files(baseDir, pattern = ".probeset.csv",
>>> + full.names = TRUE))
>>> [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st-v1.na33.mm10.probeset.csv"
>>>> seed<- new("AffyGenePDInfoPkgSeed",
>>> + pgfFile = pgf, clfFile = clf,
>>> + probeFile = prob, author = "Kamila Naxerova",
>>> + email = "naxerova at fas.harvard.edu",
>>> + biocViews = "AnnotationData",
>>> + organism = "Mouse", species = "Mus Musculus")
>>>> makePdInfoPackage(seed, destDir = ".")
>>> ===============================================================================================================================================
>>> Building annotation package for Affymetrix Gene ST Array
>>> PGF.........: MoGene-2_0-st.pgf
>>> CLF.........: MoGene-2_0-st.clf
>>> Probeset....: MoGene-2_0-st-v1.na33.mm10.probeset.csv
>>> Transcript..: TheTranscriptFile
>>> Core MPS....: coreMps
>>> ===============================================================================================================================================
>>> Parsing file: MoGene-2_0-st.pgf... OK
>>> Parsing file: MoGene-2_0-st.clf... OK
>>> Creating initial table for probes... OK
>>> Creating dictionaries... OK
>>> Parsing file: MoGene-2_0-st-v1.na33.mm10.probeset.csv... OK
>>> Parsing file: coreMps... Error in file(file, "rt") : cannot open the connection
>>> In addition: Warning message:
>>> In file(file, "rt") : cannot open file 'coreMps': No such file or directory
>>>
>>>
>>>
>>>
>>>
>>> On Mar 7, 2013, at 10:06 AM, "James W. MacDonald"<jmacdon at uw.edu>  wrote:
>>>
>>>> Hi Kamila,
>>>>
>>>> On 3/7/2013 9:54 AM, Naxerova, Kamila wrote:
>>>>> Dear all,
>>>>>
>>>>> I am afraid I have to ask for help with the Mouse Gene 2.0 ST annotation package one more time. It looked like I created it successfully, but when I try to use it to read in cel files with the oligo package, I get a cryptic error message. Any suggestions would be much appreciated!
>>>> You don't use the annotation package at this step. There are two
>>>> packages that are used for the analysis of this chip type. The first is
>>>> the pd.mogene.2.0.st.v1 package, which is used by oligo to map probes to
>>>> probesets when doing the normalization/summarization step. This package
>>>> will be automagically installed if you don't have it, so there is
>>>> nothing to be done at the first step but
>>>>
>>>> abatch<- read.celfiles(list.celfiles())
>>>> eset<- rma(abatch)
>>>>
>>>> This will give you the summarized and normalized data at the transcript
>>>> level. You then will normally fit some model(s) using the modeling
>>>> package of your choice, and then might want to output a set of
>>>> significant genes, at which time you will use the
>>>> mogene20sttranscriptcluster.db package to map probeset IDs to gene
>>>> information.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>>> abatch<- read.celfiles(list.celfiles(),pkgname="mogene20sttranscriptcluster.db")
>>>>> Platform design info loaded.
>>>>> Reading in : xxx.CEL
>>>>> Reading in : xxx.CEL
>>>>> Reading in : xxx.CEL
>>>>> [... more cel files listed]
>>>>>
>>>>> Error in function (classes, fdef, mtable)  :
>>>>>   unable to find an inherited method for function ‘kind’ for signature ‘"ChipDb"’
>>>>>
>>>>> Thanks
>>>>> Kamila
>>>>>
>>>>> On Mar 6, 2013, at 6:16 PM, "Naxerova, Kamila"<naxerova at fas.harvard.edu>   wrote:
>>>>>
>>>>>> Dear Christian and Jim,
>>>>>>
>>>>>> many thanks to both of you for your explanations.
>>>>>>
>>>>>> Your hard work paid off, and I have finally understood everything and managed to build my annotation package!!!! I wrote a little script similar to what Jim was suggesting, namely picking the first RefSeq-like thing I came across. Jim called it "naive" -- but I think there is no downside to this approach, right? I have looked at various examples in the Affy file for a long time, and simply picking the first Refseq ID seems to be kosher.
>>>>>>
>>>>>> data<-read.csv("MoGene-transcript-noheader.csv",header=T,stringsAsFactors=F,sep=",")
>>>>>> sdata<- data[,c(1,9)]
>>>>>>
>>>>>> returnRef=function(x){
>>>>>>   refst<- strsplit(x,split="///")[[1]][grep("RefSeq",strsplit(x,split="///")[[1]])[1]]
>>>>>>   refid<- gsub(" ","",strsplit(refst,split="//")[[1]][1])
>>>>>>   return(refid)
>>>>>> }
>>>>>>
>>>>>> sdata$refseqids<- sapply(sdata[,2],returnRef)
>>>>>> fdata<- sdata[,-2]
>>>>>> write.table(fdata,"AnnotBuild.txt", sep="\t",quote=F,row.names=F,col.names=F)
>>>>>>
>>>>>> library(AnnotationForge)
>>>>>> library(mouse.db0)
>>>>>> library(org.Mm.eg.db)
>>>>>> makeDBPackage("MOUSECHIP_DB",
>>>>>> affy=F,
>>>>>> prefix="mogene20sttranscriptcluster",
>>>>>> fileName="AnnotBuild.txt",
>>>>>> outputDir = ".",
>>>>>> version="2.11.1",
>>>>>> baseMapType="refseq",
>>>>>> manufacturer = "Affymetrix",
>>>>>> chipName = "Mouse Gene 2.0 ST Array",
>>>>>> manufacturerUrl = "http://www.affymetrix.com",
>>>>>> author = "Kamila Naxerova",
>>>>>> maintainer = "Kamila Naxerova<naxerova at fas.harvard.edu>")
>>>>>>
>>>>>>> install.packages("mogene20sttranscriptcluster.db",repos=NULL, type="source")
>>>>>> * installing *source* package ‘mogene20sttranscriptcluster.db’ ...
>>>>>> ** R
>>>>>> ** inst
>>>>>> ** preparing package for lazy loading
>>>>>> ** help
>>>>>> *** installing help indices
>>>>>> ** building package indices
>>>>>> ** testing if installed package can be loaded
>>>>>> *** arch - i386
>>>>>> *** arch - x86_64
>>>>>>
>>>>>> * DONE (mogene20sttranscriptcluster.db)
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor