[BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays

Wed Jun 27 17:07:09 CEST 2012

Thanks Andreas! That's really useful information, I will have a look.

Out of interest, did you look at the distribution of expression levels
for the different probe sets? If you are including all probe sets, I
would guess that a lot of unexpressed predicted/intronic probe sets
could bias your gene-level estimate, i.e. if their proportion is
above the breakdown point of the summarisation/aggregation method.

Although perhaps the CDF from annmap takes care of that?

Cheers!

Jim
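
A minimal sketch of the breakdown-point concern raised above, using
made-up numbers rather than real array data. It assumes one gene with
10 probe sets on the log2 scale, 6 of which sit at background; the
values, proportions and variable names are all hypothetical.

```r
# Toy data: one gene, 10 probe sets, log2 scale (all values are invented).
set.seed(1)
expressed   <- rnorm(4, mean = 10, sd = 0.3)  # exonic probe sets, gene is "on"
unexpressed <- rnorm(6, mean = 3,  sd = 0.3)  # intronic/predicted probe sets at background
probesets   <- c(expressed, unexpressed)

# The median tolerates up to 50% contamination. Here 60% of the probe sets
# are unexpressed, so the median lands in the background cluster.
gene_mean   <- mean(probesets)
gene_median <- median(probesets)

# Restricting to the probe sets believed to be exonic recovers the signal.
gene_median_exonic <- median(expressed)

print(c(mean = gene_mean, median = gene_median, exonic_median = gene_median_exonic))
```

With more than half of the probe sets unexpressed, the median reports
background even though the gene is expressed, and the mean lands
somewhere in between. Filtering probe sets before summarising avoids
both problems in this toy case.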

On 27 June 2012 16:45, Andreas Heider <aheider at trm.uni-leipzig.de> wrote:
> Ok, sorry, that was the "short answer". Here comes the longer one:
> 1. get a CDF for the chip, get it at http://annmap.picr.man.ac.uk/download/
> 2. load CEL files using standard affy package
> 3. assign the downloaded CDF to your AffyBatch object
> 4. calculate RMA or whatever you want (NOTE: this will get you all
> probesets, no restrictions as in eg "core")
> 5. pull the whole set of identifiers from biomaRt and annotate your
> expression matrix with this information
> 6. "collapse" probesets targeting the same identifier to their mean, median
> or medpolish, whatever suits your needs best, via functions such as "recast"
> or "aggregate"
> 7. have fun with your new expression matrix!
>
> Hope that helps, I needed also some time to figure out the individual steps.
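
Steps 5 and 6 above can be sketched in base R as follows. The matrix,
the probeset IDs and the `map` data frame are made up for illustration;
in practice `expr` would come from exprs() on the RMA result and `map`
from a biomaRt query. The sketch assumes each probeset maps to at most
one identifier, which real mappings do not guarantee.

```r
# Simulated probeset-level log2 expression matrix: 6 probesets x 3 samples.
set.seed(42)
expr <- matrix(rnorm(18, mean = 8), nrow = 6,
               dimnames = list(paste0("ps", 1:6), paste0("sample", 1:3)))

# Hypothetical probeset-to-gene mapping, standing in for the biomaRt result.
map <- data.frame(probeset = paste0("ps", 1:6),
                  gene     = c("GeneA", "GeneA", "GeneA", "GeneB", "GeneB", NA),
                  stringsAsFactors = FALSE)

# Drop probesets without an identifier, then align the matrix rows to the mapping.
map  <- map[!is.na(map$gene), ]
expr <- expr[map$probeset, , drop = FALSE]

# Collapse probesets that share an identifier to their per-sample median.
collapsed <- aggregate(as.data.frame(expr), by = list(gene = map$gene), FUN = median)

# Turn the result back into a numeric matrix with gene identifiers as row names.
gene_expr <- as.matrix(collapsed[, -1])
rownames(gene_expr) <- collapsed$gene

print(gene_expr)
```

Swapping `FUN = median` for `mean` gives the mean-based variant
mentioned in step 6. Probesets that map to several identifiers would
need a policy decision (drop, duplicate, or assign) before this step.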
>
>
> 2012/6/27 James Perkins <jperkins at biochem.ucl.ac.uk>
>>
>> Thanks for the pointer Andreas,
>>
>> How did you go from probe sets for a given gene to the transcript
>> level? And how did you know if it was "core", "extended", "full"
>> confidence?
>>
>> Also, how did you summarise the probeset expression levels to make a
>> transcript? Using biomart I get ~25k unique ensembl genes mapping to
>> probe set ids, which is much higher than when I follow the oligo
>> pipeline and perform RMA at core/extended/full level, and use getAffx
>> for annotation.
>>
>> Thanks,
>>
>> Jim
>>
>> On 27 June 2012 16:03, Andreas Heider <aheider at trm.uni-leipzig.de> wrote:
>> > Dear Jim,
>> > I pulled all relevant annotation via biomaRt, as biomart has all mappings
>> > of exon array probeset IDs to e.g. ENTREZID or GENESYMBOL. Then you can go
>> > on from that.
>> >
>> > Cheers,
>> > Andreas
>> >
>> >
>> > 2012/6/27 James Perkins <jperkins at biochem.ucl.ac.uk>
>> >>
>> >> Hi,
>> >>
>> >> I wasn't sure if it was worth starting a new thread for this, since
>> >> my question is very much related to this thread...
>> >>
>> >> Is there any plan to include the "comprehensive" exon array mappings?
>> >>
>> >> E.g. for rat:
>> >>
>> >> If one goes here
>> >>
>> >> http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1
>> >>
>> >> then to the Technical Documentation tab
>> >>
>> >> and downloads the
>> >>
>> >> "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full,
>> >> extended and comprehensive rn4" data
>> >>
>> >> http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip
>> >>
>> >> There are the core/extended/full ps and mps files here.
>> >>
>> >> However there is also a comprehensive mps file.
>> >>
>> >> Full, core and extended are from 2006.
>> >>
>> >> The comprehensive is from 2010 (and gets updated more regularly), so
>> >> perhaps would be a better file to use for getNetAffx ?
>> >>
>> >> Apologies if this has been covered before. I am never sure of what is
>> >> the best way to analyse exon array data at the gene level.
>> >>
>> >> Thanks,
>> >>
>> >> Jim
>> >>
>> >>
>> >>
>> >>
>> >> On 13 June 2012 21:37, Benilton Carvalho <beniltoncarvalho at gmail.com>
>> >> wrote:
>> >> >
>> >> > please correct the code below to:
>> >> >
>> >> > eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is available)
>> >> >
>> >> > and if you want results at the exon level
>> >> >
>> >> > eset = rma(raw, target='probeset')
>> >> > featureData(eset) = getNetAffx(raw, 'probeset')
>> >> >
>> >> > apologies for the mistake below.
>> >> >
>> >> > b
>> >> >
>> >> > On 13 June 2012 20:11, Benilton Carvalho <beniltoncarvalho at gmail.com>
>> >> > wrote:
>> >> > > FWIW, remember that you can obtain the contents of the annotation
>> >> > > files (the NA32 Affymetrix files) with:
>> >> > >
>> >> > > library(Biobase)
>> >> > > library(oligo)
>> >> > > raw = read.celfiles(list.celfiles())
>> >> > > eset = rma(raw, target='transcript')
>> >> > > featureData(eset) = getNetAffx(eset, 'transcript')
>> >> > > head(fData(eset))
>> >> > >
>> >> > > b
>> >> > >
>> >> > > On 13 June 2012 15:47, James W. MacDonald <jmacdon at uw.edu> wrote:
>> >> > >> Hi Andreas,
>> >> > >>
>> >> > >>
>> >> > >> On 6/13/2012 3:14 AM, Andreas Heider wrote:
>> >> > >>>
>> >> > >>> Dear mailing list,
>> >> > >>> I know this was on the list a couple of times, and I think I read it
>> >> > >>> all, but actually I still don't get it right. So here is my problem:
>> >> > >>>
>> >> > >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse
>> >> > >>> Gene 1.0 ST) in a similar fashion to e.g. HG-U133 arrays.
>> >> > >>> That means, I want to finally have it accessible as an ExpressionSet
>> >> > >>> object with the right Bioconductor annotation assigned. This should
>> >> > >>> include GENE SYMBOLS, RefSeq IDs and ENTREZ IDs.
>> >> > >>
>> >> > >>
>> >> > >> The problem here is that you want to do something that AFAIK isn't
>> >> > >> easy to do. The Gene ST arrays allow you to summarize all the probes
>> >> > >> that interrogate a particular transcript (e.g., all the exon-level
>> >> > >> probesets are collapsed to transcript level, and then you summarize).
>> >> > >> However, for the Exon ST arrays that isn't the case, unless there is
>> >> > >> something in xps to allow for that - I know next to nothing about that
>> >> > >> package, so Cristian Stratowa will have to chime in if I am missing
>> >> > >> something.
>> >> > >>
>> >> > >> For the Exon chips, you are always summarizing at the same probeset
>> >> > >> level, where there are <= 4 probes per probeset, and there can be any
>> >> > >> number of probesets that interrogate a given exon. Lots of these
>> >> > >> probesets interrogate regions that aren't even transcribed, according
>> >> > >> to current knowledge of the genome. When you choose core, extended or
>> >> > >> full probesets, you are just changing the number of probesets being
>> >> > >> used, not summarizing at a different level as with the Gene ST chip.
>> >> > >>
>> >> > >> So when you say you want gene symbols, refseq ids and gene ids, what
>> >> > >> exactly are you after? If a given probeset is in the intron of a gene
>> >> > >> do you want to annotate it as being part of that gene? How about if it
>> >> > >> is in the UTR (or really close to the UTR)? What do you want to do
>> >> > >> with the probesets where one or more of the probes binds in multiple
>> >> > >> positions in the genome? These are all questions that the exonmap
>> >> > >> package tries to consider, and it gets really complicated. That's why
>> >> > >> Affy went with the Gene ST chips - they unleashed the Exon chips on us
>> >> > >> and couldn't sell them because people were saying WTF do I do with
>> >> > >> this thing?
>> >> > >>
>> >> > >> I don't think there is an easy or obvious answer to your question. If
>> >> > >> you were to come up with what you think are reasonable answers to my
>> >> > >> questions, then it wouldn't be much work to extract the chr, start,
>> >> > >> end from the pd.moex.1.0.st.v1 package, and then use GenomicFeatures
>> >> > >> (e.g., findOverlaps()) to decide what regions are being interrogated,
>> >> > >> and annotate from there.
>> >> > >>
>> >> > >> Best,
>> >> > >>
>> >> > >> Jim
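
The overlap-based annotation suggested above can be illustrated with a
small base-R stand-in. This is not the findOverlaps() call itself; it
only shows the interval logic that such a call applies. All coordinates,
probeset IDs and gene names below are hypothetical, and real probeset
positions would have to be extracted from the platform design package.

```r
# Hypothetical probeset coordinates (invented for illustration).
probesets <- data.frame(id    = c("ps1", "ps2", "ps3", "ps4"),
                        chr   = c("chr1", "chr1", "chr1", "chr2"),
                        start = c(100, 450, 900, 100),
                        end   = c(125, 475, 925, 125),
                        stringsAsFactors = FALSE)

# Hypothetical exon annotation for two genes.
exons <- data.frame(gene  = c("GeneA", "GeneA", "GeneB"),
                    chr   = c("chr1", "chr1", "chr2"),
                    start = c(90, 400, 50),
                    end   = c(200, 500, 300),
                    stringsAsFactors = FALSE)

# Two closed intervals on the same chromosome overlap when each one
# starts no later than the other one ends.
annotate_probeset <- function(i) {
  hit <- exons$chr == probesets$chr[i] &
         exons$start <= probesets$end[i] &
         exons$end   >= probesets$start[i]
  genes <- unique(exons$gene[hit])
  if (length(genes) == 0) NA_character_ else paste(genes, collapse = ";")
}

probesets$gene <- vapply(seq_len(nrow(probesets)), annotate_probeset, character(1))
print(probesets)
```

Here `ps3` overlaps no exon and stays unannotated, which is exactly the
intronic/intergenic case that needs a decision before annotating. For
genome-scale data the pairwise comparison above is too slow, and an
interval-tree based overlap function on range objects is the practical
choice.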
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>>
>> >> > >>> I can import it as an AffyBatch and generate an ExpressionSet with
>> >> > >>> the help of the Xmap/exonmap supplied CDF, but there is no
>> >> > >>> annotation attached to it.
>> >> > >>>
>> >> > >>> OR
>> >> > >>>
>> >> > >>> I can import the CEL files with the "oligo" package as an Exon Array
>> >> > >>> object and generate an ExpressionSet from it.
>> >> > >>> However, in that case it still has no annotation.
>> >> > >>>
>> >> > >>> Surprisingly, on the Bioconductor website there are all packages
>> >> > >>> needed to deal with Mouse Gene 1.0 ST arrays, but the information to
>> >> > >>> work with Mouse Exon 1.0 ST arrays seems to be missing!
>> >> > >>>
>> >> > >>> What am I doing wrong here? Has someone else had such problems?
>> >> > >>>
>> >> > >>> Thanks in advance for your effort,
>> >> > >>> Andreas
>> >> > >>>
>> >> > >>> _______________________________________________
>> >> > >>> Bioconductor mailing list
>> >> > >>> Bioconductor at r-project.org
>> >> > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> > >>> Search the archives:
>> >> > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >> > >>
>> >> > >>
>> >> > >> --
>> >> > >> James W. MacDonald, M.S.
>> >> > >> Biostatistician
>> >> > >> University of Washington
>> >> > >> Environmental and Occupational Health Sciences
>> >> > >> 4225 Roosevelt Way NE, # 100
>> >> > >> Seattle WA 98105-6099
>> >
>> >
>
>