[BioC] observations on affyprobeminer

Tue May 6 05:55:07 CEST 2008

> I was not clear in my original post: I filtered out duplicate Entrez Gene
> IDs, taking only the one with the lowest p-value.

I see...
Then there is really no simple answer to what you are observing; some of
the probesets in the Affymetrix mapping are the result of combining
information from several sources, and are possibly measuring a genuine RNA
signal that is not (yet) included in RefSeq for example.
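
As an aside, the filtering step quoted above (keeping, for each Entrez Gene ID, the probeset with the lowest p-value) could be sketched as follows in base R. The data frame `res`, its column names, and all values are invented for illustration; they do not come from the original analysis.

```r
# Hypothetical results table: one row per probeset (all values are made up).
res <- data.frame(
  probeset = c("ps1", "ps2", "ps3", "ps4", "ps5"),
  entrez   = c("101", "101", "202", "303", "303"),
  pvalue   = c(0.04, 0.001, 0.20, 0.03, 0.50),
  stringsAsFactors = FALSE
)

# Sort by p-value so that the best probeset of each gene comes first ...
res_sorted <- res[order(res$pvalue), ]

# ... then keep only the first row seen for each Entrez Gene ID.
res_unique <- res_sorted[!duplicated(res_sorted$entrez), ]
```

Note that picking the minimum p-value per gene is itself a selection step, so it is probably best done the same way for every mapping being compared.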

Hongfang has described in more detail what could be happening with splice
variants.

MBNI also provides alternative mappings built from several sources of
target sequences (RefSeq, Ensembl, ...); they are definitely worth a look.
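
When comparing results obtained with different mappings, the overlap between lists of significant genes (such as the 90% figure quoted further down) can be computed with base R set operations. This is only a sketch: the ID vectors below are invented, and the variable names are hypothetical.

```r
# Hypothetical Entrez Gene IDs annotated by each mapping (made-up values).
ids_affy <- c("101", "202", "303", "404", "505", "707")
ids_apm  <- c("101", "202", "303", "606", "707")

# Hypothetical significant genes under each mapping (made-up values).
sig_affy <- c("101", "202", "303", "404")
sig_apm  <- c("101", "202", "606")

# Restrict the comparison to genes annotated by both mappings.
shared          <- intersect(ids_affy, ids_apm)
sig_affy_shared <- intersect(sig_affy, shared)
sig_apm_shared  <- intersect(sig_apm, shared)

# Fraction of the second list that is also found in the first.
overlap <- length(intersect(sig_affy_shared, sig_apm_shared)) /
  length(sig_apm_shared)

# Significant genes that only the first mapping annotates.
sig_affy_only <- setdiff(sig_affy, ids_apm)
```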

> I do appreciate your advice; we still don't have a perfect solution and it
> is good to use multiple tools to look at each dataset, always taking into
> consideration a priori the questions that we really want answered, the
> cost/benefit ratio of increased sensitivity/decreased specificity, etc.
> with various techniques.

When it comes to building hypotheses around biology, the mappings help
tell whether a probeset is definitely measuring the level of a
particular transcript or might be measuring a less well-defined "something".

> I do thank you for your efforts at developing APM, it is a great tool and
> one that I am sure to use, and publish with, in the future,

All credit for affyprobeminer should go to its authors.
(After looking it up, the list seems to be here:
http://gauss.dbb.georgetown.edu/liblab/affyprobeminer/credits.html )

My effort on the issue was made available in the package "altcdfenvs".

L.

> Mark
>
> On Sun, May 4, 2008 at 5:46 AM, <lgautier at altern.org> wrote:
>
>> > I have recently explored the use of alternative CDFs from
>> > affyprobeminer (APM) on a 36-array dataset derived using the Affy
>> > rat2302 chipset. I used both the Affy cdf and the transcript-level
>> > affyprobeminer cdf. I preprocessed using RMA, filtered using an A/P
>> > filter, and statistically analyzed using an appropriate lme model
>> > followed by qvalue FDR correction. I set my FDR threshold at 5%. I
>> > eliminated duplicate genes by picking the one with the lowest p-value.
>> >
>> > Using the Affy cdf, I got ~2000 sig. genes, with APM ~1000. If I choose
>> > only those EntrezGene identifiers present on both cdfs, my number sig.
>> > with the APM cdf was ~1000 and there was a 90% overlap with the Affy
>> > sig. list. My conclusion from the latter observation is that I am
>> > measuring largely the same transcripts/genes with both CDFs.
>> >
>> > I was interested in the ~1000 genes which are annotated with the Affy
>> > CDF but not the APM cdf. Following the logic behind APM, I would assume
>> > that these would be largely incorrectly annotated probesets or
>> > probesets that are not really measuring any "real" transcript. This
>> > list should, then, consist largely of random genes. To test this
>> > hypothesis, I used the Category package to test for
>> > over-representation of GO and KEGG categories in my various lists. What
>> > I found was a huge degree of overlap between: 1. the affy genes also
>> > annotated with APM, 2. the affy genes not annotated with APM, 3. the
>> > genes derived solely from APM.
>> >
>> > My conclusion from this latest observation is that APM is not
>> > annotating a large number of genes/transcripts that are in fact real.
>> > Assuming that APM is correctly throwing out some "junk" probesets, is
>> > it throwing out the baby with the bathwater?
>>
>> Not necessarily.
>> With Affymetrix mappings, there are a large number of cases in which
>> there are multiple probesets for a "gene" (in the example below with
>> hgu133a, that represents 20% of the probesets), and those probesets
>> can be collapsed into one when remapping.
>> Here is an example with a few probesets (the example is mostly a
>> copy-paste from one of the examples in the vignette "altcdfenvs"):
>> geneSymbols <- c("IGKC", "IL8", "NENF", "TRIO")
>> # Count the probesets associated with our geneSymbols
>> library(hgu133a.db)
>> sapply(geneSymbols,
>>       function(x) length(mappedkeys(subset(hgu133aSYMBOL, Rkeys=x))))
>> # This returns:
>> #IGKC  IL8 NENF TRIO
>> #  15   12    6    9
>> # Which means that there are 9 probesets for TRIO, 6 for NENF, etc...
>> # Now let's check what comes out of remapping
>> library(altcdfenvs)
>> library(hgu133aprobe)  # probe sequences used by matchAffyProbes() below
>> library(biomaRt)
>> mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
>> getSeq <- function(name) {
>>  seq <- getSequence(id=name, type="hgnc_symbol",
>>                     seqType="cdna", mart = mart)
>>  targets <- seq$cdna
>>  if (is.null(targets))
>>    return(character(0))
>>  names(targets) <- paste(seq$hgnc_symbol, 1:nrow(seq), sep="-")
>>  return(targets)
>> }
>> targets <- unlist(lapply(geneSymbols,
>>                         getSeq))
>> m <- matchAffyProbes(hgu133aprobe, targets, "HG-U133A")
>> hg <- toHypergraph(m)
>> gn <- toGraphNEL(hg)
>> library(RColorBrewer)
>> col <- brewer.pal(length(geneSymbols)+1, "Set1")
>> tColors <- rep(col[length(col)], length=numNodes(gn))
>> names(tColors) <- nodes(gn)
>> for (col_i in 1:(length(col)-1)) {
>>  node_i <- grep(paste("^", geneSymbols[col_i],
>>                       "-", sep=""),
>>                       names(tColors))
>>  tColors[node_i] <- col[col_i]
>> }
>> nAttrs <- list(fillcolor = tColors)
>> library(Rgraphviz)
>> plot(gn, "twopi", nodeAttrs=nAttrs)
>> # The plot will show that the situation is not as simple for TRIO
>> # as it is with the other gene symbols.
>>
>> > I'd be interested to hear the thoughts and experiences of others. I've
>> > certainly run into occasions where Affy annotated probesets turn out to
>> > represent introns or something other than they purport to be, and I was
>> > hoping that APM would solve this problem, but I don't want to use it if
>> > it means a massive loss of truly significant data.
>>
>> The situation is indeed not always clear... At the moment, I would not
>> advise you to follow any particular mapping blindly, but rather to have
>> alternative mappings as part of your routine analysis: depending on the
>> cost of follow-up experiments, or downstream analysis, time should be
>> spent looking at probesets in detail.
>>
>> Hoping this helps,
>> L.
>> > Mark
>> >
>> >
>> >
>> > --
>> > Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
>> > Indiana University School of Medicine
>> >
>> > 15032 Hunter Court, Westfield, IN 46074
>> >
>> > (317) 490-5129 Work, & Mobile & VoiceMail
>> > (317) 663-0513 Home (no voice mail please)