[BioC] annotation of a probe in hgu133plus2
James W. MacDonald
jmacdon at uw.edu
Thu Jan 9 21:33:58 CET 2014
Hi Mayte,
Please don't take conversations off-list.
On Thursday, January 09, 2014 2:09:07 PM, Mayte Suarez-Farinas wrote:
> Hi Jim
>
> Thanks for your answer. i am using the annaffy functions to crete the
> the annotation tables. can you suggest a newer and easy package/set of
> functions
> to do this, avoiding getting a NA in probes with 2 entrezid??
Well, I can make some recommendations, but I don't have any easy
suggestions. The reason these are coming up as NA in the first place is
because it isn't clear what you are measuring with this probeset, or
any other probeset that interrogates more than one gene. In fact you
could have a scenario where one sample type is only expressing DEFB4A
and the other is only expressing DEFB4B, but you would think there
isn't a difference!
For this gene it may not matter - I don't know anything about this gene
- but whatever you do, you will have to make some assumptions about ALL
of the multiply-mapped probesets, and what might be reasonable for one
gene may not be reasonable at all for another gene.
That is one of the reasons that MBNI came up with the re-mapped cdfs,
so you can have some assurance that you are measuring the transcript
from a single gene without confounding. So that is one thing you could
consider, switching to the MBNI re-mapped cdf for this array. There are
caveats to using those cdfs as well (mainly that the number of probes
in a probeset varies widely, so the accuracy of your expression
estimate varies as well. However, when you make comparisons you don't
then take this variance into account, unless you use something like the
puma package).
These days I am using ReportingTools rather than annaffy. But just
switching still won't help, as ReportingTools uses similar methods to
annotate. Instead you will have to do some work yourself first. Let's
assume you are using limma to do the analysis, and you are creating an
MArrayLM object called 'fit2'.
fit <- lmFit(eset, design)
fit2 <- contrasts.fit(fit, contrast)
fit2 <- eBayes(fit2)
Now we can add annotations to the fit2 object so topTable() will append
the annotations.
gns <- select(hgu133plus2.db, featureNames(eset),
c("ENTREZID","SYMBOL","GENENAME"))
You can add other things as well. The problem now is that the gns
data.frame is not the right dimension because of the multiply-mapped
probesets. This is where the tough decisions come in. As I see it you
can:
1.) Subset gns to have just one row per probeset.
a.) by doing something really naive like taking just the first
instance of a probeset
b.) or by randomly selecting one of the multiple mappings
2.) Collapse multiple probesets so e.g. your favorite gene would now be
207356_at DEFB4A|DEFB4B 1673|100289462
and you wouldn't lose any information. However, it will be trickier to
make links to e.g., Entrez Gene for these cells if you care about those
things. If you do 1a or 1b (let's use 1a as an example), you can do:
library(affycoretools) ## shameless plug, I know
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
tab <- topTable(fit2, coef = 1, <other arguments go here>)
htab <- HTMLReport("coef1", "Data for contrast 1", reportDirectory =
"reports/")
publish(tab, htab, .modifyDF = list(entrezLinks, affyLinks))
finish(htab)
indx <- HTMLReport("index", "Main page")
publish(Link(htab), indx)
finish(indx)
And then you will have a file called index.html that you can double
click, and on that page will be a link you can use to see your data.
The link will bring up an HTML table and there will be links to netaffx
and Entrez Gene for the probeset IDs and Gene IDs (that's why you load
affycoretools; to get the entrezLinks and affyLinks functions).
If you want to go through door #2, you won't be able to make links to
Entrez Gene without creating a function of your own. Instead you could
forgo those links and just do
gnlst <- tapply(1:nrow(gns), gns$PROBEID, function(x) gns[x,])
gnlst <- lapply(gnlst, function(x) apply(x, 2, function(y)
paste(unique(y), collapse = " | ")))
gns <- do.call("rbind", gnlst)
fit2$genes <- gns
and everything else the same as above except no entrezLinks for
.modifyDF.
Best,
Jim
>
> thanks!!!
>
> Mayte Suarez-Farinas
> Research Assistant Professor
> Laboratory of Investigative Dermatology
> Biostatistician, Center for Clinical and Translational Science
> The Rockefeller University
> 1230 York Ave, Box 178
> New York, NY 10065
> Phone: +1(212) 327-8213
> Fax: +1(212) 327-8232
>
>
>
> On Jan 9, 2014, at 1:11 PM, James W. MacDonald wrote:
>
>> Hi Mayte,
>>
>> The gene isn't missing, instead this probeset interrogates two
>> different genes. Plus you are using really old methods to get the
>> data you want. These days you should use select().
>>
>> > select(hgu133plus2.db, "207356_at", c("SYMBOL","ENTREZID"))
>> PROBEID SYMBOL ENTREZID
>> 1 207356_at DEFB4A 1673
>> 2 207356_at DEFB4B 100289462
>> Warning message:
>> In .generateExtraRows(tab, keys, jointype) :
>> 'select' resulted in 1:many mapping between keys and return rows
>>
>> And you get a warning saying that this probeset maps to more than one
>> gene and symbol.
>>
>> Best,
>>
>> Jim
>>
>> On 1/9/2014 12:10 PM, Mayte Suarez-Farinas wrote:
>>> Dear annotation builders
>>>
>>> I notice that one of my favorite genes DEFB4A is missing from the
>>> hgu133plus2.db
>>> the NetAffy database says that probeset 207356_at correspond to gene
>>> DEFB4A
>>> yet getSYMBOL('207356_at','hgu133plus2.db') returns NA. This happen
>>> in newer versions of
>>> this package, since it has been tehre before.. This gene is
>>> extremely important in
>>> lots skin diseases, inflammation etc, so it is a concern for my
>>> colleagues
>>> Any help is appreciated,
>>>
>>> Mayte Suarez-Farinas
>>> Research Assistant Professor
>>> Laboratory of Investigative Dermatology
>>> Biostatistician, Center for Clinical and Translational Science
>>> The Rockefeller University
>>> 1230 York Ave, Box 178
>>> New York, NY 10065
>>> Phone: +1(212) 327-8213
>>> Fax: +1(212) 327-8232
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list