[BioC] annotation of a probe in hgu133plus2

Thu Jan 9 21:33:58 CET 2014

Hi Mayte,

Please don't take conversations off-list.

On Thursday, January 09, 2014 2:09:07 PM, Mayte Suarez-Farinas wrote:
> Hi Jim
>
> Thanks for your answer. i am using the annaffy functions to crete the
> the annotation tables. can you suggest a newer and easy package/set of
> functions
> to do this, avoiding getting a NA in probes with 2 entrezid??

Well, I can make some recommendations, but I don't have any easy 
suggestions. The reason these are coming up as NA in the first place is 
because it isn't clear what you are measuring with this probeset, or 
any other probeset that interrogates more than one gene. In fact you 
could have a scenario where one sample type is only expressing DEFB4A 
and the other is only expressing DEFB4B, but you would think there 
isn't a difference!

For this gene it may not matter - I don't know anything about this gene 
- but whatever you do, you will have to make some assumptions about ALL 
of the multiply-mapped probesets, and what might be reasonable for one 
gene may not be reasonable at all for another gene.

That is one of the reasons that MBNI came up with the re-mapped cdfs, 
so you can have some assurance that you are measuring the transcript 
from a single gene without confounding. So that is one thing you could 
consider, switching to the MBNI re-mapped cdf for this array. There are 
caveats to using those cdfs as well (mainly that the number of probes 
in a probeset varies widely, so the accuracy of your expression 
estimate varies as well. However, when you make comparisons you don't 
then take this variance into account, unless you use something like the 
puma package).

These days I am using ReportingTools rather than annaffy. But just 
switching still won't help, as ReportingTools uses similar methods to 
annotate. Instead you will have to do some work yourself first. Let's 
assume you are using limma to do the analysis, and you are creating an 
MArrayLM object called 'fit2'.

fit <- lmFit(eset, design)
fit2 <- contrasts.fit(fit, contrast)
fit2 <- eBayes(fit2)

Now we can add annotations to the fit2 object so topTable() will append 
the annotations.

gns <- select(hgu133plus2.db, featureNames(eset), 
c("ENTREZID","SYMBOL","GENENAME"))

You can add other things as well. The problem now is that the gns 
data.frame is not the right dimension because of the multiply-mapped 
probesets. This is where the tough decisions come in. As I see it you 
can:

1.) Subset gns to have just one row per probeset.
     a.) by doing something really naive like taking just the first 
instance of a probeset
     b.) or by randomly selecting one of the multiple mappings
2.) Collapse multiple probesets so e.g. your favorite gene would now be

207356_at DEFB4A|DEFB4B 1673|100289462

and you wouldn't lose any information. However, it will be trickier to 
make links to e.g., Entrez Gene for these cells if you care about those 
things. If you do 1a or 1b (let's use 1a as an example), you can do:

library(affycoretools) ## shameless plug, I know
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns

tab <- topTable(fit2, coef = 1, <other arguments go here>)

htab <- HTMLReport("coef1", "Data for contrast 1", reportDirectory = 
"reports/")
publish(tab, htab, .modifyDF = list(entrezLinks, affyLinks))
finish(htab)
indx <- HTMLReport("index", "Main page")
publish(Link(htab), indx)
finish(indx)

And then you will have a file called index.html that you can double 
click, and on that page will be a link you can use to see your data. 
The link will bring up an HTML table and there will be links to netaffx 
and Entrez Gene for the probeset IDs and Gene IDs (that's why you load 
affycoretools; to get the entrezLinks and affyLinks functions).

If you want to go through door #2, you won't be able to make links to 
Entrez Gene without creating a function of your own. Instead you could 
forgo those links and just do

gnlst <- tapply(1:nrow(gns), gns$PROBEID, function(x) gns[x,])
gnlst <- lapply(gnlst, function(x) apply(x, 2, function(y) 
paste(unique(y), collapse = " | ")))
gns <- do.call("rbind", gnlst)

fit2$genes <- gns

and everything else the same as above except no entrezLinks for 
.modifyDF.

Best,

Jim

>
> thanks!!!
>
> Mayte Suarez-Farinas
> Research Assistant Professor
> Laboratory of Investigative Dermatology
> Biostatistician, Center for Clinical and Translational Science
> The Rockefeller University
> 1230 York Ave, Box 178
> New York, NY 10065
> Phone:  +1(212) 327-8213
> Fax:       +1(212) 327-8232
>
>
>
> On Jan 9, 2014, at 1:11 PM, James W. MacDonald wrote:
>
>> Hi Mayte,
>>
>> The gene isn't missing, instead this probeset interrogates two
>> different genes. Plus you are using really old methods to get the
>> data you want. These days you should use select().
>>
>> > select(hgu133plus2.db, "207356_at", c("SYMBOL","ENTREZID"))
>>    PROBEID SYMBOL  ENTREZID
>> 1 207356_at DEFB4A      1673
>> 2 207356_at DEFB4B 100289462
>> Warning message:
>> In .generateExtraRows(tab, keys, jointype) :
>>  'select' resulted in 1:many mapping between keys and return rows
>>
>> And you get a warning saying that this probeset maps to more than one
>> gene and symbol.
>>
>> Best,
>>
>> Jim
>>
>> On 1/9/2014 12:10 PM, Mayte Suarez-Farinas wrote:
>>> Dear annotation builders
>>>
>>> I notice that one of my favorite genes DEFB4A is missing from the
>>> hgu133plus2.db
>>> the NetAffy database says that probeset 207356_at correspond to gene
>>> DEFB4A
>>> yet getSYMBOL('207356_at','hgu133plus2.db') returns NA. This happen
>>> in newer versions of
>>> this package, since it has been tehre before.. This gene is
>>> extremely important in
>>> lots skin diseases, inflammation etc, so it is a concern for my
>>> colleagues
>>> Any help is appreciated,
>>>
>>> Mayte Suarez-Farinas
>>> Research Assistant Professor
>>> Laboratory of Investigative Dermatology
>>> Biostatistician, Center for Clinical and Translational Science
>>> The Rockefeller University
>>> 1230 York Ave, Box 178
>>> New York, NY 10065
>>> Phone:  +1(212) 327-8213
>>> Fax:       +1(212) 327-8232
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099