[BioC] Issues about how to filter and annotate the MoGene-2_0-st and MoEx-1_0-st-v1 array probe sets

Thu Jun 12 16:55:59 CEST 2014

Hi Chao,

Please don't take conversations off-list (e.g., use 'Reply-all' when 
responding).

On 6/11/2014 5:02 PM, 张超 wrote:
> Hi Jim,
>
> Thanks a lot for your prompt reply and I really learned a lot from it.
> But I don't understand what do you mean as you said below,
>
>   >You will have to deal with those multiple mappings as you see fit.
> But after doing so, you can then do
>
>>fit$genes <- gns
>
>  >and your topTable object will then be populated with the annotations.
>
> My way to deal with those multiple mappings is to retain only the first
> one probe set of the multiple mappings(i.e. gns_rm_multmap).And I have
> finished it. But what to do next and what is fit$genes? Could you please
> explain it in more details? I have found a way to populated with the
> annotations with "for,if else" commands, and it costs too much time for
> my machine to run these commands. It seems that your way is much easier,
> but I haven't got it.

The MArrayLM object (your 'fit2' object below) has a slot for gene 
annotations, and as long as the annotation object is in the correct 
order you can just put it in there. So you have done something like this:

fit <- lmFit(eset, design)
fit2 <- contrasts.fit(fit, contrast)
fit2 <- eBayes(fit2)

And then if you annotate those genes

gns <- select(hugene10sttranscriptcluster.db, row.names(fit2$coef), 
c("ENTREZID","SYMBOL","GENENAME"))

and then remove duplicates like you said you did

gns <- gns[duplicated(gns[,1]),]

you can check first that things line up

all.equal(row.names(fit2$coef), gns[,1])

and if that is TRUE (it should be), you can now put those data into your 
fit2 object

fit2$genes <- gns

and then if you do

topTable(fit2, 1)

you will have all the annotation in the result as well.

And then you can do something like

idx <- HTMLReport("index", "Look at these results!")
publish(fit2, idx, eset, coef=1, pvalueCutoff = 0.05)
finish(idx)

and you will have an index.html file that you can open to look at your 
results. See the ReportingTools vignettes for more information.

>
> In addition, as you said, you would not recommend summarizing the oligo
> data at the probeset level. However if rma(target = "core") is used, how
> to use paCalls, and if paCalls("DABG") is used, how to use rma to make
> them to be consistent?

I have no idea. I don't use paCalls().

Best,

Jim

>
> Looking forward to your reply. Many thanks in advance..
>
> Best Regards
>
> Chao
> ‍
>
>
>
>
>
>
>
>
> At 2014-06-10 10:26:47, "James W. MacDonald" <jmacdon at uw.edu> wrote:
>>Hi Chao,
>>
>>On 6/8/2014 10:37 AM, 张超 wrote:
>>> Dear list,
>>>
>>>
>>> I would like to use the paCalls from oligo package for filtering probe sets with absence of transcripts. My data are from MoGene-2_0-st and MoEx-1_0-st-v1 array (Affymetrix). My data after reading CEL files is a GeneFeatureSet with 12 samples (6 for control groups, and 6 for experimental groups). What should I do with these data computed by paCalls(PSDABG) as below ?
>>>> library(oligo)
>>>> OligoRawData<-read.celfiles(CEL file lists)
>>>> eset<-rma(OligoRawData)
>>>> dagbPS <- paCalls(OligoRawData, "PSDABG")
>>> What to do next to filter the probe sets? Could you please send me a complete examples and a detailed explanation for it?
>>>
>>
>>You need to decide what constitutes 'present' and how many samples have
>>to be present in order to keep the probeset.
>>
>>So if I were to say that a p < 0.05 is present and I needed 20 such
>>samples, I could do
>>
>>keep <- rowSums(dagbPS < 0.05) > 19
>>eset <- eset[keep,]
>>
>>If the above code is mysterious to you, then you need to read 'An
>>Introduction to R'.
>>
>>>
>>> In addition, moex10sttranscriptcluster.db can be used for annotation of data from MoEx-1_0-st-v1 array, and both of mogene20stprobeset.db and mogene20sttranscriptcluster.db can be used for that of data from MoGene-2_0-st (including both of gene and lncRNA lists). But only more than half of the probe sets are anotated with gene symbols by below commands.
>>>> results<-decideTests(fit2, method="global", adjust.method="fdr", p.value=0.05, lfc=0.5) #DEGs determination by t tests
>>>> genesymbol = getText(aafSymbol(rownames(results), "moex10sttranscriptcluster.db" ));#annotated by moex10sttranscriptcluster.db for data get from MoEx-1_0-st-v1 array
>>> Only 1217 and 24709 can be annotated by mogene20stprobeset.db and mogene20sttranscriptcluster.db seperately for data of MoGene-2_0-st (length(genesymbol[which(genesymbol!="")])). But the total num is 41345 (length(results)). Only 14966 can be mapped by moex10sttranscriptcluster.db for data of  MoEx-1_0-st-v1 (total num is 23332 - length(results)). Should I need to add some more db for the annotation?
>>>
>>
>>The annotation packages with 'transcriptcluster' in their names are for
>>instances where you have summarized probesets at the transcript level
>>(which is the default for rma() in oligo). If you want to summarize at
>>the probeset level (which I would not recommend doing, btw), you need to
>>use target = "probeset" in your call to rma().
>>
>>In other words, you should only be using the transcriptcluster
>>annotation packages. Although please note that the
>>moex10transcriptcluster.db package is for the Mouse Exon 10 ST array,
>>not the Gene ST array.
>>
>>There are any number of reasons that only a subset of probesets on the
>>array have symbols. First, there are lots of controls, which won't have
>>gene symbols. Second, the lincRNA/snoRNA/miRNA probesets that Affy put
>>on these array won't have gene symbols either (because, they aren't
>>genes). Third, there is still some speculative content on these arrays;
>>things that might end up being genes, with gene names, in the future,
>>but which are just hypothetical at this point in time. Fourth, the
>>annaffy package uses the old style methods of getting annotations, in
>>which case any probeset that matches more than one gene symbol will be
>>masked.
>>
>>You will be much better served if you were to do something like
>>
>>gns <- select(mogene10sttranscriptcluster.db, featureNames(eset),
>>c("ENTREZID","SYMBOL","GENENAME"))
>>
>>Which will result in a warning that you have multiple mappings. You will
>>have to deal with those multiple mappings as you see fit. But after
>>doing so, you can then do
>>
>>fit$genes <- gns
>>
>>and your topTable object will then be populated with the annotations.
>>You might then consider using the ReportingTools package, which is under
>>active development and maintenance, rather than the annaffy package
>>which may still be actively maintained, but is no longer AFAICT under
>>active development.
>>
>>Best,
>>
>>Jim
>>
>>
>>
>>>
>>> BTW, I am a beginner of this field. I found there are too few documents for examples about how to use functions of oligo package. Could you please also give me some suggestions? Looking forword to your reply. I really appreciate for your any helps.
>>>
>>>
>>> Thanks again.
>>>
>>>
>>> Best regards.
>>>
>>>
>>> Chao
>>>
>>>
>>>
>>>
>>> 	[[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>--
>>James W. MacDonald, M.S.
>>Biostatistician
>>University of Washington
>>Environmental and Occupational Health Sciences
>>4225 Roosevelt Way NE, # 100
>>Seattle WA 98105-6099
>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099