[BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix NEGATIVE CONTROL PROBES

James W. MacDonald jmacdon at uw.edu
Mon Jun 11 17:49:42 CEST 2012


Hi Rich,

On 6/11/2012 11:24 AM, Richard Friedman wrote:
> Jim and list,
>
>     Thank you for bringing the NCprobe option to my attention.
> I did not know that it had been implemented.
>
> Does anyone out there have  a list of negative control probes more
> Human st 1.0 and for Mouse st 1.0 ?

No, but it's easy enough to get.

 > library(pd.hugene.1.0.st.v1)
 > con <- db(pd.hugene.1.0.st.v1)
 > dbListTables(con)
[1] "chrom_dict" "core_mps"   "featureSet" "level_dict" "pmfeature"
[6] "table_info" "type_dict"
 > dbGetQuery(con, "select * from type_dict;")
   type                   type_id
1    1                      main
2    2             control->affx
3    3             control->chip
4    4 control->bgp->antigenomic
5    5     control->bgp->genomic
6    6            normgene->exon
7    7          normgene->intron
8    8  rescue->FLmRNA->unmapped

So let's say we want to call just bgp probes background. Now I happen to 
know we want the featureSet table, but you can look to see what is in 
each table using dbListFields()

 > dbListFields(con, "featureSet")
  [1] "fsetid"                "strand"                "start"
  [4] "stop"                  "transcript_cluster_id" "exon_id"
  [7] "crosshyb_type"         "level"                 "chrom"
[10] "type"

You can also do something like

dbGetQuery(con, "select * from featureSet limit 10;")

to get an idea what is in a given table. So to get the probesets we want,

 > x <-dbGetQuery(con, "select fsetid from featureSet where type in 
('4','5');")
 > head(x)
    fsetid
1 7892601
2 7892698
3 7892756
4 7892815
5 7892916
6 7892943

Now there may be a further complication that I don't have the time or 
desire to check out. These are probeset level IDs, and most people do 
things at the transcript level (and if you are doing gcrma() this is all 
you can do). So I don't know if you need to convert these fsetids to 
meta_fsetids, which are transcript level probesets. If so, you can map 
from fsetid to meta_fsetid using the core_mps table.

I leave doing that mapping up to you, grasshopper. As a further test of 
your SQL awesomeness, you could figure out how to use an inner join 
statement so you can get the meta_fsetids in one database query.

Knowing how to do that sort of thing can come in really handy - you can 
even link different .db packages to do cross-database queries, which can 
make your life much better if you need to do some complex mappings. 
There are some examples in one of the AnnotationDbi vignettes.

Best,

Jim




>
> Thanks and best wishes,
> Rich
>
>
> On Jun 11, 2012, at 11:18 AM, James W. MacDonald wrote:
>
>> Hi Suparna,
>>
>> On 6/11/2012 11:00 AM, suparna mitra wrote:
>>> Hi,
>>>  I am very new to biocondunctor and microaray. I have limited 
>>> experience
>>> with R.
>>> I am trying to use biocondunctor for analyzing HuGene-1_0-st-v1 
>>> microarray
>>> data. I selectected different normalization method (rma, gcrma and 
>>> mas5).
>>> For my data rma worked but for for gcrma and mas5 both I have problem.
>>
>> This is to be expected. The HuGene array is a PM-only design, so 
>> mas5() won't work (because the mas5 algorithm requires subtracting MM 
>> from PM, and there are no MM probes). In addition, the default for 
>> gcrma() is to estimate the background for probes based on the GC 
>> content, using bins of MM probes. Again, without any MM probes, this 
>> won't work.
>>
>> Note however that gcrma() has an 'NCprobe' argument that you can use 
>> to specify an index of negative control probes. This is a non-trivial 
>> thing to do, and may be beyond your abilities if you are very new to 
>> R and BioC.
>>
>> To get the index of these probes, you will need to decide which 
>> probes are negative control probes, and then you can use the 
>> indexProbes() function, passing a character vector of the negative 
>> control probes to the genenames argument. This will return a list of 
>> indices for each probeset that you can unlist() prior to feeding in 
>> to gcrma().
>>
>> Or you could just stick with rma(), like the vast majority of people do.
>>
>> Best,
>>
>> Jim
>>
>>
>>> For gcrma it gives me a error like: Computing affinitiesError:
>>> length(prlen) == 1 is not TRUE
>>>
>>> And for mas 5 it seems working but I get only a whole list of NA.
>>>
>>> Here is what I have done.
>>>
>>>> mydata<- ReadAffy()
>>>> mydata
>>> AffyBatch object
>>> size of arrays=1050x1050 features (16 kb)
>>> cdf=HuGene-1_0-st-v1 (32321 affyids)
>>> number of samples=18
>>> number of genes=32321
>>> annotation=hugene10stv1
>>>
>>>> eset<- rma(mydata)
>>> Background correcting
>>> Normalizing
>>> Calculating Expression
>>>> eset_justrma=justRMA()
>>>> eset_mas5<- mas5(mydata)
>>> background correction: mas
>>> PM/MM correction : mas
>>> expression values: mas
>>> background correcting...done.
>>> 32321 ids to be processed
>>> |                    |
>>> |####################|
>>>> eset_gcrma<- gcrma(mydata)
>>> Adjusting for optical effect..................Done.
>>> Computing affinitiesError: length(prlen) == 1 is not TRUE   Here is the
>>> error
>>>
>>>> eset_justrma # this worked fine
>>> ExpressionSet (storageMode: lockedEnvironment)
>>> assayData: 32321 features, 18 samples
>>>   element names: exprs, se.exprs
>>> protocolData
>>>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL 
>>> MC10_(HuGene-1_0-st-v1).CEL ...
>>> MC9_(HuGene-1_0-st-v1).CEL (18 total)
>>>   varLabels: ScanDate
>>>   varMetadata: labelDescription
>>> phenoData
>>>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL 
>>> MC10_(HuGene-1_0-st-v1).CEL ...
>>> MC9_(HuGene-1_0-st-v1).CEL (18 total)
>>>   varLabels: sample
>>>   varMetadata: labelDescription
>>> featureData: none
>>> experimentData: use 'experimentData(object)'
>>> Annotation: hugene10stv1
>>>> eset_mas5 # this seems worked fine but resulted all NA
>>> ExpressionSet (storageMode: lockedEnvironment)
>>> assayData: 32321 features, 18 samples
>>>   element names: exprs, se.exprs
>>> protocolData
>>>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL 
>>> MC10_(HuGene-1_0-st-v1).CEL ...
>>> MC9_(HuGene-1_0-st-v1).CEL (18 total)
>>>   varLabels: ScanDate
>>>   varMetadata: labelDescription
>>> phenoData
>>>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL 
>>> MC10_(HuGene-1_0-st-v1).CEL ...
>>> MC9_(HuGene-1_0-st-v1).CEL (18 total)
>>>   varLabels: sample
>>>   varMetadata: labelDescription
>>> featureData: none
>>> experimentData: use 'experimentData(object)'
>>> Annotation: hugene10stv1
>>>> write.exprs(eset_justrma,file="eset_justrma.csv")
>>>> write.exprs(eset_mas5,file="eset_mas5.csv")
>>>> write.exprs(eset,file="eset.csv")
>>> Any help in this will be really great. Being a novice, I am very 
>>> sorry if I
>>> am doing any silly mistake.
>>> Thanks a lot,
>>> Suparna.
>>>
>>
>> -- 
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list