[BioC] How can I remove control probesets from the expressionset object in gene expression analysis with Affy Human Gene 1.0ST microarray

Tue Jun 21 22:37:50 CEST 2011

Hi Virginia,

On 6/21/2011 6:46 AM, Virginia Garcia wrote:
> Dear list,
>
> I am quite new to R as well as to microarray analysis.
> I am dealing with some gene expression analysis performed on Affymetrix Human
> Gene 1.0ST microarray.
>
> So far, I have learnt how to filter data with genefilter, using the nsFilter
> functions.
>
> Now, I would like to know how to filter out from the expressionset object all
> the control probesets (~4000) that Affymetrix includes in the microarray (for
> quality assay, normalization, background correction, etc.). However, none of
> the aforementioned functions worked for me.
>
> How can I recognize those probesets and remove them? I would like to filter
> them out before statistical analysis with limma package.

How much do you like database stuff? Lots? Great, I have some fun for you.

I'll assume you have pd.hugene.1.0.st.v1 installed. I have 1.1 installed, 
so that is what I load below, but the queries will be the same.

 > library(pd.hugene.1.1.st.v1)

First, get a connection to the database

 > con <- db(pd.hugene.1.1.st.v1)

Now, what's in this thing?

 > dbListTables(con)
[1] "chrom_dict" "core_mps"   "featureSet" "level_dict" "pmfeature"
[6] "table_info" "type_dict"

OK, let's dig.

 > dbGetQuery(con, "select * from pmfeature limit 5;")
       fid  fsetid atom   x    y
1  704656 7892501    1 765  711
2 1060101 7892501    2 800 1070
3 1046459 7892501    3  28 1057
4  403586 7892501    4 655  407
5  473527 7892502    5 306  478

Boring.

 > dbGetQuery(con, "select * from featureSet limit 5;")
   fsetid strand start stop transcript_cluster_id exon_id crosshyb_type level
1 7892501     NA     0    0                     0       0             0    NA
2 7892502     NA     0    0                     0       0             0    NA
3 7892503     NA     0    0                     0       0             0    NA
4 7892504     NA     0    0                     0       0             0    NA
5 7892505     NA     0    0                     0       0             0    NA
  chrom type
1    NA    6
2    NA    7
3    NA    7
4    NA    7
5    NA    7

Maybe more interesting. What's this 'type' business?

 > dbGetQuery(con, "select * from type_dict limit 5;")
   type                   type_id
1    1                      main
2    2             control->affx
3    3             control->chip
4    4 control->bgp->antigenomic
5    5     control->bgp->genomic

Now that looks like some reasonable info. What different types are there?

 > dbGetQuery(con, "select * from type_dict;")
   type                   type_id
1    1                      main
2    2             control->affx
3    3             control->chip
4    4 control->bgp->antigenomic
5    5     control->bgp->genomic
6    6            normgene->exon
7    7          normgene->intron
8    8  rescue->FLmRNA->unmapped

So it looks like pretty much everything except type 1 is a control of 
some kind.

 > tab <- dbGetQuery(con, "select * from featureSet;")
 > table(tab$type)

      1      2      4      6      7
253002     57     45   1195   2904

So that's about 4200 control probesets (types 2, 4, 6 and 7).

How to subset from here depends on the package you are using for 
analysis (oligo, affy, xps), so I won't go into that. But you can now 
get the IDs of the probesets you care about and use them to filter.
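
As a rough illustration of that last step, here is a minimal sketch in base R. 
The helper name isMainProbeset is made up for this example and is not part of 
any package. It assumes the feature names of your ExpressionSet are the fsetid 
values from the featureSet table; that should hold for probeset-level 
summaries, but check it against your own object, especially for 
transcript-level summaries. The control types (2, 4, 6, 7) are simply the ones 
tallied above.

```r
# Hypothetical helper (not from any package): returns TRUE for features that
# are NOT control probesets, given the featureSet table queried above.
# Assumption: featureIds (e.g. featureNames(eset)) hold fsetid values.
isMainProbeset <- function(featureIds, featureTable,
                           controlTypes = c(2, 4, 6, 7)) {
  isControl <- featureTable$type %in% controlTypes
  # Go through integer so large IDs are never printed in scientific notation.
  controlIds <- as.character(as.integer(featureTable$fsetid[isControl]))
  !(as.character(featureIds) %in% controlIds)
}

# Possible usage, assuming 'con' is the connection opened earlier and 'eset'
# is your own ExpressionSet (both names are placeholders here):
#   tab      <- dbGetQuery(con, "select fsetid, type from featureSet;")
#   keep     <- isMainProbeset(featureNames(eset), tab)
#   esetMain <- eset[keep, ]
```

Comparing sum(!keep) with the counts from table(tab$type) is a quick way to 
see whether the IDs really line up before passing the filtered object on to 
limma.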

Best,

Jim

>
> Thank you very much in advance for your help.
>
> Virginia.
>

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826