[BioC] makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array

Sun May 11 15:28:47 CEST 2008

On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
> Dear all,
>
> I'm currently looking at some Mouse Gene 1.0 ST arrays, and have used
> the makecdfenv package to build a cdf environment based on the file
> MoGene-1_0-st-v1.r3.cdf from the Affymetrix website.
>
> That worked without any problems, but out of curiosity I tried taking
> a closer look at the format of the array, to see how many probes were
> in each probe set etc.
>
> I'm aware that some probes map to multiple probe sets and are removed
> when the cdfenv is produced, which seems to be the case for about 8%
> of the probes. My question is how exactly this happens. I would
> expect the multiple-mapping probes to be removed from all probe sets,
> but this doesn't seem to be the case.

I believe each probe that maps to multiple probe sets is kept in just
one of them: either the first or the last probe set encountered (I am
not sure which).  Someone with more affy experience can comment more
fully.
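
To make the "first or last" distinction concrete, here is a toy sketch
in base R. It is NOT makecdfenv's actual code; it only assumes that
each array cell can end up in a single probe set and that probe sets
are processed in a fixed order. The function name assign_cells and the
probe sets A and B are made up for illustration.

```r
# Toy model only (an assumption, not makecdfenv's implementation):
# every cell index can be owned by exactly one probe set.
assign_cells <- function(probe_sets, keep = c("last", "first")) {
  keep <- match.arg(keep)
  owner <- character(0)                  # named vector: cell -> probe set
  for (ps in names(probe_sets)) {
    cells <- as.character(probe_sets[[ps]])
    if (keep == "first") {
      # skip cells that an earlier probe set has already claimed
      cells <- cells[!(cells %in% names(owner))]
    }
    owner[cells] <- ps                   # "last": overwrites earlier claims
  }
  split(as.integer(names(owner)), factor(owner, levels = names(probe_sets)))
}

# Two hypothetical probe sets sharing cells 3 and 4
toy <- list(A = c(1, 2, 3, 4), B = c(3, 4, 5))
res_last  <- assign_cells(toy, "last")   # A: 1 2      B: 3 4 5
res_first <- assign_cells(toy, "first")  # A: 1 2 3 4  B: 5
```

Under either rule a shared cell survives in exactly one probe set,
which would explain why the multi-mapping probes are not dropped from
every probe set.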

Sean

> Here is an example with the two overlapping probe sets 10344719 and
> 10353008. "raw" is my AffyBatch, "cdf" is the raw CDF file converted
> to tab-delimited text and read into R, and "INDEX" is a unique probe
> identifier (equal to the index in the cdf env minus 1):
>
>  > cdf[cdf$QUAL=="10344719","INDEX"]
>  [1]    7543  661828  575792  962890  963940  140756  337977  510591  860722  968182  387524  386474
> [13]  385518  384468 1076441 1075391  850724   51881  957657  100610  862535  506651  505601   82272
> [25]   83322  692860  691810  494417  932343  689216  836826  894914  715393  421443   92496  485600
> [37]  253868  352083  594288 1049892  370822  369772  416675  928371  505790  506840  135781
>  > cdf[cdf$QUAL=="10353008","INDEX"]
>  [1]  506840  505790  928371  416675  369772  370822 1049892  485600   92496  421443  715393  894914
> [13] 1073586  110809  836826  689216  932343  494417  691810   83322   82272  505601  506651  862535
> [25]  100610  957657   51881  850724 1075391 1076441  384468  385518  386474  387524  968182  860722
> [37]  510591  337977  140756  963940  962890  575792  661828    7543
>  > indexProbes(raw, genenames="10344719")
> $`10344719`
> [1] 692861 253869 352084 594289 135782
>  > indexProbes(raw, genenames="10353008")
> $`10353008`
>  [1]  506841  505791  928372  416676  369773  370823 1049893  485601   92497  421444  715394  894915
> [13] 1073587  110810  836827  689217  932344  494418  691811   83323   82273  505602  506652  862536
> [25]  100611  957658   51882  850725 1075392 1076442  384469  385519  386475  387525  968183  860723
> [37]  510592  337978  140757  963941  962891  575793  661829    7544
>
> So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of
> which are shared. In the cdf environment, 10344719 has had the 42
> shared probes removed (only 5 remain), but they are all still present
> in 10353008.
>
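
These counts can be reproduced in base R from the INDEX values listed
above. A minimal sketch, with the vectors typed in by hand (the
variable names are mine) and assuming, as stated above, that the cdf
environment index equals INDEX + 1:

```r
# INDEX values (0-based) copied from the cdf listing above
ps_10344719 <- c(
  7543, 661828, 575792, 962890, 963940, 140756, 337977, 510591,
  860722, 968182, 387524, 386474, 385518, 384468, 1076441, 1075391,
  850724, 51881, 957657, 100610, 862535, 506651, 505601, 82272,
  83322, 692860, 691810, 494417, 932343, 689216, 836826, 894914,
  715393, 421443, 92496, 485600, 253868, 352083, 594288, 1049892,
  370822, 369772, 416675, 928371, 505790, 506840, 135781)
ps_10353008 <- c(
  506840, 505790, 928371, 416675, 369772, 370822, 1049892, 485600,
  92496, 421443, 715393, 894914, 1073586, 110809, 836826, 689216,
  932343, 494417, 691810, 83322, 82272, 505601, 506651, 862535,
  100610, 957657, 51881, 850724, 1075391, 1076441, 384468, 385518,
  386474, 387524, 968182, 860722, 510591, 337977, 140756, 963940,
  962890, 575792, 661828, 7543)

# Indices (1-based) reported by indexProbes() for 10344719
env_10344719 <- c(692861, 253869, 352084, 594289, 135782)

shared      <- intersect(ps_10344719, ps_10353008)
only_first  <- setdiff(ps_10344719, ps_10353008)
only_second <- setdiff(ps_10353008, ps_10344719)

length(shared)                          # 42
length(only_first)                      # 5
length(only_second)                     # 2
# What survives in the cdf env for 10344719 is exactly its unique probes
setequal(only_first + 1, env_10344719)  # TRUE
```

Here the shared probes ended up in 10353008. That would fit a "last
one seen wins" rule if the CDF lists probe sets in increasing ID
order, but I have not checked that ordering.
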
> A similar situation is seen for e.g. the overlapping probe sets
> 10461391 and 10487930 with 41 probes each, 40 of which are identical:
>
>  > cdf[cdf$QUAL=="10461391","INDEX"]
>  [1]  483268 1022846  409057  703153  328783  372162  882399  569942  765746  868615  948367  413614
> [13]  830931  434763  970910  600221  599171  135798    6746  455659  799186  912319  469313  145393
> [25]  872191  126758  801051  774196  773146  965810  272742   19445  585800  999188 1012776  823868
> [37]  156514  210874  645037  799505 1075142
>  > cdf[cdf$QUAL=="10487930","INDEX"]
>  [1] 1075142  799505  645037  210874  156514  823868 1012776  999188  585800   19445  272742  965810
> [13]  773146  774196  801051  126758  872191  145393  469313  912319  799186  839098    6746  135798
> [25]  599171  600221  970910  434763  830931  413614  948367  868615  765746  569942  882399  372162
> [37]  328783  703153  409057 1022846  483268
>  > indexProbes(raw, genenames="10461391")
> $`10461391`
> [1] 455660
>  > indexProbes(raw, genenames="10487930")
> $`10487930`
>  [1] 1075143  799506  645038  210875  156515  823869 1012777  999189  585801   19446  272743  965811
> [13]  773147  774197  801052  126759  872192  145394  469314  912320  799187  839099    6747  135799
> [25]  599172  600222  970911  434764  830932  413615  948368  868616  765747  569943  882400  372163
> [37]  328784  703154  409058 1022847  483269
>
> Any comments on this or on exactly how the cdf environment is created
> would be much appreciated.
>
> Thanks
> \Heidi
>
>  > sessionInfo()
> R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
> i386-apple-darwin8.10.1
>
> locale:
> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] tools     stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] makecdfenv_1.17.0    affy_1.17.3          preprocessCore_1.1.5 affyio_1.7.17
> [5] Biobase_1.99.4
>
>
> ------------<<>>------------
> Heidi Dvinge
>
> EMBL-European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton, Cambridge
> CB10 1SD
> Mail: heidi at ebi.ac.uk
> Phone: +44 (0) 1223 494 444
> ------------<<>>------------