[BioC] makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array

Sean Davis sdavis2 at mail.nih.gov
Sun May 11 16:56:22 CEST 2008


On Sun, May 11, 2008 at 10:38 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
>
> On 11 May 2008, at 14:28, Sean Davis wrote:
>
>> On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
>>>
>>> Dear all,
>>>
>>> I'm currently looking at some Mouse Gene 1.0 ST arrays, and have used
>>> the makecdfenv package to build a cdf environment based on the file
>>> MoGene-1_0-st-v1.r3.cdf from the Affymetrix website.
>>>
>>> That worked without any problems, but out of curiosity I tried taking
>>> a closer look at the format of the array, to see how many probes were
>>> in each probe set etc.
>>>
>>> I'm aware that some probes map to multiple probe sets and are removed
>>> when the cdfenv is produced, which seems to be the case for about 8%
>>> of the probes. My question is: exactly how does this happen? I would
>>> expect the multiple-mapping probes to be removed from all probe sets,
>>> but this doesn't seem to be the case.
>>
>> I believe that the probes are kept in the first or last probeset (not
>> sure which) seen.  Someone with a little more affy experience can
>> comment more fully.
>>
> I figured it was probably something along those lines, but what's the reason
> for not just removing them completely, instead of keeping them in a 'random'
> probe set? Most probes that map multiple times map to > 2 probe sets. And in
> some cases it's large chunks of probe sets that 'overlap', whereas in other
> cases it's just a few or a single probe that 'jumps around'.

I think this probe "removal" is a side effect of the way the original
affy package and affy chips were designed.  Before these newer arrays,
no probe mapped to more than one probe set, so there was never a
mechanism for "removing" probes or for maintaining multiple mappings.
So, if I understand it correctly, the current behavior arises because
there is no way to maintain the many-to-many mapping; it is not
optimal in any particular way.  Again, someone with more affy
experience might have more to say.
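
To make the point concrete, here is a minimal sketch in base R of what
happens when a many-to-many table is pushed through a structure that
allows only one probe set per probe.  It is NOT the makecdfenv code;
the probe sets "A" and "B", the indices, and the "last seen wins" rule
are all assumptions chosen for illustration (I am not sure whether the
first or the last probe set actually wins).

```r
# Toy long-format table: one row per (probe set, probe index) pair.
# Probe sets "A" and "B" and all indices are made up for illustration.
cdf_toy <- data.frame(
  QUAL  = c("A", "A", "A", "B", "B", "B"),
  INDEX = c(1L, 2L, 3L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)

# Many-to-many view: each probe set keeps every probe listed for it.
full <- split(cdf_toy$INDEX, cdf_toy$QUAL)

# One-slot-per-probe view: a probe can name only one owner, so a later
# assignment overwrites an earlier one (assumed "last seen wins").
owner <- character(0)
for (i in seq_len(nrow(cdf_toy))) {
  owner[as.character(cdf_toy$INDEX[i])] <- cdf_toy$QUAL[i]
}

# Rebuild the probe sets from the one-slot structure.
collapsed <- split(as.integer(names(owner)), unname(owner))

# Probes listed for both sets in the original table.
shared <- intersect(full$A, full$B)
```

Here the shared probes 2 and 3 end up only in "B", and "A" is left
with its single unique probe, which is the same pattern as in the
examples quoted below.  Dropping shared probes from every probe set
would need an explicit extra step, e.g. removing all rows whose INDEX
is duplicated before building the environment.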

Sean


>>> Example with the two overlapping probe sets 10344719 and 10353008,
>>> where "raw" is my AffyBatch, "cdf" is the raw cdf file converted
>>> to tab-delimited form and read into R, and "INDEX" is a unique probe
>>> identifier (one less than the corresponding index in the cdf env):
>>>
>>>> cdf[cdf$QUAL=="10344719","INDEX"]
>>>
>>>  [1]    7543  661828  575792  962890  963940  140756  337977  510591  860722  968182  387524  386474
>>> [13]  385518  384468 1076441 1075391  850724   51881  957657  100610  862535  506651  505601   82272
>>> [25]   83322  692860  691810  494417  932343  689216  836826  894914  715393  421443   92496  485600
>>> [37]  253868  352083  594288 1049892  370822  369772  416675  928371  505790  506840  135781
>>>>
>>>> cdf[cdf$QUAL=="10353008","INDEX"]
>>>
>>>  [1]  506840  505790  928371  416675  369772  370822 1049892  485600   92496  421443  715393  894914
>>> [13] 1073586  110809  836826  689216  932343  494417  691810   83322   82272  505601  506651  862535
>>> [25]  100610  957657   51881  850724 1075391 1076441  384468  385518  386474  387524  968182  860722
>>> [37]  510591  337977  140756  963940  962890  575792  661828    7543
>>>>
>>>> indexProbes(raw, genenames="10344719")
>>>
>>> $`10344719`
>>> [1] 692861 253869 352084 594289 135782
>>>>
>>>> indexProbes(raw, genenames="10353008")
>>>
>>> $`10353008`
>>>  [1]  506841  505791  928372  416676  369773  370823 1049893  485601   92497  421444  715394  894915
>>> [13] 1073587  110810  836827  689217  932344  494418  691811   83323   82273  505602  506652  862536
>>> [25]  100611  957658   51882  850725 1075392 1076442  384469  385519  386475  387525  968183  860723
>>> [37]  510592  337978  140757  963941  962891  575793  661829    7544
>>>
>>> So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of
>>> which are shared. In the cdf environment, 10344719 appears to have
>>> had the 42 shared probes removed, but they're still present in
>>> 10353008.
>>>
>>> A similar situation is seen for e.g. the overlapping probe sets
>>> 10461391 and 10487930 with 41 probes each, 40 of which are identical:
>>>
>>>> cdf[cdf$QUAL=="10461391","INDEX"]
>>>
>>>  [1]  483268 1022846  409057  703153  328783  372162  882399  569942  765746  868615  948367  413614
>>> [13]  830931  434763  970910  600221  599171  135798    6746  455659  799186  912319  469313  145393
>>> [25]  872191  126758  801051  774196  773146  965810  272742   19445  585800  999188 1012776  823868
>>> [37]  156514  210874  645037  799505 1075142
>>>>
>>>> cdf[cdf$QUAL=="10487930","INDEX"]
>>>
>>>  [1] 1075142  799505  645037  210874  156514  823868 1012776  999188  585800   19445  272742  965810
>>> [13]  773146  774196  801051  126758  872191  145393  469313  912319  799186  839098    6746  135798
>>> [25]  599171  600221  970910  434763  830931  413614  948367  868615  765746  569942  882399  372162
>>> [37]  328783  703153  409057 1022846  483268
>>>>
>>>> indexProbes(raw, genenames="10461391")
>>>
>>> $`10461391`
>>> [1] 455660
>>>>
>>>> indexProbes(raw, genenames="10487930")
>>>
>>> $`10487930`
>>>  [1] 1075143  799506  645038  210875  156515  823869 1012777  999189  585801   19446  272743  965811
>>> [13]  773147  774197  801052  126759  872192  145394  469314  912320  799187  839099    6747  135799
>>> [25]  599172  600222  970911  434764  830932  413615  948368  868616  765747  569943  882400  372163
>>> [37]  328784  703154  409058 1022847  483269
>>>
>>> Any comments on this or on exactly how the cdf environment is created
>>> would be much appreciated.
>>>
>>> Thanks
>>> \Heidi
>>>
>>>> sessionInfo()
>>>
>>> R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
>>> i386-apple-darwin8.10.1
>>>
>>> locale:
>>> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>>
>>> attached base packages:
>>> [1] tools     stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] makecdfenv_1.17.0    affy_1.17.3          preprocessCore_1.1.5 affyio_1.7.17
>>> [5] Biobase_1.99.4
>>>
>>>
>>> ------------<<>>------------
>>> Heidi Dvinge
>>>
>>> EMBL-European Bioinformatics Institute
>>> Wellcome Trust Genome Campus
>>> Hinxton, Cambridge
>>> CB10 1SD
>>> Mail: heidi at ebi.ac.uk
>>> Phone: +44 (0) 1223 494 444
>>> ------------<<>>------------
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>


